Detect Missing Records in Dataset

I have a dataset that contains several measures from various widgets on a daily basis. While the set of widgets remains relatively stable over time, there are sometimes legitimate reasons for one widget to disappear and another to appear in the data. Occasionally, however, a widget will simply drop out, leaving the dataset incomplete and invalidating the whole dataset for that day.

What I am looking for is a method of comparing the current set of widgets with another set of widgets to detect whether any widgets are missing. I am not trying to impute the values, just identify that they are missing. I could use time-series methods, but that feels like overkill given how many widgets there are, and there are multiple attributes on which data might be missing. I was hoping for something more set-based that could account for the regular changes in widgets while still detecting the unusual dropouts. I am sure I just need to adjust the way I am thinking about the problem.

Any ideas would be much appreciated.

Topic: missing-data, time-series, machine-learning

Category: Data Science


One option is hashing, i.e. assigning a numerical value to each widget. The ideal case is when each widget already has a unique id, something like a serial number. If a widget does not have an inherent unique identifier, a hash value can be created by applying a hash function to the widget's features.

After creating a hash value for each widget, a set comparison (e.g., a set difference) can be used to identify which widgets are present in one day's data but missing from the other.
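A minimal sketch of this idea in Python, assuming the daily data arrives as pandas DataFrames; the column names ("widget_id", "location") and the two example frames are hypothetical stand-ins for the real data:

```python
import hashlib
import pandas as pd

# Hypothetical daily snapshots: each row is one widget observed on that day.
yesterday = pd.DataFrame({
    "widget_id": ["A1", "B2", "C3", "D4"],
    "location": ["north", "south", "east", "west"],
})
today = pd.DataFrame({
    "widget_id": ["A1", "B2", "C3"],
    "location": ["north", "south", "east"],
})

def widget_hash(row, feature_cols):
    """Build a stable hash from a widget's identifying features.

    If a unique id exists, hashing that alone is enough; otherwise
    concatenate whichever feature columns identify the widget.
    """
    key = "|".join(str(row[col]) for col in feature_cols)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

feature_cols = ["widget_id", "location"]
yesterday_set = set(yesterday.apply(widget_hash, axis=1, feature_cols=feature_cols))
today_set = set(today.apply(widget_hash, axis=1, feature_cols=feature_cols))

# Set differences: widgets that dropped out versus widgets that newly appeared.
missing = yesterday_set - today_set
appeared = today_set - yesterday_set

print(f"{len(missing)} widget(s) missing today, {len(appeared)} new widget(s)")
```

Since both regular turnover and unusual dropouts show up in the `missing` set, one simple follow-up is to flag a day only when the number of missing widgets exceeds the turnover you normally expect.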
