Strategies for analyzing/visualizing data points for anomalies
October 9, 2017 12:17 PM   Subscribe

I have a bunch of devices (~100K) that routinely report back their status to a central location. Broadly, the state of the device can be further clarified as either BAD or GOOD. While it is trivial to determine if a device is always good or always bad it is hard to tell how serious a transient amount of bad reports are to be expected. What are some strategies, books, techniques, approaches, etc. I should be investigating to gain insight into this data set.

Roughly the data is in a SQL Server database in the format of:

DeviceName, DateTimeStamp, Status

where each unique device may report back a variable amounts of time per day including zero times. I could transform this into some other format if it facilitated analysis/reporting.

As far as my background goes, I'm a fairly decent python/powershell/C# programmer and I am comfortable/competent with SQL although I've never gotten too deep in data warehousing/analysis services techniques. I know the basics of statistics but its been 15+ years since that class in college and I haven't had to do anything beyond rudimentary things in the intervening years.

I'm open to using/learning new technology to approach this problem whether it be something like R or machine learning, etc. I have a corporate subscription to Safari Books Online so book suggestions from there are more than welcome.
posted by mmascolino to Computers & Internet (10 answers total) 4 users marked this as a favorite
I don't know what I'm talking about, but I do know that a Poisson distribution deals with the "probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known constant rate and independently of the time since the last event." So if you found the Poisson distribution for a normal device, departures from that distribution might tell you that a device is behaving abnormally.

Numpy/scipy have Poisson tools built in.

Hopefully someone who does know what they're talking about chimes in.
posted by clawsoon at 12:49 PM on October 9

Is the goal retrospective (after-the-fact) or online (as-it-happens) anomaly detection? Like, can you look back on a day to determine how many (or what ratio of) BADs constitute a bad sensor, or do you need to know "Okay, there's one bad reading, that's probably okay; whoops, there's another one, this is starting to look anomalous..."?
posted by supercres at 12:53 PM on October 9 [1 favorite]

To answer the basic insight question, this is probably what I'd do to avoid any intermediate data or unneeded languages:

0. Install pandas and get it to talk to the sql server
1. Load in data, making sure the second column is read as a timestamp/datetime, then make a 2-level index of device name + timestamp
2. Change each GOOD to a 0 and BAD to a 1, or vice versa
3. Rolling mean of that binary column over the timestamp column within each device (probably a groupby with a lambda function). You'll have to play with the window length.
4. Figure out what score for the result of (3) constitutes a bad sensor. If you have labels for good vs bad sensor, you could do a histogram of those scores by that label.

Sorry, I stopped linking to the pandas docs above, but can tell you the classes for any of those steps if you decide to go that route.
posted by supercres at 1:05 PM on October 9 [1 favorite]

I would recommend you take a look at apdex , I have seen it used very succesfully. It's a generic index that is easy to calculate (does not require any fancy stats package) and this sort of data is perfect for it. You could calculate a reliability index value for each device, or for the aggregate of all of the devices, or both... then you could track this value over time and with a little history you could calculate some weekly / monthly trends, see what else it might correlate to, etc.
posted by machinecraig at 1:43 PM on October 9

While it is trivial to determine if a device is always good or always bad it is hard to tell how serious a transient amount of bad reports are to be expected.

I don't completely understand what you are saying here, but the word "transient" does prompt me to suggest that you make sure you understand the concept of hysteresis. As a state goes from good to bad, or reverse, there may be a period where it is "flickering", and you want to make sure your system (or data analysis) doesn't mess up in the handling of that.
posted by intermod at 2:08 PM on October 9

To me, this sounds like "statistical process control". You could probably use some of those methods to get to what you are looking for.
posted by TomFoolery at 3:30 PM on October 9 [1 favorite]

At a super simple level,

SELECT SUM(DeviceStatus) / COUNT(DeviceStatus) AS FailureRate
FROM MyTable
GROUP BY DeviceName

If you eyeball this and see clearly that a hundred are in the 70% failure range and the other ninety-nine thousand are in the 1% or less failure range then you've found a hundred bad devices pretty quickly.

There are, as the other comments are pointing out implicitly or explicitly, an almost infinite number of ways to do this with more sophistication (in both theory and with tools) and to detect non-obvious failures and efficiently allocate resources for testing and repair.
posted by mark k at 9:35 PM on October 9

The techniques mentioned so far are good, and a software development pattern you might want to check out is the Circuit Breaker pattern. The basic idea is that nodes in partial failure are seen as in a "half open" state, and there are ways to address that programmatically to determine retries, failure state, etc.
posted by mikeh at 7:31 AM on October 10

Debounce logic is what you're looking for.
posted by euphorb at 9:44 PM on October 10

A hidden Markov model might be a decent way to model this. Basically the idea is that there would be two "hidden" states, "actually good" and "actually bad," with a certain probability of transition between them. Each state also has a certain probability of returning a GOOD or BAD report, so that even the "actually good" state will occasionally emit a BAD report (but the "actually bad" state will emit them a lot more consistently). After you've trained the model you should be able to predict the most likely hidden "actual" state of each machine from the reports.

(Strictly this should probably be some kind of continuous time HMM to account for the variable sampling but you might not need anything that fancy if the sampling is pretty frequent relative to the rate at which units fail.)
posted by en forme de poire at 12:23 AM on October 11

« Older Woolly Hubby   |   Boston to Northampton: Car? Bus? Both? Newer »

You are not logged in, either login or create an account to post comments