Spurious correlations in big data
June 1, 2014 7:29 AM

So, "big data" is all the rage among technologists and venture capitalists. Intuition suggests that as the amount of data analyzed increases, so too does the amount of spurious correlations. Have there been any good studies (academic or otherwise) that try to resolve this problem? Or feel free to tell me why my intuition is wrong. Thanks.
posted by dfriedman to Computers & Internet (11 answers total) 30 users marked this as a favorite
Best answer: Wikipedia has a pretty good roundup of this issue, with links to books and studies in the references section.
posted by Salvor Hardin at 7:33 AM on June 1, 2014

Best answer: Can't point to a paper, but hopefully they correct for multiple comparisons. But you do end up with fun stuff like this. Also, people are generally looking for the predictive power of a model (tested via cross-validation) rather than just correlation.
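
To make the multiple-comparisons point concrete, here is a minimal sketch, not anything from a specific paper. It correlates a batch of pure-noise variables with a pure-noise target and counts how many look "significant" before and after a Bonferroni correction. The sample size, the number of variables, the seed, and the Fisher z approximation for the p-value are all assumptions picked for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def p_value_from_r(r, n):
    """Approximate two-sided p-value for H0: rho = 0 via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_hits(n_samples=50, n_features=200, alpha=0.05, seed=0):
    """Correlate pure-noise features with a pure-noise target and count 'discoveries'."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    p_values = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        p_values.append(p_value_from_r(pearson_r(feature, target), n_samples))
    raw_hits = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    bonferroni_hits = sum(p < alpha / n_features for p in p_values)
    return raw_hits, bonferroni_hits


if __name__ == "__main__":
    raw, corrected = count_hits()
    print(f"uncorrected 'significant' correlations: {raw}")
    print(f"after Bonferroni correction: {corrected}")
```

With 200 tests at .05 you would expect roughly 10 uncorrected hits from noise alone; the corrected count can never be higher than the uncorrected one, and is usually zero here.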
posted by supercres at 7:33 AM on June 1, 2014 [1 favorite]

Best answer: I'm not sure that big data necessarily increases the probability of spurious correlations in the strict sense (correlations between two variables that arise because both are caused by a common third variable). I can't think of any reason it would. What big data does increase is the number of meaningless and/or fluke correlations.

Just like with little data, at a cutoff of .05, 1 in 20 tests where nothing is really going on will be alpha errors (false positives). The thing is, of course, that the bigger the data, the smaller the p-value. Things that don't even make the p < .05 cutoff with 1,000 people are p < .001 with a million people. The reason is that p-values are a function of sample size and effect size. So the key is to do the same thing with big data that we should be doing with little data: Stop worshiping the stars (the little stars in the table); stop treating statistical significance like it tells you how much a variable matters (oohh, p is .0000000, this has a big effect!) and start looking at effect sizes themselves.
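
As a rough illustration of how the same effect size goes from "not significant" to "wildly significant" purely because of sample size, here is a small sketch using a pooled two-proportion z-test. The rates (50.5% vs 50.0%) and the equal group sizes are made-up assumptions, not numbers from any study.

```python
import math


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (p1 + p2) / 2
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (p1 - p2) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical observed rates: 50.5% vs 50.0%, a half-point difference.
    for n in (1_000, 1_000_000):
        p = two_proportion_p_value(0.505, 0.500, n)
        print(f"n per group = {n:>9,}  difference = 0.5 points  p = {p:.3g}")
```

The difference is half a percentage point in both runs; only the p-value changes, which is why the effect size is the thing to read.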

Statistical significance is meant to tell you whether a pattern seen in the sample is likely to exist in the (effectively infinite) population from which the sample is drawn. However, as you know, even random arrangements show patterns. So even if there is no alpha error (it really is true that in the whole country people with more split ends are .0000061% more likely to use yellow post-it notes instead of blue), that doesn't mean the pattern means anything. I mean, wouldn't it be freaky if the exact same proportion of split-end people and non-split-end people used yellow post-it notes? How likely would that be? Of course they're different! That they're different (which is all the significance test tells you) doesn't matter.

What matters is that the difference is .0000061%. That means that 48.345% of non-split-end people use yellow post-it notes and 48.3450061% of split-end people do. The significance test says it's true in the population, but who really cares about an effect size that small? Not even 3M gives a crap about this. This isn't just a big data solution but something that needs to be done with all data: Statistical significance is not substantive significance. This is lesson one in any stats class on inference (making judgements about populations from samples): Don't imagine that it matters just because it's statistically significant. It's less of a danger with smaller data because small effects are less likely to have small p-values there, but even in small data, this matters.
posted by If only I had a penguin... at 8:10 AM on June 1, 2014 [4 favorites]

Best answer: Nate Silver's book, The Signal and the Noise, discusses this.
posted by oceano at 11:28 AM on June 1, 2014

Best answer: This has been discussed in the context of A/B testing on websites. See Evan Miller's How Not to Run an A/B Test and this post on the blue.

Recent discussions in the scientific community around p-hacking may also be of interest.

Slightly farther afield, you might enjoy this April blog post: The Control Group is Out of Control.
posted by zachlipton at 12:47 PM on June 1, 2014

Best answer: That Wikipedia page that Salvor Hardin linked to is good. The key thing to keep in mind is that you will invariably make a Type I error if you look at a large enough number of correlations. What you can do, however, is control the ratio of true detections to false detections when looking at a big data set. There's further description here.
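
A minimal sketch of both halves of that, under stated assumptions: the first function assumes independent tests of true null hypotheses, and the second uses the Benjamini-Hochberg step-up rule, which is one common way to control the false discovery rate and may or may not be the method the linked description covers. The p-values in the example are made up.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    for m in (1, 20, 100, 1000):
        rate = familywise_error_rate(0.05, m)
        print(f"{m:>5} tests -> P(at least one Type I error) = {rate:.3f}")
    # Made-up p-values for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.9]
    print("rejected indices:", benjamini_hochberg(example))
```

At 20 independent tests the chance of at least one false positive is already about 64%, and by 1,000 tests it is effectively certain, which is the "invariably" part.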
posted by zscore at 1:03 PM on June 1, 2014

Best answer: Not an answer as such, but lack of meaningful correlations is also a curse of dimensionality.
posted by ElliotH at 1:49 PM on June 1, 2014

Best answer: Echoing the recommendation of The Signal and the Noise. Silver draws on a lot of fields for his examples--one* that really stuck out to me was that for a long time, there was a ~95% correlation between whether an AL or NL team won the World Series and whether the stock market was rising or falling that year. One would have to be superstitious and/or dumb to believe they're related, but people did.

*I may be getting the sport and the indicator wrong, but it's in the ballpark.
posted by psoas at 2:54 PM on June 1, 2014

Best answer: That's the Super Bowl indicator.
posted by dfan at 4:24 PM on June 1, 2014 [1 favorite]

Best answer: "Machine Learning" is happy when the results are better than random. "Statistics" isn't happy until the null hypothesis can be safely rejected. These are different sets of requirements, and most uses fall into the "machine learning" bucket.

Also, people mean different things by "Big Data". Sometimes they mean lots of dense single-variable data (a really accurate time series and the like). Sometimes they mean lots of high-dimensional data (e.g. what Facebook knows about its users). The second is quite vulnerable to generating meaningless correlations (Spurious Correlations), while the first is actually vulnerable to rejecting all proposed models, because all models are slightly wrong and with enough data their wrongness becomes an unignorable statistical fact.
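
The high-dimensional case can be sketched like this (a toy illustration, with every size and the seed chosen arbitrarily): generate many columns of pure noise, pick the one that correlates best with a pure-noise target on a small set of rows, then re-check that same column on rows that were not used to pick it.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_noise_feature(n_train=40, n_test=1000, n_features=500, seed=1):
    """Pick the noise feature most correlated with a noise target, then re-check it."""
    rng = random.Random(seed)
    n_total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n_total)]
    best_r_train, best_feature = 0.0, None
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_total)]
        r_train = pearson_r(feature[:n_train], target[:n_train])
        if abs(r_train) > abs(best_r_train):
            best_r_train, best_feature = r_train, feature
    # Same feature, but on rows that played no part in selecting it.
    r_test = pearson_r(best_feature[n_train:], target[n_train:])
    return best_r_train, r_test


if __name__ == "__main__":
    r_train, r_test = best_noise_feature()
    print(f"best of 500 noise features, selection rows: r = {r_train:+.2f}")
    print(f"same feature, held-out rows:                r = {r_test:+.2f}")
```

The winning column looks like a respectable correlation on the rows used to find it and collapses toward zero on the held-out rows, since there was never any relationship to find.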

People worry about this, but many (most?) usages of "big data" are with the aim of making a decision, rather than proving a hypothesis. So users of these methods are generally okay as long as the result is better than random. This approach makes trained statisticians angry, but it's still how the methods are used.
posted by pmb at 5:45 PM on June 1, 2014

Response by poster: Thanks for all these insightful answers.
posted by dfriedman at 4:46 AM on June 2, 2014
