AI Question
December 6, 2019 7:06 AM

I've been studying up on AI/ML, and believe I understand a fundamental part. Tell me if I'm on-target: Correlation does not imply causation, but with a truly enormous data set, causation can be beside the point.
posted by Quisp Lover to Computers & Internet (19 answers total) 9 users marked this as a favorite
 
Eh, there are worse take-home messages, imo. Of course it's more complicated and the different flavors can be rather different, but when you say 'causation can be beside the point', that gets at one feature and limitation most of these systems have in common: they tend not to illuminate anything so much as spit out answers, which may be useful but are never really supported in any traditionally accepted manner.

This is also why ML/AI are good for making clever appliances and surveillance systems but not really very useful for science or education, where understanding what’s going on is kind of the point.
posted by SaltySalticid at 7:15 AM on December 6, 2019 [9 favorites]


There are plenty of cases in which you don't care whether correlation implies causation: If you're an insurance company, it's valuable to know that people who live in a certain neighborhood have higher accident rates, regardless of whether it's causal. If you are designing a government program that tries to help people by moving them to different neighborhoods, determining causality is crucial.

I don't know what the size of the dataset has to do with that. If you look at data from Atlantic City, NJ, you'll find that monthly sales of ice cream and bathing suits are correlated. If you expand that to every beach town in the world, you'll come to the same conclusion. Neither study tells you anything about causation.
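To make that concrete, here's a toy simulation (invented numbers, assuming numpy is installed): both sales series are driven by temperature, so they correlate strongly no matter how many towns you pool, even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: monthly temperature for many beach towns (the common cause).
n_towns, n_months = 500, 120
temp = rng.normal(15, 10, size=(n_towns, n_months))

# Both sales series respond to temperature, plus their own noise.
ice_cream = 100 + 8 * temp + rng.normal(0, 20, size=temp.shape)
bathing_suits = 40 + 5 * temp + rng.normal(0, 20, size=temp.shape)

# Neither variable appears in the other's equation, yet they correlate strongly,
# and pooling more towns only makes that (non-causal) correlation more precise.
r = np.corrcoef(ice_cream.ravel(), bathing_suits.ravel())[0, 1]
print(f"correlation between ice cream and bathing suit sales: {r:.2f}")
```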
posted by Mr.Know-it-some at 7:39 AM on December 6, 2019 [10 favorites]


There are tons of assumptions in statistics (of which AI/ML is a trendy flavor) that get seriously wonky with conditionals like "when the data set gets large enough".

For the examples Mr KIS mentions, it's not just the size of the dataset, it's the diversity. If you have a data set that includes many examples of all combinations of people who've moved from one neighborhood to the other, across all combinations of socioeconomic and demographic brackets, then you don't need to infer -- you have a full distribution of data describing all situations and can just describe what has happened. (Of course there are time-sensitive variables, so if some microsegment of the population was better off ten years ago, that doesn't mean they'll be better off with the same intervention today.) Needless to say, that dataset does not -- cannot -- exist; if it did, there'd be basically no reason to do the study.

For the bathing suit/ice cream example, the diversity would include beaches where people don't swim but do eat ice cream. The correlation is broken so the causality would lie elsewhere (as it does).

So yes, if you have effectively sampled everything possible, then the causation-correlation distinction goes away. But that's sort of like "as the model approaches omniscience", so it's not a terribly helpful "if".
posted by supercres at 7:52 AM on December 6, 2019


I would say instead that AI/ML is good at finding correlations but fairly useless at determining causation. It’s just not the right tool for that. Causation is on its own axis.
posted by Tell Me No Lies at 7:54 AM on December 6, 2019 [3 favorites]


I would say instead that AI/ML is good at finding correlations but fairly useless at determining causation. It’s just not the right tool for that.

I disagree with this. I don't think it's any worse than statistics writ large. It's just that many people who use it aren't interested in the causal mechanism (though many are).

AI/ML, colloquially, if I'm understanding your context right, generally means "statistics using deep learning". It's a tool, nothing more. It can be misapplied, misused, misinterpreted... but so can generalized linear models ¯\_(ツ)_/¯
posted by supercres at 7:58 AM on December 6, 2019 [1 favorite]


I disagree with this. I don't think it's any worse than statistics writ large. It's just that many people who use it aren't interested in the causal mechanism (though many are).

The problem is that for a machine with 100 inputs it can be extremely difficult to determine how those inputs are being weighted in the output. The results themselves are obvious but the workings of the auto-generated machine are very obscure.

I can see using the generated correlations to help hunt down causation, but extracting that information from the machine itself would be very difficult if not impossible.
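As a rough illustration of that opacity (simulated data, assuming scikit-learn is available; this is a sketch, not a claim about any particular system): give a model 100 correlated inputs and its own notion of which ones matter gets smeared across near-duplicates, even while its predictions stay good.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# 100 inputs, but most are near-copies of 5 underlying signals (think redundant sensors).
n = 2000
latent = rng.normal(size=(n, 5))
X = np.hstack([latent[:, [i % 5]] + 0.1 * rng.normal(size=(n, 1)) for i in range(100)])
y = latent[:, 0] - 2 * latent[:, 1] + rng.normal(scale=0.5, size=n)  # only 2 signals matter

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# The model predicts well, but its importance scores are spread over the ~40
# near-duplicate columns, so reading off how any one input is weighted is hard.
top = np.argsort(model.feature_importances_)[::-1][:10]
print("top inputs:", top.tolist())
print("their importances:", model.feature_importances_[top].round(3).tolist())
```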
posted by Tell Me No Lies at 8:08 AM on December 6, 2019 [4 favorites]


Read Pearl's The Book of Why.

As some have noted, if there's causation, correlation may be beside the point, depending on your purpose.

E.g. when tobacco companies were contesting the hypothesis that smoking causes cancer, in the face of massive evidence of smoking/cancer correlation, they were saying (in bad faith, but still) that reducing smoking would not reduce cancer, since both cancer and smoking were caused by a genetic predisposition. So, for the tobacco companies, for smokers, and for medicine, causality (or not) very much was the point.

On the other hand, I expect that insurance companies seeking to predict cancer risks asked about people's smoking habits early on. In this case, they might not care about causation beyond correlation.
posted by alittleknowledge at 8:24 AM on December 6, 2019 [4 favorites]


Er, first sentence should read: "if there's correlation, causation may be beside the point,"
posted by alittleknowledge at 8:46 AM on December 6, 2019


IME, what AI is best at is finding anomalies and then alerting system owners to them; with enough experience (the ML part), one can 'train' a system to undertake certain steps to remediate the condition the anomalies point to. That's just from my little part of the IT world, though.
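A bare-bones version of that pattern might look like this (made-up server metrics, assuming scikit-learn is available; real monitoring systems are obviously far more involved):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Made-up baseline metrics: requests/sec and error rate under normal operation.
normal = rng.normal(loc=[200.0, 0.01], scale=[20.0, 0.005], size=(5000, 2))
# A few observations that look nothing like the baseline, plus one that does.
incidents = np.array([[900.0, 0.30], [50.0, 0.25], [210.0, 0.009]])

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# predict() returns -1 for points the model flags as anomalous, 1 otherwise;
# the -1s are what you'd alert on (or hand to an automated remediation step).
print(detector.predict(incidents))  # expect roughly [-1, -1, 1]
```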
posted by dbmcd at 9:24 AM on December 6, 2019


I would say that it's not the size of the data but the application domain which determines whether causation is important or not. If you're, e.g., recommending movies to people, then you care about the accuracy of the result (correctly recommending movies to people that they will enjoy), but you don't necessarily care about why your model produces accurate results. If you care about "why", you are probably some kind of scientist and not an ML practitioner.

On preview, I'm seconding alittleknowledge's answer and adding the point that size of data is a separate issue.
posted by jomato at 9:25 AM on December 6, 2019


Best answer: For datasets of any size, this is still often true depending on your goals. I guess the dataset needs to be a big enough sample that statistical approximations still work, but that turns out to be quite small, nowhere near "big data" size.

This is particularly clear in medical applications -- imagine you have a small set of only 100 patients and you want to use ML to build a diagnostic decision tree to predict which patients have infections, by observing their vital signs and so on. Diseases are the cause of symptoms, so you're using a correlation and ignoring the direction of causation, but that's fine, it doesn't matter for your purposes.
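A toy version of that setup, with invented vital-sign numbers (assuming scikit-learn): infection causes the abnormal vitals, but the tree predicts in the reverse direction, vitals to infection, which is exactly what you want for diagnosis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# 100 invented patients: infection status drives temperature, heart rate, WBC count.
infected = rng.integers(0, 2, size=100)
temp = 36.8 + 1.5 * infected + rng.normal(0, 0.4, size=100)
heart_rate = 75 + 20 * infected + rng.normal(0, 8, size=100)
wbc = 7 + 5 * infected + rng.normal(0, 1.5, size=100)
X = np.column_stack([temp, heart_rate, wbc])

# The tree learns symptom -> diagnosis rules from the correlation alone,
# ignoring the causal direction, and that's fine for this purpose.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, infected)
print(export_text(tree, feature_names=["temp_c", "heart_rate", "wbc"]))
```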

On the other hand, if you're trying to predict "what would have happened if I had given these patients medicine X", now you're estimating a counterfactual, and ignoring causality can totally screw you up.

Agree with reading Pearl's "The Book of Why". It will introduce the paradigm of Bayesian networks, which are practically useful in their own right as ML models (although sort of out of fashion post-deep-learning) and explain how they connect to causality.
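To give a flavor of what a Bayesian network encodes, here's a two-node toy with made-up probabilities (plain Python rather than any particular library): the model stores P(disease) and P(symptom | disease), and inference runs against the arrow via Bayes' rule.

```python
# Made-up numbers for a two-node network: Disease -> Symptom.
p_disease = 0.01                     # P(D = 1), the prior
p_symptom_given = {1: 0.90,          # P(S = 1 | D = 1)
                   0: 0.05}          # P(S = 1 | D = 0)

# Inference reverses the arrow with Bayes' rule:
#   P(D=1 | S=1) = P(S=1 | D=1) * P(D=1) / P(S=1)
p_symptom = (p_symptom_given[1] * p_disease
             + p_symptom_given[0] * (1 - p_disease))
p_disease_given_symptom = p_symptom_given[1] * p_disease / p_symptom
print(f"P(disease | symptom) = {p_disease_given_symptom:.3f}")   # ~0.154
```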

Key takeaway is that while correlation does not imply causation, under most circumstances, causation does imply something about correlation.
posted by vogon_poet at 9:27 AM on December 6, 2019 [5 favorites]


Hi. I'm getting my PhD in statistics right now and I study causal inference. I'll say this: No—even if your dataset is large enough, or even if you've effectively sampled everything possible, you still may not be able to draw causal inferences, particularly for observational data. In some cases, this can also be difficult even when dealing with data generated from randomized experiments (often considered the gold standard of causal evidence) especially if you want to look at effects within certain subgroups (referred to as heterogeneous treatment effects). I'll briefly talk about these issues in an interventional context, e.g. does treatment A work compared to treatment B? Or: does X increase your risk of cancer? It's still a bit early and I haven't had enough coffee yet, but please hear me out.

It's important to understand that AI/ML really do not offer much in the way of tools for inferring causation. Nor do they really do much better at inferring associations compared to more traditional statistical techniques. However, if causation's what you're interested in, the field of statistics also doesn't offer much for you either. A lot of this has to do with tradition: statisticians have historically assumed that your data came from a randomized experiment, or that you otherwise had control over the data-generating process and so could design your experiment in order to answer the question that you're interested in. (Cf. Fisher's field experiments.)

So, historically, in a sense, inference and experimental design have been tightly coupled, and statistical tools were developed solely in order to interpret experimental data. The notion of observational studies didn't really come into vogue until the 1980s, after propensity score matching had been developed and really fleshed out, and when computers made observational data easier to collect, store, and analyze. Around this time, economists started to put out a lot of good work on causal inference and quasi-experimental methods, including instrumental variables, regression discontinuity designs, and difference-in-differences. In addition, this period also saw an increased focus on the value of randomization in medicine, and this is when randomized controlled trials and the whole field of evidence-based medicine really began to take off.

Randomization is very important—it basically bakes causal interpretations into the inferences you make. I can't emphasize that point enough—without randomization, you've got your work cut out for you. If you have an observational dataset, the overriding strategy that you can use to get at causal inferences is to attempt to emulate a randomized trial. You can do this with—for example—propensity score matching, or with instrumental variables. Propensity score matching tries to model the decision-making process behind who gets treatment and who doesn't, or who gets exposed and who doesn't, and uses this model to adjust for selection bias. Sometimes you can't measure all these factors, and so you'd have to resort to instrumental variables, which look at "encouragements" or nudges towards getting the treatment and extract the little bits of randomness out of these encouragements.
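Here's a deliberately bare-bones sketch of the propensity-score idea on simulated data (assuming scikit-learn; a real analysis would also need to check overlap, balance, and the no-unmeasured-confounding assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Simulated observational data: sicker patients are more likely to be treated
# and also have worse outcomes, so the naive comparison is biased.
severity = rng.normal(size=n)
treated = (rng.random(n) < 1 / (1 + np.exp(-severity))).astype(int)
outcome = 2.0 * treated - 3.0 * severity + rng.normal(size=n)  # true effect is +2

print("naive difference in means:",
      outcome[treated == 1].mean() - outcome[treated == 0].mean())

# Step 1: model who gets treated, given observed covariates (the propensity score).
ps = (LogisticRegression()
      .fit(severity.reshape(-1, 1), treated)
      .predict_proba(severity.reshape(-1, 1))[:, 1])

# Step 2: match each treated unit to the control with the nearest propensity score
# (with replacement), then compare outcomes within matched pairs.
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
nearest = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]
print("matched estimate:", (outcome[t_idx] - outcome[nearest]).mean())
```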

Getting back to the question at hand, even with a large (or huge) dataset, your associations are still basically that—associations. You can't interpret them causally, no matter how big your data are. Even if you had data on everyone of interest for your problem (say the entire population of the US), your correlations still don't have causal interpretations. There is no critical point in terms of sample size beyond which your associations transmute into causal effects. And AI/ML, as effective as they are for some problems, still don't offer you any way around that fundamental point, nor are they fields where these questions are of primary interest. Of course, more data are always nice. But randomization, or something that looks like it, is even better.
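One way to see the "no critical sample size" point is a toy simulation with an unmeasured confounder (made-up setup): the naive estimate converges quickly, but it converges to the wrong number, and more data just makes it more precisely wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_effect(n):
    # Confounded world: U raises both the exposure X and the outcome Y,
    # while the true causal effect of X on Y is exactly zero.
    u = rng.normal(size=n)
    x = u + rng.normal(size=n)
    y = 2 * u + rng.normal(size=n)
    # "Effect" of X on Y estimated by simple regression, ignoring U.
    return np.cov(x, y)[0, 1] / np.var(x)

for n in (10**3, 10**5, 10**7):
    print(f"n = {n:>10,}: estimated effect = {naive_effect(n):.3f}   (true effect: 0)")
```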
posted by un petit cadeau at 9:37 AM on December 6, 2019 [26 favorites]


Judea Pearl's Turing Award page gives a good summary of his work and gets into some of the correlation/causation stuff, and the contrasts between old-school logic based AI and probabilistic machine learning. (It was written a few years before the deep learning revolution so that's not even on the radar though.)
posted by vogon_poet at 9:38 AM on December 6, 2019


On the other hand, if you're trying to predict "what would have happened if I had given these patients medicine X", now you're estimating a counterfactual, and ignoring causality can totally screw you up.

A number of the examples given in answers here, of cases where the user "wouldn't care", have problems, but to make it concrete, here's a real-world example of how the kind of "blindness" the OP proposes can really, really screw things up: A Health Care Algorithm Offered Less Care To Black Patients. Various kinds of bias are at least a major cause of a huge range of societal phenomena, so if you're indifferent to causality, you're likely to end up reinforcing that bias, now with an added veneer of fake scientific legitimacy!
posted by praemunire at 10:32 AM on December 6, 2019 [6 favorites]


I teach machine learning evaluation. What I tell my students is that the point of machine learning is generalization. Finding a correlation in your training data is useless if it's not something that is going to stand up in your deployment scenario. If it does stand up in deployment, determining causation is not important. In most cases, causation is not a problem that machine learning practitioners intend to solve.

One of the biggest problems right now is that people often set up their machine learning evaluation in ways that give them what look like good metrics on their test data but that do not actually stand up in deployment. One way this can happen is that the test data does not really look like the data that will be used in deployment. Another is through overfitting to the test data. Many people know they're not supposed to train on the test data to avoid this, but there are many subtle ways to accidentally leak test data into your model without realizing it.
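Here's one classic version of that subtle leakage, sketched on pure noise (assuming scikit-learn): selecting features using the whole dataset before cross-validating lets the test folds influence the model, and the "accuracy" it reports is a mirage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Pure noise: 50 samples, 1000 useless features, random labels.
# Honest accuracy on new data can only hover around 50%.
X = rng.normal(size=(50, 1000))
y = rng.integers(0, 2, size=50)

# Leaky setup: choose the 20 features most associated with y using ALL the data,
# then cross-validate. The test folds already shaped the feature choice.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Correct setup: do the selection inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # looks impressive, means nothing
print(f"honest CV accuracy: {honest:.2f}")  # back to roughly chance
```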

Generalization does not require an enormous dataset, but it helps, especially if you have a complex model with a large number of parameters.
posted by grouse at 1:34 PM on December 6, 2019 [5 favorites]


The history of studies about whether women under 55 should get mammograms - one telling here - is an instructive example of some of the problems that can happen even with huge amounts of data that others have mentioned above.
posted by clawsoon at 2:20 PM on December 6, 2019


...although I'm sure I've read a better telling, which gets more into the data issues, elsewhere.
posted by clawsoon at 2:21 PM on December 6, 2019


Part 4 of Emperor of All Maladies.
posted by clawsoon at 2:28 PM on December 6, 2019


Possibly relevant: A Wired article on why deep learning systems cannot answer reading comprehension questions
posted by meaty shoe puppet at 3:05 PM on December 21, 2019


This thread is closed to new comments.