Correlation Isn't Causation: Be Careful Making These Mistakes

“Curiosity is what draws you out of your comfort zone; fear is what draws you back in.” – Marc Jacobs

THINKING TOOL

Correlation does not imply causation. This refers to the inability to establish a cause-and-effect relationship between events or variables solely on the basis of an observed association, or correlation. The idea that correlation implies causation is a logical fallacy, in which two events occurring together are wrongly taken to have a cause-and-effect relationship. Correlation measures how strongly two variables move together, but the fact that they move together does not tell us whether one causes the other.

A strong correlation can indicate causation, but there are often other explanations. One is random chance: variables can seem related when there is no underlying relationship. Another is the presence of a third, hidden variable that influences both, creating a relationship that is not really there or making a real one appear stronger than it is. A pattern in the data does not, by itself, imply that the variables have a cause-and-effect relationship. It is entirely possible to find a statistically significant and reliable correlation between two variables that are not causally linked at all. In fact, this is very common.
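To make the random-chance explanation concrete, here is a minimal sketch in Python. It compares one series of pure noise against many other series of pure noise and reports the strongest correlation it finds. The function names, the sample sizes, and the seed are illustrative assumptions, not part of any particular study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_candidates=1000, n_points=20, seed=0):
    """Compare one random series against many unrelated random series
    and return the correlation with the largest absolute value."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise: no link to the target exists.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


best_r = strongest_chance_correlation()
print(f"strongest correlation found by chance: {best_r:+.2f}")
```

With only 20 data points per series and 1,000 candidates to choose from, at least one candidate is very likely to look noticeably correlated with the target even though every series is independent noise. The more variables you compare, the more of these accidental patterns you should expect.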

Think about it with an example. Imagine you are observing health data. You see a statistically significant correlation between exercise and skin cancer: that is, people who exercise more tend to be the people who get skin cancer. This correlation is strong and reliable. It shows up across multiple populations of patients, in large and testable numbers. Without exploring further, you might infer that exercise causes skin cancer. You could even construct a plausible hypothesis: the stress response induced by physical exertion causes the body to lose its ability to shield itself from ultraviolet rays, resulting in damage to the skin and thus skin cancer.

In reality, the correlation in your dataset only exists because people who live in places with a lot of sunlight year round tend to exercise more. They are significantly more active in their daily lives than those who live in less sunny places. In the data, this shows up as increased exercise. Meanwhile, the increase in sun exposure means they are more prone to skin cancer. Determining causes is never perfect in the real world. That is why we conduct randomized controlled clinical trials and use predictive models with multiple variables. Careful experiment design helps ensure we don’t fall for the fallacy.
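The exercise and skin cancer scenario can be simulated in a few lines of Python. In this sketch, sunlight drives both exercise and cancer risk, while exercise has no effect on cancer at all. Every number here (hours of exercise, risk levels, sample size) is an invented assumption chosen for illustration, not real health data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, randomized=False, seed=1):
    """Return (sunny, exercise_hours, cancer) rows.

    Cancer risk depends only on sunlight. Exercise depends on sunlight
    in the observational case, or on a coin flip in the randomized case.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sunny = rng.random() < 0.5
        # In a randomized trial, who exercises more is decided by chance,
        # which breaks the link between sunlight and exercise.
        active = (rng.random() < 0.5) if randomized else sunny
        exercise = rng.gauss(3.0 + (3.0 if active else 0.0), 1.0)
        risk = 0.05 + (0.15 if sunny else 0.0)  # exercise plays no role
        cancer = 1 if rng.random() < risk else 0
        rows.append((sunny, exercise, cancer))
    return rows


def corr_exercise_cancer(rows):
    return pearson([r[1] for r in rows], [r[2] for r in rows])


observed = simulate()
raw = corr_exercise_cancer(observed)
sunny_only = corr_exercise_cancer([r for r in observed if r[0]])
shady_only = corr_exercise_cancer([r for r in observed if not r[0]])
trial = corr_exercise_cancer(simulate(randomized=True))

print(f"observational data, everyone:   {raw:+.3f}")
print(f"observational data, sunny only: {sunny_only:+.3f}")
print(f"observational data, shady only: {shady_only:+.3f}")
print(f"randomized exercise assignment: {trial:+.3f}")
```

In the pooled observational data, exercise and cancer are clearly correlated. Once you look within the sunny group or the shady group separately, or assign exercise at random, the correlation should shrink to roughly zero, because sunlight was doing all the work.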

Real-life implications of correlation-causation relationships:

  • Scientific research: academics ensure their studies distinguish between correlation and causation to avoid misleading conclusions, like in studies linking coffee to reduced mortality where controlling for confounding factors like exercise or lifestyle habits is crucial;

  • Health: misinterpreted correlations can result in ineffective treatments or panic, as when early studies suggested hormone replacement therapy reduced heart disease risk in women, and later research revealed that the correlation arose largely because healthier women were the ones using the therapy;

  • Business: companies analyze data carefully before jumping to conclusions, as when a retailer notices sales of umbrellas and boots rising together and, instead of inferring that boots cause umbrella sales, looks to a weather forecast that shows extra rain;

  • Advertising: marketers distinguish between causation and correlation to avoid wasting their time, money, and effort on ineffective campaigns, like when ad campaigns coincide with increased sales, but deeper analyses reveal that seasonality, not the campaign, drove the uptick;

  • Personal: you can avoid misguided decisions by understanding that patterns don’t always imply causation, as when you see a rise in productivity, which does not necessarily mean your new routine caused the improvement, since maybe you just had a better night’s rest.

How you might employ correlation-causation as a mental model: (1) investigate confounding variables, since when two things appear related, an external factor may explain the connection, such as when two neighborhoods have different crime rates and you examine socioeconomic factors before assuming that a single variable like education causes the disparity; (2) demand evidence of causation, avoiding conclusions based on observational data alone, like looking for controlled studies instead of believing a dietary supplement company’s promises; (3) trust randomized controlled trials (RCTs), as they are the gold standard for establishing causation, since they isolate variables and reduce bias; (4) beware of the post hoc fallacy, since one event following another does not mean the former caused the latter; (5) consider bidirectional causation, since sometimes causation flows both ways and creates a feedback loop, as when higher education levels lead to better job prospects and vice versa; (6) explore statistical methods like Granger causality tests, which check whether one time series helps predict another, keeping in mind that this is evidence of predictive precedence rather than proof of causation.
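One simple way to act on point (1) is a partial correlation, which asks how strongly two variables are related once a suspected third variable is held fixed. The sketch below reuses the umbrella and boots example with rainfall as the suspected confounder. The data is simulated, and the variable names and coefficients are assumptions made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear
    influence of zs (first-order partial correlation)."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(2)
# Simulated daily figures: rain drives both products, and neither
# product has any effect on the other.
rain = [rng.gauss(0, 1) for _ in range(5000)]
umbrellas = [2.0 * r + rng.gauss(0, 1) for r in rain]
boots = [1.5 * r + rng.gauss(0, 1) for r in rain]

raw_r = pearson(umbrellas, boots)
adjusted_r = partial_correlation(umbrellas, boots, rain)

print(f"umbrellas vs boots:                   {raw_r:+.2f}")
print(f"umbrellas vs boots, holding rain fixed: {adjusted_r:+.2f}")
```

The raw correlation between the two products is strong, but it should nearly vanish once rainfall is accounted for. A partial correlation only adjusts for the variables you thought to measure, so it narrows the list of explanations rather than proving causation.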

Thought-provoking insights. The saying “Correlation is not causation, but it sure is a hint” reminds us that while correlation does not prove causation, it can guide exploration and hypothesis testing. The third variable problem: a correlation often stems from a variable nobody considered, so deeper analysis is often needed to uncover the true cause. Data overload: in the era of big data, spurious correlations are common, which calls for careful interpretation. “Absence of evidence is not evidence of absence” is a classic saying warning that causation may be present even when it is not apparent. Put rigorous investigation above all else. Rely on robust evidence. Be a calculated skeptic.