Correlation vs Causation

The Key Difference

Correlation: Two variables move together (relationship exists)

Causation: One variable directly causes changes in the other

Critical point: Correlation does NOT prove causation!

Classic Example

Ice cream sales and drowning deaths are correlated.

❌ Wrong conclusion: Ice cream causes drowning

✅ Right explanation: Both increase in summer (hidden variable: temperature)
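
The hidden-variable explanation can be checked with a small simulation. This is a minimal sketch on synthetic data: the coefficients, noise levels, and the `residuals` helper are assumptions made for illustration, not real sales or drowning figures. Temperature drives both series, and neither series affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily data: temperature drives both series independently.
temperature = rng.uniform(10, 35, size=365)                # degrees C
ice_cream = 20 * temperature + rng.normal(0, 40, 365)      # units sold
drownings = 0.3 * temperature + rng.normal(0, 1.0, 365)    # incidents

def residuals(y, x):
    """Remove the linear effect of x from y (simple least squares)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation before and after accounting for temperature
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]
partial_r = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:            {raw_r:.2f}")
print(f"After removing temperature: {partial_r:.2f}")
```

In this setup the raw correlation is strong, while the correlation between the two sets of residuals sits near zero, which is what you would expect when a third variable explains the whole relationship.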

Types of Relationships

A causes B: Smoking → Lung cancer
B causes A (reverse causation): You assumed A caused B, but the effect actually runs the other way
C causes both A and B (confounding): Summer → Ice cream sales + Drownings
Coincidence: No real relationship; the pattern arose by chance

Spurious Correlation Examples

# Example 1: Number of films Nicolas Cage appears in
# vs. swimming pool drownings
# Moderately correlated (about 0.66), but there is no plausible causal link

# Example 2: Divorce rate in Maine
# vs. per capita margarine consumption
# Coincidence, no causation

Calculating Correlation in Python

import numpy as np
from scipy.stats import pearsonr
import pandas as pd

# Create sample data
ice_cream_sales = [100, 200, 300, 400, 500]
drownings = [10, 18, 25, 32, 40]

# Calculate Pearson correlation
correlation, p_value = pearsonr(ice_cream_sales, drownings)
print(f"Correlation: {correlation:.3f}")  # 1.000 (r ≈ 0.9996)
print(f"P-value: {p_value:.2e}")          # far below 0.001

# A near-perfect correlation in this sample data,
# but it says nothing about causation!

# Pandas approach
df = pd.DataFrame({
    'ice_cream': ice_cream_sales,
    'drownings': drownings
})
correlation_matrix = df.corr()
print(correlation_matrix)
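
To see what the Pearson coefficient actually measures, it can be computed by hand from the deviations of each variable from its mean. This sketch reuses the sample numbers above; the variable names `dx`, `dy`, and `r_manual` are introduced here for illustration.

```python
import numpy as np

x = np.array([100, 200, 300, 400, 500], dtype=float)  # ice cream sales
y = np.array([10, 18, 25, 32, 40], dtype=float)       # drownings

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"Manual r: {r_manual:.4f}")                   # 0.9996
print(f"NumPy r:  {np.corrcoef(x, y)[0, 1]:.4f}")    # 0.9996
```

The formula only uses how the two variables co-vary. Nothing in it encodes which variable came first or whether a third variable is involved.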

How to Establish Causation

1. Randomized Controlled Trial (RCT): Gold standard

# Randomly assign participants to treatment and control groups
# Randomization balances confounders (known and unknown) across groups on average

import random
users = list(range(1000))
random.shuffle(users)

control_group = users[:500]      # Don't get treatment
treatment_group = users[500:]    # Get treatment

# Measure the outcome in both groups
# A statistically significant difference between groups → evidence of a causal effect
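
Why random assignment helps can be sketched with simulated participants. Everything here is an assumption for illustration: the hidden trait `baseline_health`, the `true_effect` of 2.0, and the noise levels. The point is that a coin-flip assignment is unrelated to the hidden trait, so a plain difference in group means lands close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical hidden trait that strongly influences the outcome
baseline_health = rng.normal(0, 1, size=n)

# Random assignment: a coin flip per participant, independent of the trait
treated = rng.random(n) < 0.5

true_effect = 2.0  # assumed for the simulation
outcome = 3.0 * baseline_health + true_effect * treated + rng.normal(0, 1, n)

# The hidden trait should be nearly identical across the two groups
balance_gap = baseline_health[treated].mean() - baseline_health[~treated].mean()

# Difference in mean outcome estimates the treatment effect
estimated_effect = outcome[treated].mean() - outcome[~treated].mean()

print(f"Hidden-trait gap between groups: {balance_gap:.3f}")
print(f"Estimated effect: {estimated_effect:.2f} (true effect: {true_effect})")
```

With groups this large, the hidden trait is balanced to within sampling noise even though the analysis never measures it.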

2. Temporal Ordering: Cause must come before effect

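# Does A happen before B?
# If B happens first, A can't have caused B
# (A coming first is necessary for causation, but not sufficient)

from datetime import datetime

# Placeholder timestamps for illustration
timestamp_A = datetime(2024, 1, 1, 9, 0)
timestamp_B = datetime(2024, 1, 1, 9, 30)

# Check timestamps
if timestamp_A < timestamp_B:
    print("A could cause B (time-wise)")
else:
    print("A cannot have caused B")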
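
For time series, one rough way to probe temporal ordering is to compare correlations at different lags. The sketch below uses synthetic data in which `b` is built to echo `a` three steps later; the series, the lag of 3, and the `lagged_corr` helper are assumptions for illustration. A lead-lag pattern is consistent with A influencing B, but it does not rule out a confounder that acts on A earlier than on B.

```python
import numpy as np

rng = np.random.default_rng(1)

n, true_lag = 500, 3
a = rng.normal(size=n)

# Hypothetical setup: b echoes a three steps later, plus noise
b = np.zeros(n)
b[true_lag:] = a[:-true_lag]
b += rng.normal(scale=0.5, size=n)

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 0."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

# Does a predict later values of b, or does b predict later values of a?
a_leads = {lag: lagged_corr(a, b, lag) for lag in range(6)}
b_leads = {lag: lagged_corr(b, a, lag) for lag in range(6)}
best_lag = max(a_leads, key=a_leads.get)

print(f"Strongest 'a leads b' correlation at lag {best_lag}")
print(f"Largest 'b leads a' correlation: {max(abs(v) for v in b_leads.values()):.2f}")
```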

3. Control for Confounders: Rule out hidden variables

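# Example: Does education cause higher income?
# Confounders: family background, intelligence, location

# Statistical control (regression)
# Assumes df is a survey DataFrame with the columns below (not defined
# in this article) and that location is a categorical column
import pandas as pd
from sklearn.linear_model import LinearRegression

# Include potential confounders; one-hot encode the categorical column
X = pd.get_dummies(df[['education', 'family_income', 'IQ', 'location']],
                   columns=['location'], drop_first=True)
y = df['income']

model = LinearRegression()
model.fit(X, y)
education_coef = model.coef_[list(X.columns).index('education')]

# LinearRegression does not report p-values; use a package such as
# statsmodels to test significance.
# If the education coefficient is still significant after
# controlling for confounders → stronger case for causation
# (unmeasured confounders can still bias the estimate)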
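
The effect of controlling for a confounder can be demonstrated on synthetic data where the true answer is known. In this sketch the data-generating numbers (a true education effect of 1.0 and a family-background effect of 2.0) and the `ols` helper are made up for illustration. It uses plain least squares from NumPy rather than scikit-learn to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical data-generating process (all numbers are made up)
family_background = rng.normal(size=n)                        # confounder
education = 0.8 * family_background + rng.normal(size=n)
income = 1.0 * education + 2.0 * family_background + rng.normal(size=n)

def ols(features, target):
    """Least-squares coefficients, with the intercept in position 0."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Regress income on education alone, then with the confounder included
naive = ols([education], income)[1]
adjusted = ols([education, family_background], income)[1]

print(f"Education coefficient, no controls:      {naive:.2f}")
print(f"Education coefficient, with confounder:  {adjusted:.2f}")
```

The unadjusted coefficient overstates the effect because it absorbs part of the family-background effect. Adding the confounder brings the estimate back near the value of 1.0 used to generate the data. This only works for confounders you have measured.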

Bradford Hill Criteria (Observational Studies)

When experiments aren't possible, these criteria help judge whether an association is likely causal. None of them is proof on its own:

Strength: Stronger association = more likely causal
Consistency: Same result across multiple studies?
Specificity: Does A only affect B, or everything?
Temporality: Does A come before B?
Dose-response: More A = more B?
Plausibility: Is there a believable mechanism?
Coherence: Does it fit with what is already known?
Experiment: Can you manipulate A and see B change?
Analogy: Do similar factors have similar effects?

Common Pitfalls

Confounding Variables

# Example: Coffee and heart disease
# Confounder: smoking (smokers tend to drink more coffee)
# Must control for smoking to see the true relationship

Reverse Causation

# Does depression cause poor sleep?
# Or does poor sleep cause depression?
# Could be bidirectional!

Selection Bias

# Gym membership correlates with fitness
# But: People who are already fit are more likely to join a gym
# Self-selection makes membership look more effective than it may be
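
Self-selection can produce a gap between groups even when the "treatment" does nothing. The following sketch assumes a made-up population in which fitness is fixed in advance and the gym has zero effect; the join probability is an arbitrary logistic curve chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical population: fitness is fixed in advance, and in this
# simulation the gym has NO effect on it at all.
fitness = rng.normal(50, 10, size=n)

# Fitter people are more likely to sign up (self-selection)
join_prob = 1 / (1 + np.exp(-(fitness - 50) / 5))
member = rng.random(n) < join_prob

# Compare average fitness of members and non-members
gap = fitness[member].mean() - fitness[~member].mean()

print(f"Members are fitter by {gap:.1f} points despite zero gym effect")
```

A naive comparison of members and non-members would credit the gym with the whole gap.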

Real-World Example: Does Exercise Cause Weight Loss?

# Observational study (correlation only):
# People who exercise weigh less
# But: Already-healthy people exercise more

# RCT study (tests causation):
# Randomly assign people to exercise or not
# Control diet and other factors
# Measure weight change
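# A significant difference now supports a causal claim

# Python simulation (synthetic data, for illustration only)
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# RCT simulation: weight change in lbs
control = rng.normal(0, 2, 100)      # No exercise, no change on average
treatment = rng.normal(-5, 2, 100)   # Exercise, -5 lbs average

t_stat, p_value = ttest_ind(control, treatment)

if p_value < 0.05:
    print("Significant difference: evidence that exercise causes weight loss")
else:
    print("No significant difference detected")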

Visual: Correlation Matrix

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (not causation!)')
plt.show()

Red Flags for False Causation

Correlation found by testing hundreds of variables
No plausible mechanism explaining the relationship
Correlation disappears when controlling for confounders
Time order is wrong (effect before cause)
Sample size is very small
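
The first and last red flags reinforce each other: with few data points and many candidate variables, strong correlations appear by chance alone. A minimal sketch, assuming 500 pure-noise variables compared against a 10-point target (both numbers are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_points, n_candidates = 10, 500   # small sample, many variables tested

# Every series is independent random noise: no real relationships exist
target = rng.normal(size=n_points)
candidates = rng.normal(size=(n_candidates, n_points))

# Correlate the target with each candidate and keep the strongest match
correlations = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(correlations).max()

print(f"Strongest correlation found among pure noise: {best:.2f}")
```

Reporting only the best match from a search like this would look like a discovery, even though every series is noise.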

Best Practices

Always ask: "What else could explain this relationship?"
Look for confounders: Hidden variables affecting both
Check temporal order: Does cause come before effect?
Use causal language carefully: Say "associated with", not "causes", unless the study design supports a causal claim
Design experiments: RCTs when possible

Key Takeaways:

Correlation ≠ Causation (memorize this!)
Correlation shows a relationship, not its direction or mechanism
Establish causation with RCTs, correct temporal order, and control of confounders
Always look for hidden variables that could explain both
Be skeptical of causal claims without experimental evidence