Correlation vs Causation
The Key Difference
Correlation: Two variables move together (relationship exists)
Causation: One variable directly causes changes in the other
Critical point: Correlation does NOT prove causation!
Classic Example
Ice cream sales and drowning deaths are correlated.
❌ Wrong conclusion: Ice cream causes drowning
✅ Right explanation: Both increase in summer (hidden variable: temperature)
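The hidden-variable explanation can be sketched with simulated data. Every number below (temperature range, coefficients, noise levels) is invented for illustration and is not real sales or drowning data. Temperature drives both series, so they correlate; once the straight-line effect of temperature is removed from each, the leftover correlation should sit near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical daily temperatures (deg C); all coefficients are made up.
temperature = rng.uniform(10, 35, n)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, n)
drownings = 0.5 * temperature + rng.normal(0, 2, n)

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation between the two series as observed.
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

# Correlation after the temperature effect is removed from both series.
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation: {raw_r:.2f}")
print(f"After removing temperature: {partial_r:.2f}")
```

Neither series appears in the formula for the other, so any raw correlation here comes entirely from the shared driver.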
Types of Relationships
- A causes B: Smoking → Lung cancer
- B causes A (reverse causation): You thought A caused B, but the arrow runs the other way
- C causes both A and B (confounding): Summer → Ice cream sales + Drownings
- Coincidence: No real relationship, just chance
Spurious Correlation Examples
# Example 1: Number of films Nicolas Cage appears in
# correlates with swimming pool drownings (r ≈ 0.66)
# A moderate correlation, but obviously no causation: random coincidence
# Example 2: Divorce rate in Maine
# correlates with per capita margarine consumption
# Pure coincidence, no causation
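Coincidences like these are easy to manufacture by searching enough variables. The sketch below uses purely random numbers (the series length of 10 and the 1000 candidates are arbitrary choices): it compares one random "target" series against many unrelated random series and reports the strongest correlation found.

```python
import numpy as np

rng = np.random.default_rng(42)

n_years = 10         # short annual series, an arbitrary choice for this sketch
n_candidates = 1000  # number of unrelated variables to search through

target = rng.normal(size=n_years)
candidates = rng.normal(size=(n_candidates, n_years))

# Pearson r between the target and each unrelated candidate series.
r_values = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])

best = np.argmax(np.abs(r_values))
print(f"Strongest |r| among {n_candidates} random series: {abs(r_values[best]):.2f}")
print(f"Share with |r| > 0.6: {np.mean(np.abs(r_values) > 0.6):.1%}")
```

None of the series are related by construction, so every strong correlation in the output is a coincidence produced by the search itself.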
Calculating Correlation in Python
import numpy as np
from scipy.stats import pearsonr
import pandas as pd
# Create sample data
ice_cream_sales = [100, 200, 300, 400, 500]
drownings = [10, 18, 25, 32, 40]
# Calculate Pearson correlation
correlation, p_value = pearsonr(ice_cream_sales, drownings)
print(f"Correlation: {correlation:.3f}") # prints 1.000 (r ≈ 0.9996)
print(f"P-value: {p_value:.4f}") # prints 0.0000 (p is far below 0.0001)
# Strong correlation, but that alone says nothing about causation!
# Pandas approach
df = pd.DataFrame({
'ice_cream': ice_cream_sales,
'drownings': drownings
})
correlation_matrix = df.corr()
print(correlation_matrix)
How to Establish Causation
1. Randomized Controlled Trial (RCT): Gold standard
# Randomly assign treatment and control groups
# Randomization balances confounders (known and unknown) across groups on average
import random
users = list(range(1000))
random.shuffle(users)
control_group = users[:500] # Don't get treatment
treatment_group = users[500:] # Get treatment
# Measure outcome in both groups
# If the groups differ by more than chance would explain → evidence of causation
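Why randomization helps can be shown with a small simulation. The confounder here, a "baseline activity" score, and the logistic opt-in rule are hypothetical. The sketch compares how unbalanced that confounder is when users choose the treatment themselves versus when assignment is random.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical confounder: each user's baseline activity level.
baseline_activity = rng.normal(50, 10, n)

# Self-selection: more active users are more likely to opt in (made-up rule).
opt_in_prob = 1 / (1 + np.exp(-(baseline_activity - 50) / 5))
self_selected = rng.random(n) < opt_in_prob

# Randomization: a random permutation puts exactly half the users in treatment,
# independent of their activity level.
randomized = rng.permutation(n) < n // 2

def imbalance(treated):
    """Difference in mean baseline activity, treated minus untreated."""
    return baseline_activity[treated].mean() - baseline_activity[~treated].mean()

print(f"Self-selected imbalance: {imbalance(self_selected):.2f}")
print(f"Randomized imbalance:    {imbalance(randomized):.2f}")
```

With self-selection the groups already differ before any treatment is given; with random assignment the difference should be close to zero.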
2. Temporal Ordering: Cause must come before effect
# Does A happen before B?
# If B happens first, A can't cause B
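# Check timestamps (example values; substitute your own event times)
from datetime import datetime
timestamp_A = datetime(2024, 1, 1, 9, 0)
timestamp_B = datetime(2024, 1, 1, 9, 30)
if timestamp_A < timestamp_B:
    print("A could cause B (time-wise)")
else:
    print("A cannot cause B")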
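With whole time series rather than two timestamps, one common way to probe ordering is lagged correlation. This is a swapped-in technique, not something the text above prescribes, and the three-step delay in the simulation is an assumption. The sketch builds B as a delayed, noisy copy of A and checks which lag gives the strongest correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
true_lag = 3  # assumed for the simulation: B responds to A three steps later

a = rng.normal(size=n)
b = np.roll(a, true_lag) + rng.normal(0, 0.5, n)
b[:true_lag] = rng.normal(size=true_lag)  # overwrite values wrapped around by np.roll

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]; positive lag means x comes first."""
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    if lag < 0:
        return np.corrcoef(x[-lag:], y[:lag])[0, 1]
    return np.corrcoef(x, y)[0, 1]

correlations = {lag: lagged_corr(a, b, lag) for lag in range(-5, 6)}
best_lag = max(correlations, key=lambda k: abs(correlations[k]))
print(f"Strongest correlation at lag {best_lag}: {correlations[best_lag]:.2f}")
```

A positive best lag only says that A's movements precede B's. Precedence is necessary for causation, not sufficient.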
3. Control for Confounders: Rule out hidden variables
# Example: Does education cause higher income?
# Confounders: family background, intelligence, location
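# Statistical control (regression)
# Assumes df holds these columns (a different dataset from the ice cream df above)
import pandas as pd
from sklearn.linear_model import LinearRegression
# Include potential confounders; one-hot encode the categorical 'location'
X = pd.get_dummies(df[['education', 'family_income', 'IQ', 'location']], columns=['location'], drop_first=True)
y = df['income']
model = LinearRegression()
model.fit(X, y)
print(dict(zip(X.columns, model.coef_)))
# If the education coefficient stays clearly non-zero after
# controlling for confounders → stronger case for causation
# (LinearRegression gives no p-values; a package such as statsmodels reports them)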
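What "controlling for" does to a coefficient can be seen on simulated data where the truth is known. The data-generating process below is hypothetical: the true effect of education on income is set to 1.0, and family income pushes both up. The sketch fits ordinary least squares with NumPy, once without and once with the confounder.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical data-generating process; every coefficient here is invented.
family_income = rng.normal(0, 1, n)                    # confounder
education = 0.8 * family_income + rng.normal(0, 1, n)  # influenced by the confounder
income = 1.0 * education + 2.0 * family_income + rng.normal(0, 1, n)

def ols_coefficients(predictors, outcome):
    """Least-squares coefficients, with the intercept as the first entry."""
    design = np.column_stack([np.ones(len(outcome))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs

naive = ols_coefficients([education], income)
adjusted = ols_coefficients([education, family_income], income)

print(f"Education coefficient, no controls: {naive[1]:.2f}")
print(f"Education coefficient, controlling for family income: {adjusted[1]:.2f}")
```

The uncontrolled fit credits education with part of the family-income effect. Adjustment only recovers the true value here because the confounder was measured and included; an unmeasured confounder would still bias the estimate.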
Bradford Hill Criteria (Observational Studies)
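When a randomized experiment isn't possible, weigh the evidence against these criteria (seven of Hill's nine are listed; the other two are coherence and analogy):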
- Strength: Stronger correlation = more likely causal
- Consistency: Same result across multiple studies?
- Specificity: Does A only affect B, or everything?
- Temporality: Does A come before B?
- Dose-response: More A = more B?
- Plausibility: Is there a believable mechanism?
- Experiment: Can you manipulate A and see B change?
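The dose-response criterion is one of the easier ones to check numerically. A minimal sketch, assuming five hypothetical exposure levels and an invented true slope of 0.5: compute the mean outcome at each dose and see whether it rises step by step.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical exposure levels of A and an invented linear effect on B.
doses = np.repeat([0, 1, 2, 3, 4], 200)
outcome = 0.5 * doses + rng.normal(0, 1, doses.size)

# Mean outcome at each dose level.
levels = np.unique(doses)
mean_by_dose = np.array([outcome[doses == d].mean() for d in levels])

for d, m in zip(levels, mean_by_dose):
    print(f"dose {d}: mean outcome {m:.2f}")

# A steady rise across levels is consistent with a dose-response gradient.
is_monotonic = bool(np.all(np.diff(mean_by_dose) > 0))
print("Monotonic increase:", is_monotonic)
```

A gradient like this supports a causal reading but does not prove one, since a confounder that scales with the dose would produce the same pattern.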
Common Pitfalls
Confounding Variables
# Example: Coffee and heart disease
# Confounders: smoking (smokers drink more coffee)
# Must control for smoking to see true relationship
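Stratifying by the confounder is one simple way to "control for smoking". The population below is simulated, and every probability is invented: heart disease depends on smoking only, yet coffee drinkers show more disease overall because smokers drink more coffee.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20000

# Hypothetical population; every probability below is invented for illustration.
smoker = rng.random(n) < 0.3
# Smokers are more likely to be coffee drinkers.
coffee = rng.random(n) < np.where(smoker, 0.7, 0.3)
# In this simulation heart disease depends on smoking only, not on coffee.
disease = rng.random(n) < np.where(smoker, 0.20, 0.05)

def risk_difference(mask):
    """Disease rate among coffee drinkers minus non-drinkers, within mask."""
    return disease[mask & coffee].mean() - disease[mask & ~coffee].mean()

everyone = np.ones(n, dtype=bool)
print(f"Overall:     {risk_difference(everyone):+.3f}")
print(f"Smokers:     {risk_difference(smoker):+.3f}")
print(f"Non-smokers: {risk_difference(~smoker):+.3f}")
```

The overall gap should shrink toward zero inside each stratum, which is the pattern expected when a confounder explains the association.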
Reverse Causation
# Does depression cause poor sleep?
# Or does poor sleep cause depression?
# Could be bidirectional!
Selection Bias
# Gym membership correlates with fitness
# But: People who are already fit are more likely to join gyms
# Self-selection into the group distorts the comparison
Real-World Example: Does Exercise Cause Weight Loss?
# Observational study (correlation only):
# People who exercise weigh less
# But: Already-healthy people exercise more
# RCT study (tests causation):
# Randomly assign people to exercise or not
# Control diet and other factors
# Measure weight change
# A clear difference between the groups now supports a causal claim
# Python simulation
import numpy as np
from scipy.stats import ttest_ind
# RCT simulation (seeded so the run is reproducible)
rng = np.random.default_rng(0)
control = rng.normal(0, 2, 100) # No exercise, no change on average
treatment = rng.normal(-5, 2, 100) # Exercise, -5 lbs average
t_stat, p_value = ttest_ind(control, treatment)
if p_value < 0.05:
    print("Exercise causes weight loss!")
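A p-value says whether the groups differ, not by how much. As a sketch under the same simulated setup (the seed and the normal-approximation interval are choices made here, not part of the original example), the effect size and a rough 95% confidence interval can be reported alongside it:

```python
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(0, 2, 100)     # simulated weight change (lbs), no exercise
treatment = rng.normal(-5, 2, 100)  # simulated weight change (lbs), exercise

# Estimated effect: difference in mean weight change between the groups.
diff = treatment.mean() - control.mean()

# Standard error of the difference (allows unequal variances).
se = np.sqrt(treatment.var(ddof=1) / treatment.size
             + control.var(ddof=1) / control.size)

# Normal approximation (z = 1.96) for the 95% interval; reasonable at n = 100 per group.
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Estimated effect: {diff:.2f} lbs (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An interval that excludes zero tells the same story as a small p-value, and it also shows how large the effect plausibly is.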
Visual: Correlation Matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Correlation heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (not causation!)')
plt.show()
Red Flags for False Causation
- Correlation found by testing hundreds of variables
- No plausible mechanism explaining the relationship
- Correlation disappears when controlling for confounders
- Time order is wrong (effect before cause)
- Sample size is very small
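The small-sample red flag can be quantified with a quick simulation. The sketch below draws pairs of completely unrelated random samples (the sample sizes and the 0.5 threshold are arbitrary) and counts how often they still show a sizeable correlation.

```python
import numpy as np

rng = np.random.default_rng(8)
n_trials = 10000

def null_correlations(sample_size):
    """Pearson r for many pairs of independent (truly unrelated) samples."""
    x = rng.normal(size=(n_trials, sample_size))
    y = rng.normal(size=(n_trials, sample_size))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# Share of unrelated pairs with |r| > 0.5 at each sample size.
shares = {}
for size in (5, 50, 500):
    r = null_correlations(size)
    shares[size] = np.mean(np.abs(r) > 0.5)
    print(f"n = {size:>3}: {shares[size]:.1%} of unrelated pairs show |r| > 0.5")
```

With only five points, a large share of unrelated pairs clears the threshold by chance; with hundreds of points, essentially none do.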
Best Practices
- Always ask: "What else could explain this relationship?"
- Look for confounders: Hidden variables affecting both
- Check temporal order: Does cause come before effect?
- Use causal language carefully: Say "associated with" rather than "causes" unless the evidence is experimental
- Design experiments: RCTs when possible
Key Takeaways:
- Correlation ≠ Causation (memorize this!)
- Correlation shows a relationship, not its causal direction or mechanism
- Establish causation with RCTs, temporal order, and controlling for confounders
- Always look for hidden variables that could explain both
- Be skeptical of causal claims without experimental evidence