Useful Data Tips

Correlation vs Causation

📈 Data Analysis ⏱️ 30 sec read

The Key Difference

Correlation: Two variables move together (relationship exists)

Causation: One variable directly causes changes in the other

Critical point: Correlation does NOT prove causation!

Classic Example

Ice cream sales and drowning deaths are correlated.

Wrong conclusion: Ice cream causes drowning

Right explanation: Both increase in summer (hidden variable: temperature)
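To see how a hidden variable can produce this pattern, here is a minimal simulation sketch. All numbers are made up for illustration: temperature drives both series, and neither series affects the other. The helper `residuals` is a hypothetical name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily data: temperature drives both series independently.
temperature = rng.uniform(10, 35, size=365)                   # degrees C
ice_cream = 20 * temperature + rng.normal(0, 40, size=365)    # units sold
drownings = 0.3 * temperature + rng.normal(0, 1.5, size=365)  # incidents


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

# Partial correlation: correlate only the parts temperature does not explain.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation:             {raw_r:.2f}")
print(f"After removing temperature:  {partial_r:.2f}")
```

The raw correlation is strong, while the correlation left after removing temperature should sit near zero, which is what you would expect if temperature is the only link.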

Types of Relationships

Spurious Correlation Examples

# Example 1: Number of films Nicolas Cage appears in per year
# correlates with swimming pool drownings (r ≈ 0.66)
# Random coincidence, no plausible causal link

# Example 2: Divorce rate in Maine
# correlates with per capita margarine consumption
# Pure coincidence, no causation
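Coincidences like these are easy to find when you compare enough series. The sketch below assumes 200 unrelated random series of 10 points each (both numbers are arbitrary choices) and counts how many pairs look strongly correlated by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 unrelated "yearly statistics", each only 10 years long.
n_series, n_years = 200, 10
data = rng.normal(size=(n_series, n_years))

corr = np.corrcoef(data)                       # 200 x 200 correlation matrix
upper = corr[np.triu_indices(n_series, k=1)]   # each pair once, no diagonal

strong = np.abs(upper) > 0.8
print(f"Pairs compared:        {upper.size}")
print(f"Pairs with |r| > 0.8:  {strong.sum()}")
print(f"Largest |r| found:     {np.abs(upper).max():.2f}")
```

Short series plus many comparisons is the usual recipe behind headline-grabbing spurious correlations.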

Calculating Correlation in Python

import numpy as np
from scipy.stats import pearsonr
import pandas as pd

# Create sample data
ice_cream_sales = [100, 200, 300, 400, 500]
drownings = [10, 18, 25, 32, 40]

# Calculate Pearson correlation
correlation, p_value = pearsonr(ice_cream_sales, drownings)
print(f"Correlation: {correlation:.3f}")  # r ≈ 0.9996, so this prints 1.000
print(f"P-value: {p_value:.4f}")           # p < 0.0001, so this prints 0.0000

# Strong correlation, but that alone says nothing about causation!

# Pandas approach
df = pd.DataFrame({
    'ice_cream': ice_cream_sales,
    'drownings': drownings
})
correlation_matrix = df.corr()
print(correlation_matrix)
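For intuition about what the coefficient measures, here is a small sketch that computes Pearson's r by hand (covariance divided by the product of the spreads) on the same sample data and checks it against NumPy. The variable names `dx`, `dy`, and `r_manual` are just illustrative.

```python
import numpy as np

ice_cream_sales = np.array([100, 200, 300, 400, 500], dtype=float)
drownings = np.array([10, 18, 25, 32, 40], dtype=float)

# Deviations of each value from its own mean
dx = ice_cream_sales - ice_cream_sales.mean()
dy = drownings - drownings.mean()

# Pearson r: sum of co-deviations, scaled by the size of both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
r_numpy = np.corrcoef(ice_cream_sales, drownings)[0, 1]

print(f"Manual r: {r_manual:.4f}")  # 0.9996
print(f"NumPy r:  {r_numpy:.4f}")   # 0.9996
```

Nothing in this formula looks at direction or mechanism. It only measures how closely the two series move along a straight line.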

How to Establish Causation

1. Randomized Controlled Trial (RCT): Gold standard

import random

# Randomly assign treatment and control groups
# Randomization balances confounders (known and unknown)
# across the groups on average

users = list(range(1000))
random.shuffle(users)

control_group = users[:500]      # Don't get treatment
treatment_group = users[500:]    # Get treatment

# Measure outcome in both groups
# If the groups differ by more than chance → evidence of causation
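Why does random assignment help? A rough sketch, assuming a made-up confounder called `prior_activity`: with random assignment the two groups end up with nearly the same average, while self-selection leaves a large gap before any treatment happens.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 1000

# Hypothetical confounder: how active each user already was.
prior_activity = rng.normal(50, 10, size=n_users)

# Random assignment: shuffle the indices, then split in half.
order = rng.permutation(n_users)
control_idx, treatment_idx = order[:500], order[500:]

# Self-selection for comparison: the more active half opts in.
opted_in = prior_activity > np.median(prior_activity)

gap_random = (prior_activity[treatment_idx].mean()
              - prior_activity[control_idx].mean())
gap_self = prior_activity[opted_in].mean() - prior_activity[~opted_in].mean()

print(f"Confounder gap, random assignment: {gap_random:.2f}")
print(f"Confounder gap, self-selection:    {gap_self:.2f}")
```

With the gap close to zero under randomization, a later difference in outcomes is hard to blame on prior activity.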

2. Temporal Ordering: Cause must come before effect

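# Does A happen before B?
# If B happens first, A can't cause B

from datetime import datetime

# Example timestamps (placeholders; substitute your own event times)
timestamp_A = datetime(2024, 1, 1, 9, 0)
timestamp_B = datetime(2024, 1, 1, 9, 30)

# Check timestamps
if timestamp_A < timestamp_B:
    print("A could cause B (time-wise)")
else:
    print("A cannot cause B (A did not happen first)")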
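For time series rather than single events, one rough way to check ordering is to compare correlations at different lags. The sketch below is an illustration on simulated data: series `b` is built to follow series `a` by 3 steps, and `lagged_corr` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(7)
n, true_lag = 500, 3

# Simulated series: b echoes a three steps later, plus noise.
a = rng.normal(size=n)
b = np.empty(n)
b[:true_lag] = rng.normal(size=true_lag)
b[true_lag:] = a[:-true_lag] + rng.normal(0, 0.5, size=n - true_lag)


def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]; lag may be negative."""
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    if lag < 0:
        return np.corrcoef(x[-lag:], y[:lag])[0, 1]
    return np.corrcoef(x, y)[0, 1]


scores = {lag: lagged_corr(a, b, lag) for lag in range(-5, 6)}
best_lag = max(scores, key=lambda k: abs(scores[k]))

print(f"Strongest correlation at lag {best_lag}: r = {scores[best_lag]:.2f}")
```

A peak at a positive lag is consistent with A coming before B. It is still only a necessary condition: a third variable could drive both with different delays.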

3. Control for Confounders: Rule out hidden variables

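# Example: Does education cause higher income?
# Confounders: family background, intelligence, location

# Statistical control (regression)
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes df has columns: education, family_income, IQ, location, income
# and that 'location' is categorical, so it is one-hot encoded here
X = pd.get_dummies(
    df[['education', 'family_income', 'IQ', 'location']],
    columns=['location'], drop_first=True, dtype=float
)
y = df['income']

model = LinearRegression()
model.fit(X, y)

# Coefficient on education, holding the included confounders fixed
education_coef = dict(zip(X.columns, model.coef_))['education']
print(f"Education coefficient: {education_coef:.3f}")

# LinearRegression reports no p-values; use a statistics package
# if you need a significance test.
# If the education coefficient stays clearly non-zero after
# controlling for confounders → stronger case for causation
# (unmeasured confounders can still bias it)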
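To show what "controlling for" buys you, here is a self-contained sketch on simulated data where the true effect of education is set to 1.0 and family income confounds it. The data-generating numbers and the `ols` helper are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical data-generating process (all numbers are made up):
family_income = rng.normal(0, 1, n)                     # confounder
education = 0.8 * family_income + rng.normal(0, 1, n)   # influenced by it
income = 1.0 * education + 2.0 * family_income + rng.normal(0, 1, n)


def ols(features, target):
    """Least-squares coefficients; the intercept is the first entry."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


naive = ols([education], income)[1]                     # confounder omitted
adjusted = ols([education, family_income], income)[1]   # confounder included

print(f"Education coefficient, naive:    {naive:.2f}")
print(f"Education coefficient, adjusted: {adjusted:.2f}")
```

The naive estimate absorbs part of the family-income effect and lands well above 1.0, while the adjusted estimate should come back close to the true value. This only works for confounders you actually measured.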

Bradford Hill Criteria (Observational Studies)

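When experiments aren't possible, weigh the evidence against these nine criteria: strength of association, consistency, specificity, temporality, biological gradient (dose-response), plausibility, coherence, experiment, and analogy.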

Common Pitfalls

Confounding Variables

# Example: Coffee and heart disease
# Confounder: smoking (smokers tend to drink more coffee)
# Must control for smoking to see the true relationship

Reverse Causation

# Does depression cause poor sleep?
# Or does poor sleep cause depression?
# Could be bidirectional!

Selection Bias

# Gym memberships correlate with fitness
# But: fit people are more likely to join gyms in the first place
# Selection bias distorts the relationship
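A quick simulation sketch of the gym example, under the deliberately extreme assumption that the gym has no effect at all on fitness. The join probability formula is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical setup: fitness is fixed in advance; the gym changes nothing.
fitness = rng.normal(50, 10, size=n)

# Fitter people are more likely to join (logistic join probability).
join_prob = 1 / (1 + np.exp(-(fitness - 50) / 5))
is_member = rng.random(n) < join_prob

gap = fitness[is_member].mean() - fitness[~is_member].mean()
r = np.corrcoef(is_member.astype(float), fitness)[0, 1]

print(f"Members are fitter by {gap:.1f} points on average")
print(f"Correlation between membership and fitness: {r:.2f}")
```

Membership and fitness end up clearly correlated even though, by construction, joining the gym did nothing.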

Real-World Example: Does Exercise Cause Weight Loss?

# Observational study (correlation only):
# People who exercise weigh less
# But: Already-healthy people exercise more

# RCT study (tests causation):
# Randomly assign people to exercise or not
# Control diet and other factors
# Measure weight change
# Now we can claim causation!

# Python simulation
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed for reproducible results

# RCT simulation: weight change in lbs
control = rng.normal(0, 2, 100)      # No exercise, no change on average
treatment = rng.normal(-5, 2, 100)   # Exercise, -5 lbs average

t_stat, p_value = ttest_ind(control, treatment)

if p_value < 0.05:
    print("Groups differ significantly: evidence that exercise causes weight loss")
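A p-value says whether the groups differ, not by how much. As a rough companion sketch, the code below estimates the size of the effect with an approximate 95% confidence interval. It uses the same simulated setup, a normal approximation (the 1.96 multiplier), and its own seed, all of which are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(11)

# Same simulated RCT setup: weight change in lbs
control = rng.normal(0, 2, 100)      # no exercise
treatment = rng.normal(-5, 2, 100)   # exercise

# Estimated effect: difference in mean weight change
diff = treatment.mean() - control.mean()

# Standard error of a difference between two independent means
se = np.sqrt(control.var(ddof=1) / control.size
             + treatment.var(ddof=1) / treatment.size)

# Approximate 95% interval (normal approximation, fine for n = 100 per group)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Estimated effect: {diff:.2f} lbs "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An interval that stays below zero supports a real weight-loss effect, and its width tells you how precisely the effect was measured.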

Visual: Correlation Matrix

import seaborn as sns
import matplotlib.pyplot as plt

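# Correlation heatmap (df is the DataFrame built earlier)
# numeric_only=True skips text columns such as a location name
correlation_matrix = df.corr(numeric_only=True)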
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (not causation!)')
plt.show()

Red Flags for False Causation

Best Practices

Key Takeaways:

  • Correlation ≠ Causation (memorize this!)
  • Correlation shows a relationship, not its direction or mechanism
  • Establish causation with RCTs, temporal ordering, and control of confounders
  • Always look for hidden variables that could explain both
  • Be skeptical of causal claims without experimental evidence