Hypothesis Testing: A Step-by-Step Guide
Hypothesis testing determines whether observed differences in data are statistically significant or simply the result of random chance. It's fundamental to A/B testing, experiments, and data-driven decision making.
The 5 Steps of Hypothesis Testing
Step 1: State the Hypotheses
# Null Hypothesis (H0): No effect, no difference
# Alternative Hypothesis (H1): There is an effect/difference
Example: Testing new website design
H0: New design has same conversion rate as old design
H1: New design has different conversion rate
# Two-tailed: Different (higher OR lower)
# One-tailed: Specifically higher (or lower)
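A minimal sketch of how this choice maps to code, assuming SciPy 1.6 or newer where ttest_ind accepts an alternative argument (the numbers are just example values):
from scipy import stats
old = [23, 25, 27, 24, 26]  # old design
new = [28, 30, 32, 29, 31]  # new design
# Two-tailed: H1 is "different in either direction"
t_two, p_two = stats.ttest_ind(new, old, alternative="two-sided")
# One-tailed: H1 is "new design is higher than old"
t_one, p_one = stats.ttest_ind(new, old, alternative="greater")
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")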
Step 2: Choose Significance Level (α)
# Alpha = probability of rejecting H0 when it's true (Type I error)
# Common values: 0.05 (5%), 0.01 (1%)
alpha = 0.05 # 95% confidence level
# Interpretation: Accept 5% chance of false positive
Step 3: Select and Calculate Test Statistic
from scipy import stats
import numpy as np
# T-test for comparing two means
group_a = [23, 25, 27, 24, 26] # Old design conversions
group_b = [28, 30, 32, 29, 31] # New design conversions
# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")
Step 4: Determine P-Value
# P-value = probability of seeing results this extreme if H0 is true
# Low p-value = unlikely to occur by chance
if p_value < alpha:
    print(f"p-value ({p_value:.3f}) < alpha ({alpha})")
    print("Result is statistically significant")
else:
    print(f"p-value ({p_value:.3f}) >= alpha ({alpha})")
    print("Result is NOT statistically significant")
Step 5: Draw Conclusion
if p_value < alpha:
    print("Reject H0: New design has a significantly different conversion rate")
else:
    print("Fail to reject H0: No significant difference detected")
# Important: "Fail to reject" ≠ "Accept H0"
# We never prove H0 true, just fail to find evidence against it
Common Statistical Tests
T-Test (Compare Two Groups)
# One sample t-test (compare to known value)
stats.ttest_1samp(data, popmean=100)
# Independent t-test (two separate groups)
stats.ttest_ind(group1, group2)
# Paired t-test (before/after on same subjects)
stats.ttest_rel(before, after)
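For instance, a paired test on hypothetical before/after measurements from the same five subjects might look like this (values are made up for illustration):
from scipy import stats
before = [72, 75, 78, 71, 74]  # hypothetical scores before a change
after = [74, 78, 79, 73, 77]   # scores for the same subjects after the change
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")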
Z-Test (Large Samples, Known Variance)
from statsmodels.stats.weightstats import ztest
# For large samples (n > 30)
z_stat, p_value = ztest(group1, group2)
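A quick sketch with simulated large samples (assuming statsmodels is installed); ztest returns the z statistic and a two-sided p-value by default:
import numpy as np
from statsmodels.stats.weightstats import ztest
rng = np.random.default_rng(0)
group1 = rng.normal(100, 15, 500)  # simulated data, n > 30
group2 = rng.normal(103, 15, 500)
z_stat, p_value = ztest(group1, group2)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")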
Chi-Square Test (Categorical Data)
# Test relationship between categorical variables
from scipy.stats import chi2_contingency
# Contingency table
observed = [[10, 20], [30, 40]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.3f}")
Complete Example: A/B Test
import numpy as np
from scipy.stats import ttest_ind
# Scenario: Testing email subject lines
# Control: 1000 emails, 120 opens
# Variant: 1000 emails, 145 opens
control_rate = 120 / 1000 # 12%
variant_rate = 145 / 1000 # 14.5%
# Simulate data for test
np.random.seed(42)
control = np.random.binomial(1, control_rate, 1000)
variant = np.random.binomial(1, variant_rate, 1000)
# Run t-test
t_stat, p_value = ttest_ind(control, variant)
# Interpret
alpha = 0.05
print(f"Control open rate: {control_rate:.1%}")
print(f"Variant open rate: {variant_rate:.1%}")
print(f"Difference: {(variant_rate - control_rate):.1%}")
print(f"P-value: {p_value:.3f}")
if p_value < alpha:
    print("✓ Statistically significant - Use new subject line!")
else:
    print("✗ Not significant - Keep testing")
Understanding Errors
Type I Error (False Positive)
# Rejecting H0 when it's actually true
# Controlled by alpha (significance level)
# Example: Saying new design works when it doesn't
Type II Error (False Negative)
# Failing to reject H0 when it's actually false
# Probability = Beta (β)
# Power = 1 - β (ability to detect true effect)
# Example: Missing that new design actually works better
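One way to see Type I error in action is a small simulation: when H0 is true (both groups drawn from the same distribution), roughly alpha of the tests still come out "significant". A sketch:
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(42)
false_positives = 0
n_experiments = 1000
for _ in range(n_experiments):
    a = rng.normal(0, 1, 30)  # both groups from the same distribution (H0 is true)
    b = rng.normal(0, 1, 30)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1
print(f"False positive rate: {false_positives / n_experiments:.1%}")  # roughly 5%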
Power Analysis
from statsmodels.stats.power import TTestIndPower
# Calculate required sample size per group
effect_size = 0.3  # Expected difference in standard deviations (Cohen's d)
alpha = 0.05
power = 0.8  # 80% chance to detect the effect if it exists
sample_size = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f"Need {sample_size:.0f} samples per group")
Common Pitfalls
- P-hacking: Testing multiple hypotheses until finding significance
- Sample size: Too small = miss real effects
- Confusing practical vs statistical: Significant doesn't always mean important
- Multiple comparisons: apply a correction such as Bonferroni when running many tests (see the sketch below)
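As a sketch of the multiple-comparisons point above, statsmodels can apply a Bonferroni correction to a set of p-values (the p-values below are made up for illustration):
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values from 4 tests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(f"Corrected p-values: {p_corrected}")
print(f"Still significant after correction: {reject}")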
Best Practices
- Define hypotheses BEFORE collecting data
- Use appropriate test for your data type
- Check assumptions such as normality and equal variance (see the sketch after this list)
- Consider both statistical and practical significance
- Report effect size, not just p-value
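As a sketch of the assumption checks mentioned above, SciPy provides a Shapiro-Wilk test for normality and Levene's test for equal variances (example values reused from Step 3):
from scipy import stats
group1 = [23, 25, 27, 24, 26]
group2 = [28, 30, 32, 29, 31]
# Shapiro-Wilk: p < 0.05 suggests the data are not normally distributed
print(stats.shapiro(group1))
# Levene: p < 0.05 suggests the groups have unequal variances
stat, p = stats.levene(group1, group2)
print(f"Levene p = {p:.3f}")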
Pro Tip: A significant p-value doesn't mean the effect is large or important! Always report effect size and confidence intervals alongside p-values for complete context.
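A minimal sketch of reporting Cohen's d alongside a 95% confidence interval for the difference in means (formulas assume independent samples with pooled variance; data reused from Step 3):
import numpy as np
from scipy import stats
group1 = np.array([23, 25, 27, 24, 26])
group2 = np.array([28, 30, 32, 29, 31])
diff = group2.mean() - group1.mean()
# Pooled standard deviation for Cohen's d
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = diff / pooled_sd
# 95% CI for the difference in means (pooled-variance t interval)
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"Cohen's d = {cohens_d:.2f}, 95% CI for difference = ({ci_low:.2f}, {ci_high:.2f})")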