Hypothesis Testing: A Step-by-Step Guide
Hypothesis testing determines whether observed differences in data are statistically significant or simply the result of random chance. It's fundamental to A/B testing, experiments, and data-driven decision making.
The 5 Steps of Hypothesis Testing
Step 1: State the Hypotheses
# Null Hypothesis (H0): No effect, no difference
# Alternative Hypothesis (H1): There is an effect/difference
Example: Testing new website design
H0: New design has same conversion rate as old design
H1: New design has different conversion rate
# Two-tailed: Different (higher OR lower)
# One-tailed: Specifically higher (or lower)
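A minimal sketch of how this choice maps to code, assuming SciPy 1.6 or newer where ttest_ind accepts an alternative argument (the numbers are just example values):
from scipy import stats
old = [23, 25, 27, 24, 26]  # old design
new = [28, 30, 32, 29, 31]  # new design
# Two-tailed: H1 is "different in either direction"
t_two, p_two = stats.ttest_ind(new, old, alternative="two-sided")
# One-tailed: H1 is "new design is higher than old"
t_one, p_one = stats.ttest_ind(new, old, alternative="greater")
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")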
Step 2: Choose Significance Level (α)
# Alpha = probability of rejecting H0 when it's true (Type I error)
# Common values: 0.05 (5%), 0.01 (1%)
alpha = 0.05 # 95% confidence level
# Interpretation: Accept 5% chance of false positive
Step 3: Select and Calculate Test Statistic
from scipy import stats
import numpy as np
# T-test for comparing two means
group_a = [23, 25, 27, 24, 26] # Old design conversions
group_b = [28, 30, 32, 29, 31] # New design conversions
# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")
Step 4: Determine P-Value
# P-value = probability of seeing results this extreme if H0 is true
# Low p-value = unlikely to occur by chance
if p_value < alpha:
    print(f"p-value ({p_value:.3f}) < alpha ({alpha})")
    print("Result is statistically significant")
else:
    print(f"p-value ({p_value:.3f}) >= alpha ({alpha})")
    print("Result is NOT statistically significant")
Step 5: Draw Conclusion
if p_value < alpha:
    print("Reject H0: New design has a significantly different conversion rate")
else:
    print("Fail to reject H0: No significant difference detected")
# Important: "Fail to reject" ≠ "Accept H0"
# We never prove H0 true, just fail to find evidence against it
Common Statistical Tests
T-Test (Compare Two Groups)
# One sample t-test (compare to known value)
stats.ttest_1samp(data, popmean=100)
# Independent t-test (two separate groups)
stats.ttest_ind(group1, group2)
# Paired t-test (before/after on same subjects)
stats.ttest_rel(before, after)
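For instance, a paired test on hypothetical before/after measurements from the same five subjects might look like this (values are made up for illustration):
from scipy import stats
before = [72, 75, 78, 71, 74]  # hypothetical scores before a change
after = [74, 78, 79, 73, 77]   # scores for the same subjects after the change
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")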
Z-Test (Large Samples, Known Variance)
from statsmodels.stats.weightstats import ztest
# For large samples (n > 30)
z_stat, p_value = ztest(group1, group2)
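A quick sketch with simulated large samples (assuming statsmodels is installed); ztest returns the z statistic and a two-sided p-value by default:
import numpy as np
from statsmodels.stats.weightstats import ztest
rng = np.random.default_rng(0)
group1 = rng.normal(100, 15, 500)  # simulated data, n > 30
group2 = rng.normal(103, 15, 500)
z_stat, p_value = ztest(group1, group2)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")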
Chi-Square Test (Categorical Data)
# Test relationship between categorical variables
from scipy.stats import chi2_contingency
# Contingency table
observed = [[10, 20], [30, 40]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.3f}")
Complete Example: A/B Test
import numpy as np
from scipy.stats import ttest_ind
# Scenario: Testing email subject lines
# Control: 1000 emails, 120 opens
# Variant: 1000 emails, 145 opens
control_rate = 120 / 1000 # 12%
variant_rate = 145 / 1000 # 14.5%
# Simulate data for test
np.random.seed(42)
control = np.random.binomial(1, control_rate, 1000)
variant = np.random.binomial(1, variant_rate, 1000)
# Run t-test
t_stat, p_value = ttest_ind(control, variant)
# Interpret
alpha = 0.05
print(f"Control open rate: {control_rate:.1%}")
print(f"Variant open rate: {variant_rate:.1%}")
print(f"Difference: {(variant_rate - control_rate):.1%}")
print(f"P-value: {p_value:.3f}")
if p_value < alpha:
    print("✓ Statistically significant - Use new subject line!")
else:
    print("✗ Not significant - Keep testing")
Understanding Errors
Type I Error (False Positive)
# Rejecting H0 when it's actually true
# Controlled by alpha (significance level)
# Example: Saying new design works when it doesn't
Type II Error (False Negative)
# Failing to reject H0 when it's actually false
# Probability = Beta (β)
# Power = 1 - β (ability to detect true effect)
# Example: Missing that new design actually works better
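One way to see Type I error in action is a small simulation: when H0 is true (both groups drawn from the same distribution), roughly alpha of the tests still come out "significant". A sketch:
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(42)
false_positives = 0
n_experiments = 1000
for _ in range(n_experiments):
    a = rng.normal(0, 1, 30)  # both groups from the same distribution (H0 is true)
    b = rng.normal(0, 1, 30)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1
print(f"False positive rate: {false_positives / n_experiments:.1%}")  # roughly 5%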
Power Analysis
from statsmodels.stats.power import TTestIndPower
# Calculate required sample size per group
effect_size = 0.3  # Expected difference in standard deviations (Cohen's d)
alpha = 0.05
power = 0.8  # 80% chance to detect the effect if it exists
sample_size = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f"Need {sample_size:.0f} samples per group")
Common Pitfalls
- P-hacking: Testing multiple hypotheses until finding significance
- Sample size: Too small = miss real effects
- Confusing practical vs statistical: Significant doesn't always mean important
- Multiple comparisons: apply a correction such as Bonferroni when running many tests (see the sketch below)
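As a sketch of the multiple-comparisons point above, statsmodels can apply a Bonferroni correction to a set of p-values (the p-values below are made up for illustration):
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values from 4 tests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(f"Corrected p-values: {p_corrected}")
print(f"Still significant after correction: {reject}")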
Best Practices
- Define hypotheses BEFORE collecting data
- Use appropriate test for your data type
- Check assumptions such as normality and equal variance (see the sketch after this list)
- Consider both statistical and practical significance
- Report effect size, not just p-value
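As a sketch of the assumption checks mentioned above, SciPy provides a Shapiro-Wilk test for normality and Levene's test for equal variances (example values reused from Step 3):
from scipy import stats
group1 = [23, 25, 27, 24, 26]
group2 = [28, 30, 32, 29, 31]
# Shapiro-Wilk: p < 0.05 suggests the data are not normally distributed
print(stats.shapiro(group1))
# Levene: p < 0.05 suggests the groups have unequal variances
stat, p = stats.levene(group1, group2)
print(f"Levene p = {p:.3f}")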
Pro Tip: A significant p-value doesn't mean the effect is large or important! Always report effect size and confidence intervals alongside p-values for complete context.
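A minimal sketch of reporting Cohen's d alongside a 95% confidence interval for the difference in means (formulas assume independent samples with pooled variance; data reused from Step 3):
import numpy as np
from scipy import stats
group1 = np.array([23, 25, 27, 24, 26])
group2 = np.array([28, 30, 32, 29, 31])
diff = group2.mean() - group1.mean()
# Pooled standard deviation for Cohen's d
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = diff / pooled_sd
# 95% CI for the difference in means (pooled-variance t interval)
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"Cohen's d = {cohens_d:.2f}, 95% CI for difference = ({ci_low:.2f}, {ci_high:.2f})")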