A/B Testing Explained
What is A/B Testing?
A/B testing (also called split testing) compares two versions of something to see which performs better. You randomly assign users to version A (the control) or version B (the treatment), measure an outcome for each group, and then determine whether the observed difference is statistically significant.
Simple Example
Test two button colors on your website:
- Version A (control): Blue button → 50 clicks / 1000 visitors (5.0%)
- Version B (treatment): Red button → 65 clicks / 1000 visitors (6.5%)
- Question: Is red really better, or just random luck?
The A/B Testing Process
1. Define Hypothesis
# Null hypothesis (H0): No difference between A and B
# Alternative hypothesis (H1): B is better than A
# Example:
# H0: Red button has same click rate as blue button
# H1: Red button has higher click rate than blue button
2. Choose Metric
- Conversion rate: % who clicked/bought/signed up
- Revenue per user: Average $ per visitor
- Engagement: Time on site, pages viewed
- Primary metric: The single main goal you're optimizing; pick it before the test starts (a quick calculation sketch follows this list)
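To make this concrete, here is a minimal sketch of computing these metrics from raw per-visitor records. The visits list and its field names are made up for illustration; substitute whatever your analytics export actually provides.

# Hypothetical per-visitor records: whether each visitor converted, what they spent, pages viewed
visits = [
    {'converted': True,  'revenue': 25.0, 'pages_viewed': 4},
    {'converted': False, 'revenue': 0.0,  'pages_viewed': 2},
    {'converted': True,  'revenue': 40.0, 'pages_viewed': 7},
]

n = len(visits)
conversion_rate = sum(v['converted'] for v in visits) / n      # % who converted
revenue_per_user = sum(v['revenue'] for v in visits) / n       # average $ per visitor
avg_engagement = sum(v['pages_viewed'] for v in visits) / n    # pages viewed per visitor

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"Revenue per visitor: ${revenue_per_user:.2f}")
print(f"Average pages viewed: {avg_engagement:.1f}")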
3. Calculate Sample Size
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.05               # Current conversion rate (5%)
minimum_detectable_effect = 0.01   # Want to detect a 1% absolute increase (5% -> 6%)
alpha = 0.05                       # Significance level (5%)
power = 0.80                       # Power (80% chance to detect a real effect)

# Convert the two conversion rates into a standardized effect size (Cohen's h)
effect_size = proportion_effectsize(
    baseline_rate + minimum_detectable_effect,
    baseline_rate
)

# Calculate required sample size per group
n = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='larger'
)
print(f"Need {n:.0f} samples per group")
# Roughly 6,400 per group, or about 12,800 visitors in total
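As a sanity check on the library output, the classic closed-form approximation for comparing two proportions, n ≈ (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p2 - p1)^2, gives a number in the same ballpark. A quick sketch:

from scipy.stats import norm

p1, p2 = 0.05, 0.06                  # baseline and target conversion rates
z_alpha = norm.ppf(1 - 0.05)         # one-sided alpha = 0.05
z_beta = norm.ppf(0.80)              # power = 0.80

n_per_group = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"Approximate n per group: {n_per_group:.0f}")   # also in the ~6,400 range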
4. Run the Test
import random

def assign_variant(user_id):
    """Deterministically assign a user to A or B."""
    # Use a per-user RNG so we don't reseed the global random module
    rng = random.Random(user_id)
    return 'A' if rng.random() < 0.5 else 'B'

# Track results
results = {
    'A': {'visitors': 0, 'conversions': 0},
    'B': {'visitors': 0, 'conversions': 0}
}

# For each visitor (user_id and user_converted come from your own traffic data):
variant = assign_variant(user_id)
results[variant]['visitors'] += 1
if user_converted:
    results[variant]['conversions'] += 1
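To see the bookkeeping in action, you can simulate some traffic. The conversion rates below are invented purely for the simulation; in production these outcomes come from real visitors.

import random

sim_rng = random.Random(42)              # RNG for the simulation only
true_rates = {'A': 0.05, 'B': 0.065}     # invented "true" conversion rates

for user_id in range(10_000):
    variant = assign_variant(user_id)
    results[variant]['visitors'] += 1
    if sim_rng.random() < true_rates[variant]:   # simulate whether this visitor converts
        results[variant]['conversions'] += 1

print(results)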
5. Analyze Results
from scipy.stats import chi2_contingency

# Your data
a_conversions = 50
a_visitors = 1000
b_conversions = 65
b_visitors = 1000

# Contingency table: [conversions, non-conversions] for each variant
observed = [
    [a_conversions, a_visitors - a_conversions],
    [b_conversions, b_visitors - b_conversions]
]

# Chi-square test of independence (two-sided; Yates' correction is applied for 2x2 tables)
chi2, p_value, dof, expected = chi2_contingency(observed)

# Calculate conversion rates
a_rate = a_conversions / a_visitors
b_rate = b_conversions / b_visitors
lift = (b_rate - a_rate) / a_rate * 100

print(f"A conversion rate: {a_rate:.1%}")
print(f"B conversion rate: {b_rate:.1%}")
print(f"Relative lift: {lift:.1f}%")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("✅ Statistically significant! B wins.")
else:
    print("❌ Not significant. Could be random chance.")
Statistical Significance
# Common significance thresholds:
# p < 0.05 → 5% significance level (the common standard)
# p < 0.01 → 1% significance level (more stringent)
# p < 0.10 → 10% significance level (more lenient)

# Example interpretation:
if p_value < 0.05:
    print("If A and B truly performed the same, a difference this large")
    print("would show up less than 5% of the time by chance.")
    print("Reject the null hypothesis: B is significantly different from A.")
Calculating Confidence Intervals
import numpy as np
import scipy.stats as stats

def proportion_ci(successes, trials, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    rate = successes / trials
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(rate * (1 - rate) / trials)
    margin = z * se
    return (rate - margin, rate + margin)

# Example
a_ci = proportion_ci(50, 1000)
b_ci = proportion_ci(65, 1000)
print(f"A: {a_ci[0]:.1%} to {a_ci[1]:.1%}")
print(f"B: {b_ci[0]:.1%} to {b_ci[1]:.1%}")

# If the intervals don't overlap, the difference is significant.
# Overlapping intervals, however, do NOT automatically mean there is no difference.
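A more direct check than eyeballing overlap is to put a confidence interval on the difference in conversion rates itself. A minimal sketch using the same normal approximation:

def diff_ci(s1, n1, s2, n2, confidence=0.95):
    """Normal-approximation CI for the difference of two proportions (p2 - p1)."""
    p1, p2 = s1 / n1, s2 / n2
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return (diff - z * se, diff + z * se)

low, high = diff_ci(50, 1000, 65, 1000)
print(f"Difference (B - A): {low:.1%} to {high:.1%}")
# If the interval contains 0, the difference is not significant at this confidence level.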
Common Pitfalls
1. Stopping Too Early (Peeking)
# BAD: Check results every day, stop when p < 0.05
# This inflates false positive rate!
# GOOD: Calculate sample size upfront, wait until reached
target_sample_size = 10000
if current_sample_size >= target_sample_size:
    analyze_results()  # Only check when the planned sample size is reached
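To see why peeking is dangerous, here is a small simulation of A/A tests (both variants identical, so every "significant" result is a false positive). The look schedule, conversion rate, and number of simulations are arbitrary choices for illustration.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
true_rate = 0.05                          # both variants identical → any "win" is a false positive
looks = [2000, 4000, 6000, 8000, 10000]   # interim sample sizes per group

false_positives = 0
n_simulations = 1000
for _ in range(n_simulations):
    a = rng.random(looks[-1]) < true_rate
    b = rng.random(looks[-1]) < true_rate
    for n in looks:                       # peek at each interim look, stop at first p < 0.05
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, p, _, _ = chi2_contingency(table)
        if p < 0.05:
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_simulations:.1%}")
# Usually noticeably higher than the nominal 5%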
2. Not Accounting for Multiple Comparisons
# Testing 10 variants? Need Bonferroni correction
num_tests = 10
adjusted_alpha = 0.05 / num_tests # 0.005
# Use adjusted_alpha as significance threshold
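statsmodels can apply Bonferroni (and less conservative corrections such as Holm) for you; the p-values below are made up to show the call.

from statsmodels.stats.multitest import multipletests

p_values = [0.04, 0.01, 0.20, 0.03]   # hypothetical p-values from several comparisons
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(reject)        # which comparisons survive the correction
print(p_adjusted)    # Bonferroni-adjusted p-values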
3. Ignoring Sample Ratio Mismatch
# Check if A and B got equal traffic
expected_ratio = 0.5
actual_ratio = a_visitors / (a_visitors + b_visitors)
if abs(actual_ratio - expected_ratio) > 0.02:
    print("⚠️ Sample ratio mismatch! Check randomization.")
4. Changing Test Mid-Flight
- Don't change variant design during test
- Don't change success metric mid-test
- Don't extend test duration to achieve significance
Advanced: Sequential Testing
# Alternative to fixed-sample testing
# Lets you check results repeatedly without inflating the false positive rate
# Common approaches:
# - Sequential Probability Ratio Test (SPRT)
# - Group sequential designs with alpha-spending (e.g., O'Brien-Fleming, Pocock)
# - "Always-valid" inference used by several commercial testing platforms
# Benefit: The test can stop early when a clear winner emerges
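To illustrate the idea only, here is a minimal Wald SPRT for a single stream of yes/no outcomes, deciding between a baseline rate p0 and an improved rate p1. Real sequential A/B platforms use more elaborate methods; p0, p1, the error rates, and the simulated data are all assumptions for this sketch.

import math
import random

def sprt(outcomes, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT: decide between H0 (rate p0) and H1 (rate p1) as data arrives."""
    upper = math.log((1 - beta) / alpha)   # accept H1 when the log-likelihood ratio exceeds this
    lower = math.log(beta / (1 - alpha))   # accept H0 when it falls below this
    llr = 0.0
    for i, converted in enumerate(outcomes, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return 'accept H1 (improved rate)', i
        if llr <= lower:
            return 'accept H0 (baseline rate)', i
    return 'no decision yet', len(outcomes)

sim = random.Random(1)
simulated = [sim.random() < 0.06 for _ in range(50_000)]  # simulated visitors at the improved rate
print(sprt(simulated))   # typically decides after a few thousand observations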
Real-World Example
from scipy.stats import chi2_contingency

# E-commerce checkout test
test_data = {
    'control': {
        'visitors': 5000,
        'purchases': 250,
        'revenue': 12500
    },
    'treatment': {
        'visitors': 5000,
        'purchases': 285,
        'revenue': 14250
    }
}

# Calculate metrics
for variant in ['control', 'treatment']:
    data = test_data[variant]
    conv_rate = data['purchases'] / data['visitors']
    revenue_per_visitor = data['revenue'] / data['visitors']
    print(f"{variant.title()}:")
    print(f"  Conversion rate: {conv_rate:.2%}")
    print(f"  Revenue per visitor: ${revenue_per_visitor:.2f}")

# Statistical test
observed = [
    [250, 4750],   # control: conversions, non-conversions
    [285, 4715]    # treatment: conversions, non-conversions
]
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"\nP-value: {p_value:.4f}")
if p_value < 0.05:
    print("Winner: Treatment!")
    lift = (285/5000 - 250/5000) / (250/5000) * 100
    print(f"Lift: {lift:.1f}%")
else:
    print("No significant winner at the 5% level — keep testing or accept the null.")
When to Use A/B Testing
- ✅ Website/app changes (buttons, copy, layouts)
- ✅ Email campaigns (subject lines, send times)
- ✅ Pricing strategies
- ✅ Product features
- ✅ Marketing campaigns
Best Practices
- One change at a time: Test single variable
- Calculate sample size upfront: Don't guess
- Run for full weeks: Account for day-of-week effects
- Randomize properly: Truly random assignment
- Wait for the planned sample size: Don't peek early or stop at the first significant p-value
- Consider practical significance: Is 0.1% lift worth it?
- Document everything: Hypothesis, metrics, results
Tools for A/B Testing
- Google Optimize: Formerly free and integrated with Analytics; sunset by Google in 2023
- Optimizely: Enterprise platform
- VWO: Visual editor, easy setup
- Python libraries: scipy, statsmodels
- R: pwr package for power analysis
Key Takeaways:
- Randomized assignment is what lets A/B tests support causal conclusions
- Calculate required sample size before starting
- p < 0.05: Common threshold for significance
- Don't peek at results early (inflates false positives)
- Always report effect size, not just p-value
- One variable at a time for clear results