A/B Testing Explained
What is A/B Testing?
A/B testing (also called split testing) compares two versions of something to see which performs better. You randomly assign users to version A (the control) or version B (the treatment), measure an outcome for each group, and then determine whether the observed difference is statistically significant.
Simple Example
Test two button colors on your website:
- Version A (control): Blue button → 50 clicks / 1000 visitors (5.0%)
- Version B (treatment): Red button → 65 clicks / 1000 visitors (6.5%)
- Question: Is red really better, or just random luck?
The A/B Testing Process
1. Define Hypothesis
# Null hypothesis (H0): No difference between A and B
# Alternative hypothesis (H1): B is better than A
# Example:
# H0: Red button has same click rate as blue button
# H1: Red button has higher click rate than blue button
2. Choose Metric
- Conversion rate: % who clicked/bought/signed up
- Revenue per user: Average $ per visitor
- Engagement: Time on site, pages viewed
- Primary metric: The single main goal you're optimizing; pick it before the test starts (a quick calculation sketch follows this list)
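To make this concrete, here is a minimal sketch of computing these metrics from raw per-visitor records. The visits list and its field names are made up for illustration; substitute whatever your analytics export actually provides.

# Hypothetical per-visitor records: whether each visitor converted, what they spent, pages viewed
visits = [
    {'converted': True,  'revenue': 25.0, 'pages_viewed': 4},
    {'converted': False, 'revenue': 0.0,  'pages_viewed': 2},
    {'converted': True,  'revenue': 40.0, 'pages_viewed': 7},
]

n = len(visits)
conversion_rate = sum(v['converted'] for v in visits) / n      # % who converted
revenue_per_user = sum(v['revenue'] for v in visits) / n       # average $ per visitor
avg_engagement = sum(v['pages_viewed'] for v in visits) / n    # pages viewed per visitor

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"Revenue per visitor: ${revenue_per_user:.2f}")
print(f"Average pages viewed: {avg_engagement:.1f}")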
3. Calculate Sample Size
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.05               # Current conversion rate (5%)
minimum_detectable_effect = 0.01   # Want to detect a 1% absolute increase (5% -> 6%)
alpha = 0.05                       # Significance level (5%)
power = 0.80                       # Power (80% chance to detect a real effect)

# Convert the two conversion rates into a standardized effect size (Cohen's h)
effect_size = proportion_effectsize(
    baseline_rate + minimum_detectable_effect,
    baseline_rate
)

# Calculate required sample size per group
n = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='larger'
)
print(f"Need {n:.0f} samples per group")
# Roughly 6,400 per group, or about 12,800 visitors in total
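As a sanity check on the library output, the classic closed-form approximation for comparing two proportions, n ≈ (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p2 - p1)^2, gives a number in the same ballpark. A quick sketch:

from scipy.stats import norm

p1, p2 = 0.05, 0.06                  # baseline and target conversion rates
z_alpha = norm.ppf(1 - 0.05)         # one-sided alpha = 0.05
z_beta = norm.ppf(0.80)              # power = 0.80

n_per_group = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"Approximate n per group: {n_per_group:.0f}")   # also in the ~6,400 range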
4. Run the Test
import random

def assign_variant(user_id):
    """Deterministically assign a user to A or B."""
    # Use a per-user RNG so we don't reseed the global random module
    rng = random.Random(user_id)
    return 'A' if rng.random() < 0.5 else 'B'

# Track results
results = {
    'A': {'visitors': 0, 'conversions': 0},
    'B': {'visitors': 0, 'conversions': 0}
}

# For each visitor (user_id and user_converted come from your own traffic data):
variant = assign_variant(user_id)
results[variant]['visitors'] += 1
if user_converted:
    results[variant]['conversions'] += 1
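To see the bookkeeping in action, you can simulate some traffic. The conversion rates below are invented purely for the simulation; in production these outcomes come from real visitors.

import random

sim_rng = random.Random(42)              # RNG for the simulation only
true_rates = {'A': 0.05, 'B': 0.065}     # invented "true" conversion rates

for user_id in range(10_000):
    variant = assign_variant(user_id)
    results[variant]['visitors'] += 1
    if sim_rng.random() < true_rates[variant]:   # simulate whether this visitor converts
        results[variant]['conversions'] += 1

print(results)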
5. Analyze Results
from scipy.stats import chi2_contingency

# Your data
a_conversions = 50
a_visitors = 1000
b_conversions = 65
b_visitors = 1000

# Contingency table: [conversions, non-conversions] for each variant
observed = [
    [a_conversions, a_visitors - a_conversions],
    [b_conversions, b_visitors - b_conversions]
]

# Chi-square test of independence (two-sided; Yates' correction is applied for 2x2 tables)
chi2, p_value, dof, expected = chi2_contingency(observed)

# Calculate conversion rates
a_rate = a_conversions / a_visitors
b_rate = b_conversions / b_visitors
lift = (b_rate - a_rate) / a_rate * 100

print(f"A conversion rate: {a_rate:.1%}")
print(f"B conversion rate: {b_rate:.1%}")
print(f"Relative lift: {lift:.1f}%")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("✅ Statistically significant! B wins.")
else:
    print("❌ Not significant. Could be random chance.")
Statistical Significance
# Common significance thresholds:
# p < 0.05 → 5% significance level (the common standard)
# p < 0.01 → 1% significance level (more stringent)
# p < 0.10 → 10% significance level (more lenient)

# Example interpretation:
if p_value < 0.05:
    print("If A and B truly performed the same, a difference this large")
    print("would show up less than 5% of the time by chance.")
    print("Reject the null hypothesis: B is significantly different from A.")
Calculating Confidence Intervals
import numpy as np
import scipy.stats as stats

def proportion_ci(successes, trials, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    rate = successes / trials
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(rate * (1 - rate) / trials)
    margin = z * se
    return (rate - margin, rate + margin)

# Example
a_ci = proportion_ci(50, 1000)
b_ci = proportion_ci(65, 1000)
print(f"A: {a_ci[0]:.1%} to {a_ci[1]:.1%}")
print(f"B: {b_ci[0]:.1%} to {b_ci[1]:.1%}")

# If the intervals don't overlap, the difference is significant.
# Overlapping intervals, however, do NOT automatically mean there is no difference.
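A more direct check than eyeballing overlap is to put a confidence interval on the difference in conversion rates itself. A minimal sketch using the same normal approximation:

def diff_ci(s1, n1, s2, n2, confidence=0.95):
    """Normal-approximation CI for the difference of two proportions (p2 - p1)."""
    p1, p2 = s1 / n1, s2 / n2
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return (diff - z * se, diff + z * se)

low, high = diff_ci(50, 1000, 65, 1000)
print(f"Difference (B - A): {low:.1%} to {high:.1%}")
# If the interval contains 0, the difference is not significant at this confidence level.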
Common Pitfalls
1. Stopping Too Early (Peeking)
# BAD: Check results every day, stop when p < 0.05
# This inflates false positive rate!
# GOOD: Calculate sample size upfront, wait until reached
target_sample_size = 10000
if current_sample_size >= target_sample_size:
    analyze_results()  # Only check when the planned sample size is reached
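To see why peeking is dangerous, here is a small simulation of A/A tests (both variants identical, so every "significant" result is a false positive). The look schedule, conversion rate, and number of simulations are arbitrary choices for illustration.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
true_rate = 0.05                          # both variants identical → any "win" is a false positive
looks = [2000, 4000, 6000, 8000, 10000]   # interim sample sizes per group

false_positives = 0
n_simulations = 1000
for _ in range(n_simulations):
    a = rng.random(looks[-1]) < true_rate
    b = rng.random(looks[-1]) < true_rate
    for n in looks:                       # peek at each interim look, stop at first p < 0.05
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, p, _, _ = chi2_contingency(table)
        if p < 0.05:
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_simulations:.1%}")
# Usually noticeably higher than the nominal 5%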
2. Not Accounting for Multiple Comparisons
# Testing 10 variants? Need Bonferroni correction
num_tests = 10
adjusted_alpha = 0.05 / num_tests # 0.005
# Use adjusted_alpha as significance threshold
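statsmodels can apply Bonferroni (and less conservative corrections such as Holm) for you; the p-values below are made up to show the call.

from statsmodels.stats.multitest import multipletests

p_values = [0.04, 0.01, 0.20, 0.03]   # hypothetical p-values from several comparisons
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(reject)        # which comparisons survive the correction
print(p_adjusted)    # Bonferroni-adjusted p-values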
3. Ignoring Sample Ratio Mismatch
# Check if A and B got equal traffic
expected_ratio = 0.5
actual_ratio = a_visitors / (a_visitors + b_visitors)
if abs(actual_ratio - expected_ratio) > 0.02:
    print("⚠️ Sample ratio mismatch! Check randomization.")
4. Changing Test Mid-Flight
- Don't change variant design during test
- Don't change success metric mid-test
- Don't extend test duration to achieve significance
Advanced: Sequential Testing
# Alternative to fixed-sample testing
# Lets you check results repeatedly without inflating the false positive rate
# Common approaches:
# - Sequential Probability Ratio Test (SPRT)
# - Group sequential designs with alpha-spending (e.g., O'Brien-Fleming, Pocock)
# - "Always-valid" inference used by several commercial testing platforms
# Benefit: The test can stop early when a clear winner emerges
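To illustrate the idea only, here is a minimal Wald SPRT for a single stream of yes/no outcomes, deciding between a baseline rate p0 and an improved rate p1. Real sequential A/B platforms use more elaborate methods; p0, p1, the error rates, and the simulated data are all assumptions for this sketch.

import math
import random

def sprt(outcomes, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT: decide between H0 (rate p0) and H1 (rate p1) as data arrives."""
    upper = math.log((1 - beta) / alpha)   # accept H1 when the log-likelihood ratio exceeds this
    lower = math.log(beta / (1 - alpha))   # accept H0 when it falls below this
    llr = 0.0
    for i, converted in enumerate(outcomes, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return 'accept H1 (improved rate)', i
        if llr <= lower:
            return 'accept H0 (baseline rate)', i
    return 'no decision yet', len(outcomes)

sim = random.Random(1)
simulated = [sim.random() < 0.06 for _ in range(50_000)]  # simulated visitors at the improved rate
print(sprt(simulated))   # typically decides after a few thousand observations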
Real-World Example
from scipy.stats import chi2_contingency

# E-commerce checkout test
test_data = {
    'control': {
        'visitors': 5000,
        'purchases': 250,
        'revenue': 12500
    },
    'treatment': {
        'visitors': 5000,
        'purchases': 285,
        'revenue': 14250
    }
}

# Calculate metrics
for variant in ['control', 'treatment']:
    data = test_data[variant]
    conv_rate = data['purchases'] / data['visitors']
    revenue_per_visitor = data['revenue'] / data['visitors']
    print(f"{variant.title()}:")
    print(f"  Conversion rate: {conv_rate:.2%}")
    print(f"  Revenue per visitor: ${revenue_per_visitor:.2f}")

# Statistical test
observed = [
    [250, 4750],   # control: conversions, non-conversions
    [285, 4715]    # treatment: conversions, non-conversions
]
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"\nP-value: {p_value:.4f}")
if p_value < 0.05:
    print("Winner: Treatment!")
    lift = (285/5000 - 250/5000) / (250/5000) * 100
    print(f"Lift: {lift:.1f}%")
else:
    print("No significant winner at the 5% level — keep testing or accept the null.")
When to Use A/B Testing
- ✅ Website/app changes (buttons, copy, layouts)
- ✅ Email campaigns (subject lines, send times)
- ✅ Pricing strategies
- ✅ Product features
- ✅ Marketing campaigns
Best Practices
- One change at a time: Test single variable
- Calculate sample size upfront: Don't guess
- Run for full weeks: Account for day-of-week effects
- Randomize properly: Truly random assignment
- Wait for the planned sample size: Don't peek early or stop at the first significant p-value
- Consider practical significance: Is 0.1% lift worth it?
- Document everything: Hypothesis, metrics, results
Tools for A/B Testing
- Google Optimize: Formerly free and integrated with Analytics; sunset by Google in 2023
- Optimizely: Enterprise platform
- VWO: Visual editor, easy setup
- Python libraries: scipy, statsmodels
- R: pwr package for power analysis
Key Takeaways:
- Randomized assignment is what lets A/B tests support causal conclusions
- Calculate required sample size before starting
- p < 0.05: Common threshold for significance
- Don't peek at results early (inflates false positives)
- Always report effect size, not just p-value
- One variable at a time for clear results