Do Overlapping Confidence Intervals Mean No Significant Difference?
No. Two 95% confidence intervals can overlap while the difference between the groups is still statistically significant at p < 0.05. This is one of the most common statistical misreadings on dashboards, and it usually leads teams to "let's not roll out the change" when the data actually supports rolling it out. The right move is to put a CI on the difference itself and check whether that CI excludes zero.
Why the Intuition Is Wrong
"Non-overlapping 95% CIs" does imply a significant difference. The converse doesn't hold: two 95% CIs can overlap by a bit — roughly up to a quarter of the full CI length (about half of one arm) — and the test for the difference can still be significant.
The reason is that the standard error of a difference is not the sum of the two individual standard errors. It's:
SE(diff) = sqrt(SE_A² + SE_B²)
That square-root combination is smaller than SE_A + SE_B. The CI for the difference is therefore narrower than the visual gap between the two individual CIs would suggest.
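A quick arithmetic check of that quadrature rule (a minimal sketch; the SE values here are arbitrary):

```python
import math

se_a, se_b = 2.0, 2.0  # arbitrary example standard errors

se_diff = math.sqrt(se_a**2 + se_b**2)  # quadrature combination
naive_sum = se_a + se_b                 # what the eyeball test implicitly assumes

print(f"SE(diff)    = {se_diff:.3f}")   # 2.828
print(f"SE_A + SE_B = {naive_sum:.3f}") # 4.000
```

The gap between the two bars behaves as if the uncertainty were SE_A + SE_B, but the actual test only has to beat the smaller quadrature value.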
A Worked Numeric Example
Two groups, equal sample sizes, normal data:
Group A: mean = 100, SE = 2.0 → 95% CI = [96.1, 103.9]
Group B: mean = 105, SE = 2.0 → 95% CI = [101.1, 108.9]
The intervals overlap from 101.1 to 103.9. Many readers conclude "not significant."
Difference test:
diff = 5.0
SE(diff) = sqrt(2² + 2²) = 2.83
95% CI for diff = 5 ± 1.96 × 2.83 = [-0.55, 10.55] → crosses zero: borderline, not significant (p ≈ 0.08)
Now shrink the standard errors — say larger samples give SE = 1.5 each instead:
CI_A = [97.1, 102.9], CI_B = [102.1, 107.9] (still overlap from 102.1-102.9)
SE(diff) = sqrt(1.5² + 1.5²) = 2.12
CI for diff = 5 ± 1.96 × 2.12 = [0.85, 9.15] → excludes zero, p < 0.05 — significant!
The CIs overlapped, but the difference was significant. The "do the bars touch" eyeball test would have called this a tie.
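Both scenarios can be reproduced in a few lines of pure arithmetic (no assumptions beyond the normal approximation already used above; the tiny differences in the last digit come from the text rounding SE(diff) to 2.83 and 2.12 first):

```python
import math

def diff_ci(mean_a, mean_b, se_a, se_b, z=1.96):
    """95% CI for the difference of two independent means (B minus A)."""
    diff = mean_b - mean_a
    se = math.sqrt(se_a**2 + se_b**2)  # quadrature, not a plain sum
    return diff - z * se, diff + z * se

# Scenario 1: SE = 2.0 each -> CI crosses zero (borderline, not significant)
lo, hi = diff_ci(100, 105, 2.0, 2.0)
print(f"SE=2.0: [{lo:.2f}, {hi:.2f}]")  # [-0.54, 10.54]

# Scenario 2: SE = 1.5 each -> CI excludes zero (significant), even though
# the individual 95% CIs still overlap from 102.1 to 102.9
lo, hi = diff_ci(100, 105, 1.5, 1.5)
print(f"SE=1.5: [{lo:.2f}, {hi:.2f}]")  # [0.84, 9.16]
```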
The 83% Rule (Approximate but Useful)
If both estimates come from samples with similar standard errors, checking whether two ~83.5% individual CIs overlap is roughly equivalent to a p ≈ 0.05 test on the difference — not 95% CIs. An 83.5% CI's half-width is about 1.39 standard errors instead of 1.96.
In practice this means: if you see two 95% CIs whose overlap is less than about 25% of the full length of one interval (about half of one arm), the difference is probably significant at p < 0.05. More than that, probably not. But this is a rule of thumb — it doesn't replace the actual test on the difference.
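The 1.39 figure and the overlap bound can both be derived directly, assuming equal SEs in the two groups (the rule's assumption). Note the exact bound works out to ~29% of the full CI length, which the rule of thumb rounds down to "about a quarter":

```python
import math

se = 1.0  # equal standard errors in both groups

# Smallest difference significant at p = 0.05:
min_sig_diff = 1.96 * math.sqrt(2) * se        # ~2.77 * SE

# Individual CIs that just touch at that difference have this half-width:
half_width = min_sig_diff / 2                  # ~1.39 * SE
# Coverage of a z-interval with that half-width: 2*Phi(z) - 1 = erf(z/sqrt(2))
level = math.erf(half_width / math.sqrt(2))
print(f"half-width = {half_width:.2f} SE, CI level = {level:.1%}")  # 1.39 SE, 83.4%

# Maximum overlap of two 95% CIs that is still significant,
# as a fraction of one full CI length (2 * 1.96 * SE):
max_overlap = 2 * 1.96 * se - min_sig_diff
print(f"overlap fraction = {max_overlap / (2 * 1.96 * se):.1%}")    # 29.3%
```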
What to Plot Instead
Skip the side-by-side CIs. Plot the difference directly with its own CI:
Variant A vs Control: diff = +5.2%, 95% CI = [+1.1%, +9.3%] ← significant
Variant B vs Control: diff = +1.8%, 95% CI = [-2.4%, +6.0%] ← not significant
This is unambiguous: if the CI for the difference excludes zero, the difference is significant at the matching alpha. No overlap-eyeballing required.
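For conversion-rate comparisons like the ones above, the difference CI comes from the standard two-proportion (Wald) formula. A minimal sketch — the conversion counts and sample sizes here are made up for illustration, not taken from the example:

```python
import math

def lift_ci(conv_c, n_c, conv_t, n_t, z=1.96):
    """Wald 95% CI for the difference of two conversion rates (treatment - control)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical counts: control converts 500/5000 (10%), variant 600/5000 (12%)
lo, hi = lift_ci(500, 5000, 600, 5000)
print(f"lift = +2.0pp, 95% CI = [{lo:+.2%}, {hi:+.2%}]")  # [+0.77%, +3.23%]
# CI excludes zero -> significant lift at p < 0.05
```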
When Overlap Does Mean "Not Significant"
If the overlap is large — say, the two means each fall inside the other's CI — the difference is almost certainly not significant at 95%. The misconception only bites in the middle range, where there's some overlap but it's small.
Code: CI of the Difference in Python
from scipy import stats
import numpy as np
a = np.array([...]) # group A observations
b = np.array([...]) # group B observations
# Welch's t-test (unequal variances) is the safe default
result = stats.ttest_ind(a, b, equal_var=False)
# 95% CI for the difference of means
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))
df = result.df # scipy >= 1.10 returns df
crit = stats.t.ppf(0.975, df)
ci = (diff - crit*se, diff + crit*se)
print(f"diff={diff:.3f}, 95% CI={ci}, p={result.pvalue:.4f}")
Common Pitfalls
- Reporting "no difference" because bars touch: the most common version of this mistake. Always test the difference, not the visuals.
- Using a two-sample test on paired data: for before/after on the same users, use a paired test — its SE is much smaller and its power much greater.
- Multiple comparisons: running 10 difference tests at α=0.05 gives a ~40% chance of at least one false positive. Apply a Bonferroni or FDR correction.
- Equating "not significant" with "no effect": a CI like [-2%, +12%] doesn't show no effect — it shows you don't have enough data to rule one out.
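The ~40% figure in the multiple-comparisons pitfall is easy to verify, and a Bonferroni threshold brings it back under control (pure arithmetic, assuming independent tests):

```python
# Probability of at least one false positive across k independent tests at level alpha
k, alpha = 10, 0.05
fwer = 1 - (1 - alpha) ** k
print(f"uncorrected: {fwer:.1%}")       # 40.1%

# Bonferroni: run each of the k comparisons at alpha / k instead
alpha_bonf = alpha / k
fwer_bonf = 1 - (1 - alpha_bonf) ** k
print(f"Bonferroni:  {fwer_bonf:.1%}")  # 4.9%
```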
Pro Tip: Build your A/B-testing dashboards to display the CI of the lift (B vs A), not two separate metric CIs. It removes the overlap question entirely and matches how product teams actually decide: "is the lift bounded away from zero?"