How do you use this tool?
- Enter variants — visitors and conversions per arm. Minimum A + B; up to five variants supported.
- Pick confidence 90/95/99%. One-sided only when the direction of the hypothesis was fixed in advance.
- Choose engine: Frequentist gives the p-value, Bayesian shows P(B>A), Peek-Safe (mSPRT) handles tests you have peeked at.
- Heed realism-gate warnings — n<100/variant or zero conversions makes the Z-test unreliable.
- Sample-Size tab plans the test ahead; copy the permalink to share via URL hash without a server.
What does this A/B test significance calculator measure?
You drop two conversion counts into the calculator — visitors and conversions per variant — and get back whether the difference is statistically significant. But “statistically significant” alone is not enough in 2026. This calculator gives four views on the same data:
- p-value (frequentist) — the classical answer: how unlikely is this result if both variants were equally good? A result below α=0.05 counts as significant.
- P(B > A) (Bayesian) — the more direct answer: how likely is it that variant B is genuinely better? At 96% the call is clear, even when p sits at the frequentist borderline.
- Always-valid p (mSPRT) — the Peek-Safe value for anyone who looked at the running test more than once. Never smaller than the naive Z-test p-value, often more realistic.
- Wilson-score confidence interval — a range for the true Δ-rate, built so that 95% of such intervals across repeated tests contain the true value. A CI that includes zero is the honest version of “not significant”.
With three or more variants a multi-variant table appears automatically — pairwise vs control, with Bonferroni and Holm corrections. The realism-gate banner warns on tiny samples (n<100/variant), low power, MDE above 50%, or sample-ratio mismatch (χ² test on 50/50 split).
Frequentist or Bayesian — which one when?
Frequentist and Bayesian answer different questions. Understanding the distinction leads to better decisions.
The frequentist p-value answers: “Assuming both variants were equally good, how likely would a result at least as extreme as the observed one be?” The threshold α (usually 5%) is the false-positive rate you are willing to accept. A p of 0.03 does NOT mean variant B is better with 97% probability; that is a common misinterpretation.
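As a minimal sketch of the frequentist view, the pooled two-proportion Z-test can be written in a few lines. The function name and the example counts are illustrative assumptions, not the calculator's actual code.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(visitors_a, conv_a, visitors_b, conv_b):
    """Return (z, two-sided p) for H0: both variants share one conversion rate."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Under H0 both arms are pooled into a single rate estimate.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        raise ValueError("standard error is zero; the Z statistic is undefined")
    z = (rate_b - rate_a) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical data: 5.0% vs 5.6% conversion on 10,000 visitors each.
z, p = two_proportion_z_test(10_000, 500, 10_000, 560)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts the p-value lands just above 0.05, which is exactly the borderline situation where a second view on the data is useful.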
The Bayesian posterior P(B > A) answers the question most stakeholders are actually asking:
“How likely is it that B is better?” Using a Beta distribution and Monte-Carlo sampling (50,000
samples, deterministically seeded), we compute the posterior from a uniform Beta(1,1) prior plus
the observed data. The number is directly interpretable.
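The Monte-Carlo step described above can be sketched as follows. The Beta(1,1) prior and the 50,000 samples come from the text; the seed value, the function name, and the example counts are assumptions made for illustration.

```python
import random


def prob_b_beats_a(visitors_a, conv_a, visitors_b, conv_b,
                   samples=50_000, seed=42):
    """Estimate P(B > A) by sampling both Beta posteriors."""
    rng = random.Random(seed)  # fixed seed: same input, same output
    wins = 0
    for _ in range(samples):
        # Posterior = Beta(1 + conversions, 1 + non-conversions)
        # for a uniform Beta(1, 1) prior on Bernoulli data.
        draw_a = rng.betavariate(1 + conv_a, 1 + visitors_a - conv_a)
        draw_b = rng.betavariate(1 + conv_b, 1 + visitors_b - conv_b)
        if draw_b > draw_a:
            wins += 1
    return wins / samples


# Hypothetical data: 5.0% vs 5.6% conversion on 10,000 visitors each.
p_b_better = prob_b_beats_a(10_000, 500, 10_000, 560)
print(f"P(B > A) = {p_b_better:.3f}")
```

Seeding the generator is what makes the reported number reproducible across page loads for the same input.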
In practice: when one variant dominates, both views agree. At the borderline (p≈0.05) the Bayesian
view helps enormously: when P(B > A) sits at 95% the case is clear; at 75% the data is
ambivalent. Reading the Bayesian view alongside protects against the most common p-value
misinterpretation.
What is Peek-Safe mSPRT and why do I need it?
Repeatedly peeking at a running A/B test without corrected statistics is one of the most common mistakes in product testing. Anyone who checks every day and stops at the first p<0.05 does NOT have a 5% false-positive rate — they have 25–50%, depending on how often they look.
The phenomenon is the “Sequential Testing Problem” and has been known since the 1940s. The modern fix comes from the paper behind Optimizely's Stats Engine, Johari et al., “Always Valid Inference” (arXiv:1512.04922): the mixture Sequential Probability Ratio Test (mSPRT) gives an always-valid p-value that stays valid under ANY stopping rule. You may peek whenever and as often as you like.
The trade-off: mSPRT is more conservative than the naive Z-test — you need slightly more data for the same significance call. In exchange you get honest results. Toggle “Peek-Safe (mSPRT)” up top and the calculator switches — showing both the always-valid-p and the naive p side by side so the difference is visible. Rule of thumb: looked at the running test more than twice? Use mSPRT.
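A rough sketch of how an always-valid p-value can be tracked across peeks, assuming a Normal approximation for the difference in rates and a Normal(0, τ²) mixing distribution. The value of `tau`, the function name, and the example looks are assumptions; the calculator's own parameters are not documented here.

```python
from math import exp, log


def msprt_p_values(looks, tau=0.01):
    """looks: cumulative (visitors_a, conv_a, visitors_b, conv_b), one per peek.

    Returns the always-valid p-value after each peek.
    """
    p_always = 1.0
    history = []
    for visitors_a, conv_a, visitors_b, conv_b in looks:
        rate_a = conv_a / visitors_a
        rate_b = conv_b / visitors_b
        # Estimated variance of the difference in rates (unpooled).
        var = (rate_a * (1 - rate_a) / visitors_a
               + rate_b * (1 - rate_b) / visitors_b)
        if var > 0:
            diff = rate_b - rate_a
            # Log of the mixture likelihood ratio against H0: diff = 0.
            log_lr = (0.5 * log(var / (var + tau ** 2))
                      + diff ** 2 * tau ** 2 / (2 * var * (var + tau ** 2)))
            # The p-value is the running minimum of 1 / likelihood ratio.
            p_always = min(p_always, exp(-log_lr))
        history.append(p_always)
    return history


# Hypothetical test that was peeked at three times.
peeks = [(1_000, 50, 1_000, 58),
         (5_000, 250, 5_000, 285),
         (10_000, 500, 10_000, 560)]
p_history = msprt_p_values(peeks)
print([round(p, 3) for p in p_history])
```

Because the value is a running minimum, it can only go down as data accumulates, and an early peek never makes a later reading invalid.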
Realism gate — when should I not even compute?
Statistics can produce very precise numbers for very wrong models. Four standard cases where the math does not match reality — the calculator surfaces a banner in each:
- Sample < 100/variant: the Normal approximation of the Z-test is unreliable. At n=50 the reported p-value can be off by orders of magnitude.
- Power < 80%: a non-significant result is uninformative when the sample was always too small for the hoped-for effect size. Use the Sample-Size tab.
- MDE above 50% relative: anyone hunting a +50% lift is hunting a miracle. Realistic A/B test effects are +1% to +20%; everything above is suspect.
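- Conversion rate = 0: with zero conversions in a variant the Normal approximation behind the Z-test breaks down, and with zero conversions in both variants the pooled Z statistic is undefined (0/0). The Wilson-score CI still gives an upper bound; gather more data.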
Plus: with two variants the calculator runs a χ² test on the 50/50 split (critical value ≈ 10.83 at α=0.001). If it fires, you have a sample-ratio mismatch — fix randomisation before believing the p-value.
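The sample-ratio check is simple enough to sketch directly. The critical value 10.83 is the one quoted above; the function name and the example visitor counts are illustrative assumptions.

```python
def sample_ratio_mismatch(visitors_a, visitors_b, critical=10.83):
    """Chi-squared goodness-of-fit test against an intended 50/50 split."""
    expected = (visitors_a + visitors_b) / 2
    chi2 = ((visitors_a - expected) ** 2
            + (visitors_b - expected) ** 2) / expected
    return chi2, chi2 > critical


# Hypothetical split: 10,000 vs 10,500 visitors.
chi2, mismatch = sample_ratio_mismatch(10_000, 10_500)
print(f"chi2 = {chi2:.2f}, mismatch = {mismatch}")
```

A 10,000 vs 10,500 split looks harmless at a glance but already exceeds the threshold, which is why the check runs automatically.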
When do you use Bonferroni versus Holm correction?
Anyone testing three or four variants at once easily forgets that the family-wise error rate climbs. With three comparisons at α=0.05, treated as independent, you have a roughly 14% chance (1 − 0.95³ ≈ 0.143) of a false-positive finding somewhere in the family, even if all variants were equally good.
Bonferroni correction divides α by the number of comparisons. With three tests, α=0.0167 per
comparison. Very conservative and very simple. Holm-Bonferroni is uniformly more powerful at the
same FWER control — sort p-values ascending and check stepwise against α/m, α/(m-1),
…, α/1. The first non-significant comparison blocks all subsequent ones.
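Both procedures fit in a few lines. The sketch below is illustrative only: the function names and the three example p-values are assumptions, chosen so the two methods disagree.

```python
def bonferroni(p_values, alpha=0.05):
    """Each comparison is tested against alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: smallest p vs alpha/m, next vs alpha/(m-1), and so on."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # the first failure blocks all larger p-values
        significant[i] = True
    return significant


# Family-wise error rate for three independent tests at alpha = 0.05.
fwer = 1 - (1 - 0.05) ** 3

# Hypothetical pairwise p-values for three variants vs control.
p_values = [0.012, 0.020, 0.040]
bonf_result = bonferroni(p_values)
holm_result = holm(p_values)
print(bonf_result, holm_result)
```

With these example values Bonferroni keeps only the first comparison, while Holm keeps all three, because each sorted p-value clears its own stepwise threshold (0.0167, 0.025, 0.05).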
The calculator shows both corrections side by side so you can see which comparison survives under which method. Rule of thumb: more than two comparisons → Bonferroni is the minimum, Holm is the default because it is uniformly stronger.
How do you read the Wilson-score confidence interval?
The 95% confidence interval around the Δ-rate shows the band in which the true difference between variants A and B is expected to lie: across repeated tests, 95% of such intervals cover the true difference. The Wilson-score method is more robust than the naive Normal approximation, especially with small samples or extreme rates (near 0 or 1). We compute a Wilson interval for each single proportion and combine the two via a Newcombe-style approximation for the difference.
Practically: when the CI lies entirely above zero, variant B is demonstrably better. When the CI includes zero, the effect is unclear — could be zero, could be positive, could be negative. With a point lift of +2 pp and CI [−0.5 pp, +4.5 pp] the right answer is “keep collecting”, not “roll out”. The CI is the more honest form of the significance statement.
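One common way to build such an interval is the Wilson score interval per variant, combined with Newcombe's hybrid score method for the difference. The sketch below assumes that construction; the function names and example counts are illustrative, and the calculator's exact variant may differ.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conversions, visitors, confidence=0.95):
    """Wilson score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    rate = conversions / visitors
    denom = 1 + z ** 2 / visitors
    centre = (rate + z ** 2 / (2 * visitors)) / denom
    half = z * sqrt(rate * (1 - rate) / visitors
                    + z ** 2 / (4 * visitors ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_difference(visitors_a, conv_a, visitors_b, conv_b,
                        confidence=0.95):
    """Interval for rate_b - rate_a built from the two Wilson intervals."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    low_a, high_a = wilson_interval(conv_a, visitors_a, confidence)
    low_b, high_b = wilson_interval(conv_b, visitors_b, confidence)
    diff = rate_b - rate_a
    lower = diff - sqrt((rate_b - low_b) ** 2 + (high_a - rate_a) ** 2)
    upper = diff + sqrt((high_b - rate_b) ** 2 + (rate_a - low_a) ** 2)
    return lower, upper


# Hypothetical data: 5.0% vs 5.6% conversion on 10,000 visitors each.
ci_low, ci_high = newcombe_difference(10_000, 500, 10_000, 560)
print(f"diff CI: [{ci_low * 100:.2f} pp, {ci_high * 100:.2f} pp]")

# Zero conversions still yield a usable upper bound.
zero_low, zero_high = wilson_interval(0, 100)
```

For the example counts the interval straddles zero, so the reading is “keep collecting”. The zero-conversion case shows why the Wilson interval remains useful where the Z-test is not.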
What is deliberately not built?
- No multi-armed bandits / Thompson sampling — that is platform territory. Dynamic traffic reallocation needs a testing system, not a calculator.
- No survival curves or Poisson means — evanmiller.org covers the long-tail tests well; we stay with Bernoulli data (two proportions).
- No “test duration in days” output with traffic estimator — that depends too much on the seasonality of your product to be useful.
- No account / save features — permalink is enough. Anyone needing to persist tests should use a real test-management tool.
- No CUPED / stratified sampling — belongs in the testing platform, not in a calculator.
Where to read more
- Wikipedia — Two-proportion Z-test article — the math backbone
- Wikipedia — Beta distribution — the Bayesian posterior for Bernoulli data
- Johari et al., “Always Valid Inference” — the original mSPRT paper
- Wikipedia — Bonferroni correction — the multiple-testing standard
- Holm-Bonferroni method — step-down procedure
- Sample-Ratio Mismatch explained — the most important pre-flight check on every A/B read