How do you calculate A/B test significance?

You calculate significance by running a [two-proportion Z-test](https://en.wikipedia.org/wiki/Test_statistic) on the conversion rates of the two variants. Under the null hypothesis of equal rates, the pooled-variance Z-statistic is converted to a p-value via the standard normal CDF. If p is below α (typically 0.05), the difference is statistically significant. This calculator returns p in milliseconds and shows alongside it the Bayesian probability that variant B is better, plus a Wilson-score confidence interval on the absolute difference of rates.
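As a rough illustration of the pooled test described above, here is a minimal Python sketch. The function name `two_proportion_z_test` and its argument names are hypothetical, chosen for this example; it assumes a two-sided test and is not taken from the calculator's own source.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both arms share one rate, estimated by pooling.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the Z-statistic is undefined")
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With 500 versus 560 conversions on 10,000 visitors each, the p-value lands just above 0.05, which is exactly the borderline situation the later questions deal with.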

What is the difference between frequentist and Bayesian A/B testing?

Frequentist asks: 'How unlikely is this result if both variants were equally good?' Bayesian asks: 'How likely is it that variant B is actually better?' The p-value does not tell you the probability that your hypothesis is true; the [posterior `P(B > A)` from a Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) does. We show both side by side to protect against misinterpretation. With small samples the Bayesian view is often the more honest answer because the Normal approximation behind the frequentist Z-test becomes unreliable.

What is Peek-Safe mode with mSPRT?

Peek-Safe means you can check the test multiple times during the run without inflating the false-positive rate. Repeated peeking and stopping at the first p<0.05 is one of the most common A/B-testing mistakes. The naive Z-test is valid only for a single look at a pre-planned sample size; peeking multiple times produces far more false positives than the nominal 5%. [Always-Valid Inference (Johari et al., arXiv:1512.04922)](https://arxiv.org/abs/1512.04922) introduces the mixture Sequential Probability Ratio Test (mSPRT), which yields an always-valid p-value that stays valid under any stopping rule. Toggle Peek-Safe on if you have looked at the test more than twice.

How large does my A/B test sample need to be?

The required sample size depends on four numbers: baseline rate, MDE (Minimum Detectable Effect), power (typically 80%), and significance level α (typically 5%). At baseline 5%, relative MDE +20% (5% → 6%), power 80%, and two-sided α=0.05 you need roughly 8,200 visitors per variant. The Sample-Size tab in this calculator does the math; industry benchmarks (e-commerce, B2B SaaS, newsletter) are pre-set. Rule of thumb: for baseline rates below 1% you typically need six- to seven-figure visitor counts per arm.
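The four inputs can be combined with the standard normal-approximation formula for comparing two proportions. The sketch below is one common variant (two-sided α, unpooled variance); the function name `sample_size_per_variant` is hypothetical, and other calculators use slightly different formulas, so their numbers can differ.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Visitors needed per arm to detect a relative lift (two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate under the hoped-for lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the per-arm Bernoulli variances (unpooled form).
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


print(sample_size_per_variant(baseline=0.05, relative_mde=0.20))
```

Because the absolute difference enters squared in the denominator, halving the MDE roughly quadruples the required sample.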

What is the Bonferroni correction in multi-variant A/B tests?

Running three or four variants against control at the same time is a multiple-testing problem. With three comparisons at α=0.05, the family-wise error rate jumps to about 14% if you do not correct. [Bonferroni](https://en.wikipedia.org/wiki/Bonferroni_correction) corrects conservatively by dividing α by the number of comparisons — for three tests that means α=0.0167 per comparison. The [Holm-Bonferroni method](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method) is uniformly more powerful at the same FWER control. The calculator surfaces both corrections directly under the multi-variant output.

What is a sample-ratio mismatch (SRM)?

Sample-ratio mismatch means your traffic split is not 50/50 when it should be. With a clean randomisation both arms of an A/B test should receive roughly the same visitor count — minor deviations are normal, large differences are an alarm. SRM commonly happens when a bug skews bucket assignment, a caching layer drops the cookie for bots, or a conversion-pixel race condition makes visitors fall out of tracking. The realism gate runs a χ² test on a 50/50 split at α=0.001 and warns the moment that fails. Stop the test, fix randomisation, then re-read — mandatory per the [SRM Cheat Sheet at Seer Interactive](https://www.seerinteractive.com/insights/sample-ratio-mismatch-srm-explanation).

Does my conversion data leave the browser?

No. Conversion numbers are business-sensitive: competitors would love to know how many buyers you convert per million visitors. This calculator never sends a single request to a server. You can verify this by opening DevTools (F12), selecting the Network tab with the All filter, and typing in your numbers: no POST, no WebSocket, nothing. The Bayesian sampling uses a seeded pseudo-random-number generator with no global randomness and no time-of-day input. Permalink sharing rides in the URL hash only, never in server storage. The hash is built and read entirely client-side.

What should I do when the p-value is borderline (p ≈ 0.05)?

When p is around 0.05, collect more data rather than stopping early. The p-value has no sharp threshold: 0.049 and 0.051 are statistically indistinguishable. Read the Bayesian output in parallel: if `P(B > A)` is above 95% the case is clear; if it is only 80% the data is ambiguous. If you have looked at the test multiple times, switch to Peek-Safe immediately, because the naive Z-test is no longer valid. Practical rule: run to the pre-computed sample size; never stop just because the p-value dipped below 0.05.

A/B Test Significance — Bayes & Peek-Safe mSPRT

What does this A/B test significance calculator measure?

You drop two conversion counts into the calculator — visitors and conversions per variant — and get back whether the difference is statistically significant. But “statistically significant” alone is not enough in 2026. This calculator gives four views on the same data:

p-value (frequentist) — the classical answer: how unlikely is this result if both variants were equally good? A result below α=0.05 counts as significant.
P(B > A) (Bayesian) — the more direct answer: how likely is it that variant B is genuinely better? At 96% the call is clear, even when p sits at the frequentist borderline.
Always-valid p (mSPRT) — the Peek-Safe value for anyone who looked at the running test more than once. Never smaller than the naive Z-test, often more realistic.
Wilson-score confidence interval — the band for the true Δ-rate, built so that it covers the true value in 95% of repeated tests. A CI that includes zero is the honest version of “not significant”.

With three or more variants a multi-variant table appears automatically — pairwise vs control, with Bonferroni and Holm corrections. The realism-gate banner warns on tiny samples (n<100/variant), low power, MDE above 50%, or sample-ratio mismatch (χ² test on 50/50 split).

Frequentist or Bayesian — which one when?

Frequentist and Bayesian answer different questions. Understanding the distinction leads to better decisions.

The frequentist p-value answers: “Assuming both variants were equally good, how unlikely would this observed result (or a more extreme one) be?” The threshold α (usually 5%) is the false-positive rate you are willing to accept. A p of 0.03 does NOT mean variant B is better with 97% probability; that is a common misinterpretation.

The Bayesian posterior P(B > A) answers the question most stakeholders are actually asking: “How likely is it that B is better?” Using a Beta distribution and Monte-Carlo sampling (50,000 samples, deterministically seeded), we compute the posterior from a uniform Beta(1,1) prior plus the observed data. The number is directly interpretable.
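A minimal sketch of that Monte-Carlo estimate, assuming a Beta(1,1) prior on each arm and Python's standard-library generator. The function name `prob_b_beats_a` and the seed value are illustrative assumptions; the calculator's own generator and seed are not documented here.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=50_000, seed=42):
    """Estimate P(rate_B > rate_A) from Beta posteriors with a Beta(1,1) prior."""
    rng = random.Random(seed)  # local, seeded generator: same inputs, same output
    wins = 0
    for _ in range(samples):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples


print(prob_b_beats_a(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000))
```

Using a local `random.Random(seed)` instance rather than the module-level functions keeps the result reproducible regardless of what else in the program draws random numbers.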

In practice: when one variant dominates, both views agree. At the borderline (p≈0.05) the Bayesian view adds the most: when P(B > A) sits at 95% the case is clear; at 75% the data is ambiguous. Reading the Bayesian view alongside protects against the most common p-value misinterpretation.

What is Peek-Safe mSPRT and why do I need it?

Repeatedly peeking at a running A/B test without corrected statistics is one of the most common mistakes in product testing. Anyone who checks every day and stops at the first p<0.05 does NOT have a 5% false-positive rate; the real rate can reach 25–50%, depending on how often they look.

The phenomenon is the “Sequential Testing Problem” and has been known since the 1940s. The modern fix comes from the paper behind the Optimizely Stats Engine, Johari et al., “Always Valid Inference” (arXiv:1512.04922): the mixture Sequential Probability Ratio Test (mSPRT) gives an always-valid p-value that stays valid under ANY stopping rule. You may peek whenever and as often as you like.

The trade-off: mSPRT is more conservative than the naive Z-test — you need slightly more data for the same significance call. In exchange you get honest results. Toggle “Peek-Safe (mSPRT)” up top and the calculator switches — showing both the always-valid-p and the naive p side by side so the difference is visible. Rule of thumb: looked at the running test more than twice? Use mSPRT.
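To make the idea concrete, here is a sketch of a normal-mixture mSPRT for the difference of two conversion rates, using the closed-form mixture likelihood ratio for Gaussian data with a plug-in variance. Several things here are assumptions for illustration and not the calculator's documented implementation: the function name `msprt_p_value`, the mixing scale `tau`, and the simplification that both arms have roughly equal size.

```python
from math import exp, log


def msprt_p_value(conv_a, n_a, conv_b, n_b, tau=0.01, previous_p=1.0):
    """Always-valid p-value for H0: rate_B - rate_A = 0 (normal-mixture sketch)."""
    n = min(n_a, n_b)  # treat the data as n paired observations (assumes ~equal arms)
    if n == 0:
        return previous_p
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Plug-in variance of one paired difference (sum of Bernoulli variances).
    v = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    if v == 0:
        return previous_p  # no variation observed yet, so no evidence either way
    theta_hat = rate_b - rate_a
    mix = n * tau ** 2
    # Log of the mixture likelihood ratio with a N(0, tau^2) mixing distribution.
    log_lr = 0.5 * log(v / (v + mix)) + n * mix * theta_hat ** 2 / (2 * v * (v + mix))
    # The always-valid p-value is the running minimum of 1 / likelihood ratio.
    return min(previous_p, exp(-log_lr))


p = 1.0
for conv_a, n_a, conv_b, n_b in [(250, 5_000, 290, 5_000), (500, 10_000, 560, 10_000)]:
    p = msprt_p_value(conv_a, n_a, conv_b, n_b, previous_p=p)
print(p)
```

Passing the previous value back in as `previous_p` at every look is what makes the p-value non-increasing over the run. The choice of `tau` affects how quickly the test reacts to effects of a given size, so treat the default here as a placeholder.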

Realism gate — when should I not even compute?

Statistics can produce very precise numbers for very wrong models. Four standard cases where the math does not match reality — the calculator surfaces a banner in each:

Sample < 100/variant: the Normal approximation of the Z-test is unreliable. At n=50 the reported p-value can be far from the exact one, especially at low conversion rates.
Power < 80%: a non-significant result is uninformative when the sample was always too small for the hoped-for effect size. Use the Sample-Size tab.
MDE above 50% relative: anyone hunting a +50% lift is hunting a miracle. Realistic A/B test effects are +1% to +20%; everything above is suspect.
Conversion rate = 0: the pooled Z-test is undefined when both variants have zero conversions (the standard error is zero) and unreliable when only one does. The Wilson-score CI still gives an upper bound; gather more data.

Plus: with two variants the calculator runs a χ² test on the 50/50 split (critical value ≈ 10.83 at α=0.001). If it fires, you have a sample-ratio mismatch — fix randomisation before believing the p-value.
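The SRM check is a one-degree-of-freedom goodness-of-fit test, which can be sketched without any statistics library because the χ² tail probability with one degree of freedom reduces to the complementary error function. The function name `srm_check` is hypothetical.

```python
from math import erfc, sqrt


def srm_check(n_a, n_b, alpha=0.001):
    """Chi-square test of observed arm sizes against an intended 50/50 split."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("no visitors recorded")
    expected = total / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


print(srm_check(5_000, 5_100))  # small imbalance
print(srm_check(5_000, 5_400))  # large imbalance
```

The third element of the returned tuple is the mismatch flag; it is equivalent to comparing the statistic against the critical value of about 10.83.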

When do you use Bonferroni versus Holm correction?

When testing three or four variants at once, it is easy to forget that the family-wise error rate climbs. With three comparisons at α=0.05 you have a roughly 14% chance of a false-positive finding somewhere in the family, even if all variants were equally good.

Bonferroni correction divides α by the number of comparisons. With three tests, α=0.0167 per comparison. Very conservative and very simple. Holm-Bonferroni is uniformly more powerful at the same FWER control — sort p-values ascending and check stepwise against α/m, α/(m-1), …, α/1. The first non-significant comparison blocks all subsequent ones.
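Both procedures fit in a few lines. The sketch below uses hypothetical function names and made-up p-values purely for illustration; the family-wise error rate helper assumes independent comparisons.

```python
def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent comparisons."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..., alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # the first non-significant comparison blocks all later ones
    return rejected


p_values = [0.012, 0.020, 0.300]  # illustrative p-values, three variants vs control
print(family_wise_error_rate(3))
print(bonferroni(p_values))
print(holm(p_values))
```

In this example the second comparison (p = 0.020) fails the Bonferroni threshold of 0.0167 but passes Holm's second-step threshold of 0.025, which is the extra power the paragraph above refers to.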

The calculator shows both corrections side by side so you can see which comparison survives under which method. Rule of thumb: with more than one comparison, Bonferroni is the minimum and Holm is the default because it is uniformly more powerful.

How do you read the Wilson-score confidence interval?

The 95% confidence interval around the Δ-rate shows the band in which the true difference between variants A and B lies, with 95% coverage across repeated tests. The Wilson-score method is more robust than the naive Normal approximation, especially with small samples or extreme rates (near 0 or 1). We compute a Wilson interval for each single proportion and combine the two via a Newcombe-style approximation for the difference.

Practically: when the CI lies entirely above zero, variant B is demonstrably better. When the CI includes zero, the effect is unclear — could be zero, could be positive, could be negative. With a point lift of +2 pp and CI [−0.5 pp, +4.5 pp] the right answer is “keep collecting”, not “roll out”. The CI is the more honest form of the significance statement.
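For readers who want to reproduce such an interval, here is a sketch of the Wilson-score interval for one proportion and Newcombe's hybrid-score combination for the difference. The function names are hypothetical, and the calculator's exact "Newcombe-style" variant may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conversions, n, confidence=0.95):
    """Wilson-score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = conversions / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_diff_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Interval for rate_B - rate_A built from the two Wilson intervals."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lo_a, hi_a = wilson_interval(conv_a, n_a, confidence)
    lo_b, hi_b = wilson_interval(conv_b, n_b, confidence)
    diff = rate_b - rate_a
    lower = diff - sqrt((rate_b - lo_b) ** 2 + (hi_a - rate_a) ** 2)
    upper = diff + sqrt((hi_b - rate_b) ** 2 + (rate_a - lo_a) ** 2)
    return lower, upper


print(wilson_interval(0, 100))
print(newcombe_diff_interval(500, 10_000, 560, 10_000))
```

The first call shows why the Wilson interval is useful with zero conversions: it still yields a non-trivial upper bound, where the naive Normal interval collapses to a single point at zero.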

What is deliberately not built?

No multi-armed bandits / Thompson sampling — that is platform territory. Dynamic traffic reallocation needs a testing system, not a calculator.
No survival curves or Poisson means — evanmiller.org covers the long-tail tests well; we stay with Bernoulli data (two proportions).
No “test duration in days” output with traffic estimator — that depends too much on the seasonality of your product to be useful.
No account / save features — permalink is enough. Anyone needing to persist tests should use a real test-management tool.
No CUPED / stratified sampling — belongs in the testing platform, not in a calculator.

Where to read more

Wikipedia — Two-proportion Z-test article — the math backbone
Wikipedia — Beta distribution — the Bayesian posterior for Bernoulli data
Johari et al., “Always Valid Inference” — the original mSPRT paper
Wikipedia — Bonferroni correction — the multiple-testing standard
Holm-Bonferroni method — step-down procedure
Sample-Ratio Mismatch explained — the most important pre-flight check on every A/B read
