TEST CALCULATOR

A/B Test Significance — Bayes & Peek-Safe mSPRT

Frequentist + Bayesian side by side. Peek-Safe mode for multiple looks. Bonferroni and Holm for multi-variant tests. Realism gate against zero-conversion nonsense.

Runs locally in the browser — conversion data is never uploaded.

Variants

At least two variants. Up to five arms for multi-variant tests — Bonferroni and Holm corrections appear automatically.

A: rate 8.00%
B: rate 10.00%

Settings

Engine

Frequentist and Bayesian appear side by side; Peek-Safe switches to an always-valid p-value when you have looked at the test multiple times.

Confidence

Two-sided is the default. Use one-sided only if you fixed the direction of the hypothesis in advance.

Demos

Click-to-load examples — each demo includes a teaching takeaway.

A vs B result

Not significant
p-value: 0.1181
P(B > A): 94.0%
Δ rate: 2.00 pp
Lift (relative): 25.0%
95% CI on Δ rate: [-0.53 pp, 4.52 pp]

Permalink carries inputs in the URL hash — no server, no account, no data leakage.

Privacy

All calculations run directly in your browser. No data is sent to any server.

An A/B test calculator that shows the frequentist p-value AND the Bayesian posterior probability `P(B > A)` side by side — with a Peek-Safe mode for tests you have looked at multiple times. Realism gate warns on tiny samples, MDE above 50%, or sample-ratio mismatch. Multi-variant tests get Bonferroni and Holm corrections inline. Conversion data is business-sensitive and never leaves the browser.

01 — How to Use

How do you use this tool?

  1. Enter variants — visitors and conversions per arm. Minimum A + B; up to five variants supported.
  2. Pick confidence 90/95/99%. One-sided only when the direction of the hypothesis was fixed in advance.
  3. Choose engine: Frequentist gives the p-value, Bayesian shows P(B>A), Peek-Safe (mSPRT) handles tests you have peeked at.
  4. Heed realism-gate warnings — n<100/variant or zero conversions makes the Z-test unreliable.
  5. Sample-Size tab plans the test ahead; copy the permalink to share via URL hash without a server.

What does this A/B test significance calculator measure?

You drop two conversion counts into the calculator — visitors and conversions per variant — and get back whether the difference is statistically significant. But “statistically significant” alone is not enough in 2026. This calculator gives four views on the same data:

  • p-value (frequentist): the classical answer. How unlikely is this result if both variants were equally good? A result below α=0.05 counts as significant.
  • P(B > A) (Bayesian): the more direct answer. How likely is it that variant B is genuinely better? At 96% the call is clear, even when p sits at the frequentist borderline.
  • Always-valid p (mSPRT): the Peek-Safe value for anyone who has looked at the running test more than once. Never smaller than the naive Z-test p-value, often more realistic.
  • Wilson-score confidence interval: the range of Δ-rate values consistent with the data, constructed so that it covers the true difference in 95% of repeated tests. A CI that includes zero is the honest version of “not significant”.

With three or more variants a multi-variant table appears automatically — pairwise vs control, with Bonferroni and Holm corrections. The realism-gate banner warns on tiny samples (n<100/variant), low power, MDE above 50%, or sample-ratio mismatch (χ² test on 50/50 split).

Frequentist or Bayesian — which one when?

Frequentist and Bayesian answer different questions. Understanding the distinction leads to better decisions.

The frequentist p-value answers: “Assuming both variants were equally good, how unlikely would this observed result (or a more extreme one) be?” The threshold α (usually 5%) is the false-positive rate you are willing to accept. A p of 0.03 does NOT mean variant B is better with 97% probability; that is a common misinterpretation.
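As a rough sketch of the frequentist side, the pooled two-proportion Z-test below reproduces the p-value shown in the result panel. The inputs are an assumption: the page only displays the rates (8% and 10%), and 1,000 visitors per arm is the sample size that matches the displayed 0.1181. The function name is hypothetical, not the calculator's own code.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled Z-test for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both arms share one rate, so pool the data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed inputs: 80/1000 vs 100/1000 (rates 8% and 10%).
z, p = two_proportion_z_test(80, 1000, 100, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")  # p rounds to 0.1181
```

The p-value is the probability of a |z| at least this large when both arms truly convert at the same rate, which is why it cannot be read as the probability that B is better.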

The Bayesian posterior P(B > A) answers the question most stakeholders are actually asking: “How likely is it that B is better?” Using a Beta distribution and Monte-Carlo sampling (50,000 samples, deterministically seeded), we compute the posterior from a uniform Beta(1,1) prior plus the observed data. The number is directly interpretable.
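A minimal sketch of that Monte-Carlo step, using only Python's standard library. The seed value, the function name, and the inputs (80/1000 vs 100/1000, chosen to match the displayed rates) are assumptions; the calculator's actual sampler and seed are not documented here.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=50_000, seed=42):
    """Estimate P(B > A) from Beta posteriors under a uniform Beta(1,1) prior."""
    rng = random.Random(seed)  # fixed seed makes the estimate reproducible
    # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
    a_alpha, a_beta = 1 + conv_a, 1 + n_a - conv_a
    b_alpha, b_beta = 1 + conv_b, 1 + n_b - conv_b
    wins = sum(
        rng.betavariate(b_alpha, b_beta) > rng.betavariate(a_alpha, a_beta)
        for _ in range(samples)
    )
    return wins / samples


p_b_better = prob_b_beats_a(80, 1000, 100, 1000)
print(f"P(B > A) = {p_b_better:.1%}")  # close to the 94% shown in the result panel
```

With 50,000 draws the Monte-Carlo error is about 0.1 percentage points, so different seeds move the last displayed digit at most.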

In practice: when one variant dominates, both views agree. Near the borderline the Bayesian view helps with interpretation. With a uniform prior and samples in the hundreds or more, P(B > A) is roughly 1 − p/2 for a two-sided p, so p≈0.05 corresponds to about 97%, and a P(B > A) of 75% means the data is ambivalent. Reading the Bayesian view alongside the p-value protects against the most common p-value misinterpretation.

What is Peek-Safe mSPRT and why do I need it?

Repeatedly peeking at a running A/B test without corrected statistics is one of the most common mistakes in product testing. Anyone who checks every day and stops at the first p<0.05 does NOT have a 5% false-positive rate: depending on how often they look, the rate can climb to 25–50%.
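The inflation is easy to see in a small A/A simulation, where both arms share the same true rate and every "significant" result is a false positive. This is an illustrative sketch: the number of looks, visitors per look, base rate, and seed are all assumptions, and the exact figure changes with them.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(looks, visitors_per_look, rate=0.10,
                                alpha=0.05, simulations=500, seed=1):
    """Share of A/A tests declared significant when stopping at the first p < alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        for _look in range(looks):
            n += visitors_per_look
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(pooled * (1 - pooled) * 2 / n)
            # Naive Z-test at every look, stop as soon as it crosses the threshold.
            if se > 0 and abs(conv_b - conv_a) / n / se > z_crit:
                false_positives += 1
                break
    return false_positives / simulations


# Same final sample (2,000 per arm), analysed once versus after every 100 visitors.
single = peeking_false_positive_rate(looks=1, visitors_per_look=2000)
daily = peeking_false_positive_rate(looks=20, visitors_per_look=100)
print(f"one look: {single:.1%}, twenty looks: {daily:.1%}")
```

The single-look run should sit near the nominal 5%, while the twenty-look run typically lands several times higher even though the final sample is identical.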

The phenomenon is the “Sequential Testing Problem” and has been known since the 1940s. The modern fix comes from the Optimizely Stats-Engine paper by Johari et al., “Always Valid Inference”, arXiv:1512.04922: the mixture Sequential Probability Ratio Test (mSPRT) gives an always-valid p-value that stays valid under ANY stopping rule. You may peek whenever and as often as you like.

The trade-off: mSPRT is more conservative than the naive Z-test — you need slightly more data for the same significance call. In exchange you get honest results. Toggle “Peek-Safe (mSPRT)” up top and the calculator switches — showing both the always-valid-p and the naive p side by side so the difference is visible. Rule of thumb: looked at the running test more than twice? Use mSPRT.
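For intuition, here is a sketch of an always-valid p-value using the normal-mixture form of the mSPRT for two Bernoulli streams. Several things are assumptions rather than the calculator's documented behaviour: equal traffic per arm, a null difference of zero, and the mixing variance `tau_sq` (a tuning parameter, set here to a hypothetical 1e-4). The function name and input format are illustrative.

```python
from math import exp, sqrt


def msprt_always_valid_p(checkpoints, tau_sq=1e-4):
    """Always-valid p-value over a sequence of looks.

    checkpoints: iterable of (n_per_arm, conv_a, conv_b) cumulative counts,
    one tuple per look, assuming equal sample sizes in both arms.
    """
    p_value = 1.0
    for n, conv_a, conv_b in checkpoints:
        p_a, p_b = conv_a / n, conv_b / n
        v = p_a * (1 - p_a) + p_b * (1 - p_b)  # variance of one paired difference
        if v == 0:
            continue  # no information yet (for example, zero conversions in both arms)
        theta_hat = p_b - p_a
        # Mixture likelihood ratio against "no difference", N(0, tau_sq) mixing prior.
        likelihood_ratio = sqrt(v / (v + n * tau_sq)) * exp(
            n * n * tau_sq * theta_hat ** 2 / (2 * v * (v + n * tau_sq))
        )
        # The p-value can only go down over time, never back up.
        p_value = min(p_value, 1 / likelihood_ratio)
    return p_value


# Assumed data: 80/1000 vs 100/1000 at the final look, 40/500 vs 50/500 halfway.
p_av = msprt_always_valid_p([(500, 40, 50), (1000, 80, 100)])
print(f"always-valid p = {p_av:.3f}")
```

On these assumed counts the always-valid p-value comes out far above the naive 0.1181, which is the conservatism described above. A different `tau_sq` changes the number, so treat the output as illustrative only.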

Realism gate — when should I not even compute?

Statistics can produce very precise numbers for very wrong models. Four standard cases where the math does not match reality — the calculator surfaces a banner in each:

  1. Sample < 100/variant: the Normal approximation of the Z-test is unreliable. At n=50 the reported p-value can be off by orders of magnitude.
  2. Power < 80%: a non-significant result is uninformative when the sample was always too small for the hoped-for effect size. Use the Sample-Size tab.
  3. MDE above 50% relative: anyone hunting a +50% lift is hunting a miracle. Realistic A/B test effects are +1% to +20%; everything above is suspect.
  4. Conversion rate = 0: the Z-test breaks down when a variant has zero conversions. That arm's variance estimate collapses to zero, and if both arms are at zero the statistic is undefined (0/0). The Wilson-score CI still gives an upper bound; gather more data.

Plus: with two variants the calculator runs a χ² test on the 50/50 split (critical value ≈ 10.83 at α=0.001). If it fires, you have a sample-ratio mismatch — fix randomisation before believing the p-value.
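The checks above can be sketched as a single function that returns the banners that would fire. The thresholds (n<100, power below 80%, MDE above 50%, χ² above 10.83) come from the text; the function names, the warning labels, and the `mde_rel` parameter (the relative lift you hoped to detect) are hypothetical, and the power formula is the usual normal approximation rather than anything confirmed about the calculator.

```python
from math import sqrt
from statistics import NormalDist

SRM_CRITICAL = 10.83  # chi-square, 1 degree of freedom, alpha = 0.001


def power_two_proportions(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided Z-test with n_per_arm visitors per arm."""
    norm = NormalDist()
    pooled = (p_a + p_b) / 2
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    se_alt = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf((abs(p_b - p_a) - z_crit * se_null) / se_alt)


def srm_chi2(n_a, n_b):
    """Chi-square statistic for an intended 50/50 traffic split."""
    return (n_a - n_b) ** 2 / (n_a + n_b)


def realism_warnings(n_a, conv_a, n_b, conv_b, mde_rel, alpha=0.05):
    """Return the list of realism-gate warnings for a two-variant test."""
    warnings = []
    if min(n_a, n_b) < 100:
        warnings.append("small_sample")
    if conv_a == 0 or conv_b == 0:
        warnings.append("zero_conversions")
    if mde_rel > 0.5:
        warnings.append("mde_too_large")
    p_a = conv_a / n_a
    if p_a > 0 and p_a * (1 + mde_rel) < 1:
        target = p_a * (1 + mde_rel)  # rate B would need for the hoped-for lift
        if power_two_proportions(p_a, target, min(n_a, n_b), alpha) < 0.8:
            warnings.append("low_power")
    if srm_chi2(n_a, n_b) > SRM_CRITICAL:
        warnings.append("srm")
    return warnings


print(realism_warnings(1000, 80, 1000, 100, mde_rel=0.25))
print(realism_warnings(50, 0, 50, 3, mde_rel=0.6))
```

With the assumed 80/1000 vs 100/1000 inputs and a hoped-for 25% lift, only the low-power warning fires: power is roughly a third, which helps explain why a 2 pp difference still reads as "not significant".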

When do you use Bonferroni versus Holm correction?

When testing three or four variants at once, it is easy to forget that the family-wise error rate climbs. With three independent comparisons at α=0.05 you have a 14% chance (1 − 0.95³ ≈ 0.143) of a false-positive finding somewhere in the family, even if all variants were equally good.

Bonferroni correction divides α by the number of comparisons. With three tests, α=0.0167 per comparison. Very conservative and very simple. Holm-Bonferroni is uniformly more powerful at the same FWER control — sort p-values ascending and check stepwise against α/m, α/(m-1), …, α/1. The first non-significant comparison blocks all subsequent ones.
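Both procedures fit in a few lines. The sketch below follows the description above; the function names and the three example p-values are made up for illustration and are not output from the calculator.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each comparison as significant if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/m, alpha/(m-1), ..., alpha/1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # first failure blocks every larger p-value
        significant[i] = True
    return significant


# Hypothetical p-values for three variants compared against control.
p_values = [0.012, 0.020, 0.040]
print(bonferroni(p_values))  # [True, False, False]
print(holm(p_values))        # [True, True, True]
```

In this example Bonferroni keeps only the first comparison (threshold 0.0167 for all three), while Holm relaxes the threshold to 0.025 and then 0.05 once the smaller p-values have passed, so all three survive.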

The calculator shows both corrections side by side so you can see which comparison survives under which method. Rule of thumb: more than two comparisons → Bonferroni is the minimum, Holm is the default because it is uniformly stronger.

How do you read the Wilson-score confidence interval?

The 95% confidence interval around the Δ-rate is a band for the true difference between variants A and B, constructed so that it covers that difference in 95% of repeated tests. The Wilson-score method is more robust than the naive Normal approximation, especially with small samples or extreme rates (near 0 or 1). We compute a Wilson interval for each single proportion and combine the two via a Newcombe-style approximation for the difference.

Practically: when the CI lies entirely above zero, variant B is demonstrably better. When the CI includes zero, the effect is unclear — could be zero, could be positive, could be negative. With a point lift of +2 pp and CI [−0.5 pp, +4.5 pp] the right answer is “keep collecting”, not “roll out”. The CI is the more honest form of the significance statement.
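A sketch of how such an interval can be computed: a Wilson-score interval per arm, combined with Newcombe's hybrid score method for the difference. The calculator's exact combination rule is only described as "Newcombe-style", so this is one plausible variant rather than its confirmed formula, and the inputs (80/1000 vs 100/1000) are assumed from the displayed rates.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conv, n, confidence=0.95):
    """Wilson-score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = conv / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_diff_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Newcombe hybrid score interval for the difference rate_B - rate_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lo_a, hi_a = wilson_interval(conv_a, n_a, confidence)
    lo_b, hi_b = wilson_interval(conv_b, n_b, confidence)
    diff = p_b - p_a
    lower = diff - sqrt((p_b - lo_b) ** 2 + (hi_a - p_a) ** 2)
    upper = diff + sqrt((hi_b - p_b) ** 2 + (p_a - lo_a) ** 2)
    return lower, upper


lower, upper = newcombe_diff_interval(80, 1000, 100, 1000)
print(f"[{lower * 100:.2f} pp, {upper * 100:.2f} pp]")
```

On the assumed inputs this lands close to the interval in the result panel, though not necessarily identical in the last digit. The same `wilson_interval` also covers the zero-conversion case: `wilson_interval(0, 50)` has a lower bound of zero and a finite upper bound, where a Normal-approximation interval would collapse to a single point.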

What is deliberately not built?

  • No multi-armed bandits / Thompson sampling — that is platform territory. Dynamic traffic reallocation needs a testing system, not a calculator.
  • No survival curves or Poisson means — evanmiller.org covers the long-tail tests well; we stay with Bernoulli data (two proportions).
  • No “test duration in days” output with traffic estimator — that depends too much on the seasonality of your product to be useful.
  • No account / save features — permalink is enough. Anyone needing to persist tests should use a real test-management tool.
  • No CUPED / stratified sampling — belongs in the testing platform, not in a calculator.
