A/B Test Duration Calculator

Calculate how long to run an A/B test for statistically significant results

About This Tool

Stopping an A/B test the moment p drops below 0.05 is statistically wrong, but it's how most tests get called. "Peeking" at p-values inflates false-positive rates well above the nominal 5% — sometimes up to 30% if you check daily. The fix is computing required sample size up front and committing to running the full duration.

Provide the baseline conversion rate, minimum detectable effect (MDE), statistical power (typically 80%), and significance level (typically 5%), and the calculator returns the per-variant sample size and the test duration implied by your traffic. Bigger effects need fewer samples; smaller effects need quadratically more, so halving the effect roughly quadruples the sample. A test aimed at detecting a 1% relative lift on a 2% baseline (2.00% to 2.02%) needs millions of users per variant.

A common planning mistake is targeting effects too small for the available traffic. If your traffic needs 90 days to detect a 5% lift, hunting for a 1% improvement isn't realistic: the required sample grows with 1/MDE², so the same test would need roughly 25 times as long. You'll either run a test that can't reliably detect the effect you care about, or run for years.

The statistical foundation is the same hypothesis-testing framework medical researchers use. Power analysis answers: given a baseline rate p, a minimum detectable effect (MDE), a desired statistical power (typically 80%), and a significance level (typically 5%, two-tailed), how many samples per variant do I need so that the test reliably detects the effect when it's real and rarely signals one when it isn't? The math involves the variance of the two proportions and the z-scores for the chosen power and significance levels. The calculator does the arithmetic; the inputs are where teams go wrong.
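For reference, one common form of that calculation is the normal-approximation formula for comparing two proportions. The symbols here are assumptions for illustration: p_1 is the baseline rate, p_2 is the rate after the lift, alpha is the two-sided significance level, and 1 - beta is the power. The calculator's exact formula isn't stated on this page and may differ, for example by using a pooled variance.

```latex
n \;=\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^{2}}
```

With a 5% two-sided significance level and 80% power, the z-scores are about 1.96 and 0.84, so the leading factor is roughly 7.85. The squared difference in the denominator is what makes small effects so expensive.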

Worked example: an e-commerce site with a 3% baseline conversion rate wants to detect a 10% relative lift (target rate 3.3%) at 80% power and 5% two-sided significance. The standard two-proportion formula gives roughly 53,000 visitors per variant. With 5,000 daily visitors split evenly across two variants (2,500 each), that's about 22 days. Now try detecting a 5% relative lift instead: the required sample jumps to roughly 208,000 per variant, or about 84 days at the same traffic. Halving the MDE roughly quadruples the sample size. This is the n ∝ 1/MDE² scaling that surprises every product team. The calculator shows the curve; people consistently underestimate it in their planning.
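The arithmetic behind this kind of example can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: it assumes a two-sided test with the unpooled normal-approximation formula and an even traffic split, and the function names `sample_size_per_variant` and `duration_days` are hypothetical. Other tools use slightly different variance conventions, so figures can differ from one calculator to another.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Visitors needed in each variant (two-sided test, unpooled variance)."""
    target = baseline * (1 + relative_mde)          # rate after the lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = baseline * (1 - baseline) + target * (1 - target)
    n = (z_alpha + z_power) ** 2 * variance_sum / (target - baseline) ** 2
    return math.ceil(n)


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed when daily traffic is split evenly across the variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)


# Same site, two different minimum detectable effects.
for mde in (0.10, 0.05):
    n = sample_size_per_variant(0.03, mde)
    print(f"MDE {mde:.0%}: {n:,} per variant, {duration_days(n, 5_000)} days")
```

Printing the two scenarios side by side makes the roughly fourfold jump from halving the MDE easy to see before committing to a test plan.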

The most common failure mode is "peeking": checking results before the test ends and stopping early when p < 0.05. This inflates false-positive rates. Each peek with intent to stop adds risk; ten peeks over the life of a test can push the false-positive rate from the nominal 5% to roughly 20%, and more frequent checking pushes it toward 30%. The calculator can't enforce no-peeking, but it surfaces the right sample size up front so you have a number to commit to. The discipline of running tests for the planned duration regardless of mid-test signals is what separates teams getting reliable results from teams shipping based on noise.
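The inflation is easy to reproduce with a small simulation. The sketch below is a simplified model, not a description of how any particular testing tool works: it assumes an A/A test (no real difference), equal-sized batches of traffic between peeks, and a normally distributed test statistic, so the z-score at each look is a running sum of standard normal increments divided by the square root of the number of looks. The function name `peeking_false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(peeks, alpha=0.05, simulations=20_000, seed=0):
    """Share of A/A tests declared 'significant' when the test is stopped
    at the first peek whose two-sided p-value falls below alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, peeks + 1):
            # Each batch of new data adds one standard normal increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5  # z-score using all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team "calls" the test here
    return false_positives / simulations


for peeks in (1, 5, 10, 30):
    print(f"{peeks:>2} peeks: {peeking_false_positive_rate(peeks):.1%}")
```

In this model a single look at the end stays near the nominal 5%, while ten looks land somewhere near 20%, even though nothing about the variants differs.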

A structural caveat: traffic is rarely uniform over time. Tuesday traffic differs from Saturday; Black Friday differs from a normal week. A test running for less than a full week introduces day-of-week confounding; a test running through a holiday or marketing campaign introduces seasonal confounding. Best practice: run for at least one full week (preferably two), in whole weeks, and avoid running tests through known anomalous periods. The duration the calculator shows is a minimum; real-world reasons often justify running longer.
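One way to apply the full-week rule is to round the calculated minimum up to whole weeks. The helper below is a hypothetical sketch (the name `planned_duration_days` and the `min_weeks` parameter are assumptions, not part of the calculator), and it does not account for holidays or campaigns, which still need a manual check against the calendar.

```python
import math


def planned_duration_days(min_days, min_weeks=1):
    """Round a minimum duration up to whole weeks, never below min_weeks."""
    weeks = max(min_weeks, math.ceil(min_days / 7))
    return weeks * 7


print(planned_duration_days(10))               # 10 days becomes 2 full weeks
print(planned_duration_days(3, min_weeks=2))   # short tests still run 2 weeks
```

Rounding up rather than down keeps every weekday equally represented in both variants without cutting the sample below the planned size.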

The about text and FAQ on this page were drafted with AI assistance and reviewed by a member of the Coherence Daddy team before publishing. See our Content Policy for editorial standards.

Frequently Asked Questions