Sequential testing: How to peek at A/B test results without ruining validity

Mon Jun 23 2025

Ever check your A/B test results every hour, hoping to see that beautiful green "significant" label? You're not alone - and you're also playing with statistical fire.

This constant checking (what stats nerds call "peeking") can trick you into thinking you've found a winner when you haven't. The good news? There's a way to monitor your tests continuously without lying to yourself about the results.

The peeking problem in A/B testing

Here's the deal: traditional A/B testing is like baking a cake. You're supposed to set the timer, walk away, and only check when it's done. But in the real world of online experiments, that's just not how we work. We check. We refresh. We peek.

The problem is that every time you look at your results, you're essentially rolling the dice again. Check 20 times instead of once? You've inflated your false positive rate from 5% to roughly 25% - and keep peeking indefinitely and it climbs past 40% and beyond. That "significant" result you're celebrating might just be statistical noise wearing a party hat.
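You don't have to take that on faith - here's a quick simulation you can run yourself. It's an A/A test (both groups drawn from the same distribution), so every "significant" result is a false positive by construction. The sample sizes, number of peeks, and seed are arbitrary choices; the exact rates you see will shift with them.

```python
# Simulate an A/A test: there is no real effect, so any "win" is a false positive
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_simulations = 2_000
n_per_group = 5_000   # final sample size per group
n_peeks = 20          # evenly spaced interim looks
alpha = 0.05

fp_single_look = 0
fp_with_peeking = 0

for _ in range(n_simulations):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)  # same distribution as a: no true difference

    # Discipline: one look at the very end
    _, p_final = stats.ttest_ind(a, b)
    fp_single_look += p_final < alpha

    # Impatience: peek 20 times and stop at the first "significant" result
    checkpoints = np.linspace(n_per_group // n_peeks, n_per_group, n_peeks, dtype=int)
    for n in checkpoints:
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:
            fp_with_peeking += 1
            break

print(f"False positive rate with a single look: {fp_single_look / n_simulations:.1%}")
print(f"False positive rate with {n_peeks} peeks: {fp_with_peeking / n_simulations:.1%}")
```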

Traditional fixed-horizon tests weren't built for our always-on, continuously monitored world. They assume you'll:

  • Pick a sample size up front (see the sketch after this list)

  • Run the test to completion

  • Check once at the end
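That first step - picking the sample size - usually comes from a power calculation. Here's roughly what it looks like with statsmodels; the baseline rate and minimum detectable lift are made-up numbers you'd replace with your own.

```python
# Fixed-horizon workflow: compute the sample size up front, then look once
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10             # current conversion rate (illustrative)
minimum_detectable_lift = 0.01   # smallest absolute lift worth detecting (10% -> 11%)

effect_size = proportion_effectsize(baseline_rate + minimum_detectable_lift, baseline_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # false positive rate for the one look at the end
    power=0.80,   # 80% chance of detecting the lift if it's real
    ratio=1.0,    # equal-sized control and treatment groups
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```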

But let's be honest - who actually does that? When you can see data streaming in real-time, the temptation to peek is irresistible. And why shouldn't you look? Quick decisions often matter more than perfect statistics.

This is where sequential testing comes to the rescue. Instead of pretending we won't peek, sequential testing acknowledges reality and adjusts the math accordingly. It's like having a statistical bodyguard that protects you from your own impatience.

Sequential testing as a solution

Sequential testing is basically A/B testing that expects you to peek. Instead of slapping your wrist for checking early, it adjusts its calculations on the fly to keep your error rates in check.

Think of it this way: regular A/B testing is like a pregnancy test - you wait the full time, then read the result once. Sequential testing is more like a heart monitor - it's designed to be watched continuously.

The magic happens through statistical adjustments that essentially raise the bar for significance as you check more often. Early in the test, you need overwhelming evidence to call a winner. As more data comes in, the confidence intervals tighten up, and the threshold for significance becomes more reasonable.
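To make that concrete, here's a minimal sketch of one widely used way to do it: the mixture sequential probability ratio test (mSPRT), which turns a stream of observations into an "always-valid" p-value you can check after every new data point. This is a generic illustration of the technique under a normal approximation, not Statsig's exact implementation, and the mixing parameter tau is a tuning knob you'd choose yourself (larger values favor catching big effects early).

```python
# Sketch of an mSPRT "always-valid" p-value for a difference in means,
# using a normal approximation and a N(0, tau^2) mixing prior on the effect
import numpy as np

def always_valid_p_values(control, treatment, tau=0.1):
    """Return the always-valid p-value after each successive observation pair.

    control, treatment: equal-length arrays of per-user metric values.
    tau: mixing standard deviation - a tuning parameter, not estimated from data.
    """
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    p, p_values = 1.0, []
    for n in range(2, len(control) + 1):
        diff = treatment[:n].mean() - control[:n].mean()
        # Variance of the difference in means (recomputed from scratch here for
        # clarity; a production version would update these stats incrementally)
        var_diff = (treatment[:n].var(ddof=1) + control[:n].var(ddof=1)) / n
        if var_diff == 0:
            p_values.append(p)
            continue
        # Mixture likelihood ratio against "no effect"
        lam = np.sqrt(var_diff / (var_diff + tau**2)) * np.exp(
            tau**2 * diff**2 / (2 * var_diff * (var_diff + tau**2))
        )
        p = min(p, 1.0 / lam)  # the always-valid p-value can only move down
        p_values.append(p)
    return p_values
```

The useful property is that this p-value is valid at every point in time, so the moment it drops below your threshold you can stop - no matter how many times you've already looked.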

Statsig's implementation handles these adjustments automatically. You get to monitor your experiments in real time without the guilt (or the inflated false positive rates). The platform continuously recalculates what "significant" means as data accumulates, so the guarantee holds no matter how often you look.

What makes this especially valuable is that you can act on results as soon as they're genuinely significant - not just the first time noise happens to look convincing. Found a serious bug that's tanking conversion? You'll know within hours, not weeks. Discovered a killer feature that's crushing it? Ship it before your competitors even finish their testing cycle.

Implementing sequential testing in experiments

Setting up sequential testing isn't rocket science, but you do need to make a few decisions upfront. First up: tuning parameters. These control how conservative or aggressive you want to be:

  • Want to catch big effects quickly? You can tune for that

  • Worried about false positives? Dial up the conservatism

  • Need precise effect size estimates? Plan to run longer

You'll also need stopping rules - basically, the conditions that end your test (wired together in the sketch after this list). Common approaches include:

  • Stop when you hit significance (for better or worse)

  • Stop after a maximum time period

  • Stop when the confidence interval is narrow enough
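Here's roughly how those rules might fit together. Everything in this sketch - the function name, the thresholds, the two-week cap - is an illustrative placeholder, and the p-value and confidence interval you pass in should come from a sequentially adjusted method (like the mSPRT sketch above); otherwise the first rule just reintroduces the peeking problem.

```python
# Illustrative stopping-rule check; names and thresholds are placeholders
from datetime import timedelta

ALPHA = 0.05                       # sequentially adjusted significance threshold
MAX_DURATION = timedelta(days=14)  # hard cap on how long the experiment runs
MAX_CI_WIDTH = 0.02                # stop once the effect estimate is precise enough

def should_stop(elapsed, p_value, ci_low, ci_high):
    """Apply the three stopping rules from the list above, in order."""
    if p_value < ALPHA:
        return "stop: significant result"
    if elapsed > MAX_DURATION:
        return "stop: maximum duration reached"
    if ci_high - ci_low < MAX_CI_WIDTH:
        return "stop: confidence interval is narrow enough"
    return "keep collecting data"

# Example check, three days in
print(should_stop(timedelta(days=3), p_value=0.11, ci_low=-0.004, ci_high=0.019))
```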

As data rolls in, the confidence intervals dynamically adjust. Early on, they're wide - like training wheels for your statistical conclusions. Over time, they tighten up as the evidence becomes clearer.

The key to interpreting results? Look at trends, not moments. That spike at 2 PM on Tuesday? Probably noise. A consistent 3% lift over three days? Now we're talking. The folks on Reddit's data science community learned this the hard way - early winners often become losers by day 7.

Here's my practical advice for getting the most out of sequential testing:

  • Start with a hypothesis, not a fishing expedition

  • Define your key metric upfront - don't go hunting for significance

  • Account for weekly patterns - Monday's data tells a different story than Friday's

  • Run important tests to completion even if they hit early significance

Sequential testing shines brightest when you're looking for quick wins or protecting against disasters. It's perfect for catching that broken checkout flow before it costs you thousands. But if you need to know whether your new feature drives exactly 2.3% more engagement? You might still want to let that test run its full course.

Advantages and best practices of sequential testing

The biggest win with sequential testing? Speed. You can identify regressions in hours instead of weeks. That broken payment flow? Caught and fixed before lunch. The feature that's unexpectedly doubling engagement? Rolled out to everyone while your competitors are still "gathering data."

But (and there's always a but), sequential testing involves trade-offs. You're essentially trading some statistical power for the ability to make faster decisions. If you don't find significance early, you might have a harder time detecting smaller effects later. It's like using binoculars that are great for spotting elephants but might miss the mice.

Here's how I recommend using sequential testing in practice:

Do use sequential testing for:

  • Catching bugs and regressions fast

  • Making ship/no-ship decisions on single metrics

  • High-traffic experiments where data accumulates quickly

  • Situations where speed matters more than precision

Also run traditional tests when:

  • You need precise effect size estimates

  • Multiple metrics matter equally

  • The cost of a wrong decision is high

  • You're testing subtle UX changes

The best teams use both approaches. Run sequential tests to catch the obvious winners and losers early. Use fixed-horizon tests for the nuanced stuff that needs careful analysis.

A few more tips from the trenches:

  • Document your decision criteria before starting - it's too easy to move goalposts when you're watching results

  • Watch out for novelty effects - that early spike might just be users exploring

  • Consider your traffic patterns - B2B products see different weekly cycles than consumer apps

  • Don't peek at segments unless you've adjusted for multiple comparisons (see the sketch below)
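The bluntest way to make that adjustment is a Bonferroni correction: divide your significance threshold by the number of segments you're slicing. The segment names and p-values below are made up purely for illustration.

```python
# Bonferroni correction for segment-level peeks (illustrative values)
segment_p_values = {
    "mobile": 0.012,
    "desktop": 0.048,
    "new_users": 0.003,
    "returning_users": 0.210,
}

alpha = 0.05
adjusted_alpha = alpha / len(segment_p_values)  # split the error budget across segments

for segment, p in segment_p_values.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{segment}: p={p:.3f} -> {verdict} at adjusted alpha {adjusted_alpha:.4f}")
```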

Remember, data peeking isn't inherently evil - it's peeking without the right statistical framework that gets you in trouble.

Closing thoughts

Sequential testing isn't a magic bullet, but it's a powerful tool for the modern experimenter's toolkit. It lets you have your cake and eat it too - continuous monitoring without the statistical hangover.

The key is knowing when to use it. Need to ship fast and catch major issues? Sequential testing is your friend. Running a delicate test where every basis point matters? Maybe stick with the traditional approach.

Hope you find this useful! Now go forth and peek responsibly.


