Stratification: Improving test sensitivity

Mon Jun 23 2025

Ever run an A/B test that should have shown a clear winner, but the results were all over the place? You're not alone - most of us have been there, staring at noisy data and wondering if we need a bigger sample size or just better luck.

Here's the thing: the problem might not be your sample size at all. It could be that you're treating all your users like they're the same person when they're actually wildly different. That's where stratification comes in - and it might just save your next experiment.

Understanding stratification and its role in improving test sensitivity

Let me paint you a picture. Stratification is basically splitting your population into groups that actually make sense - like separating apples from oranges before you weigh your fruit basket. When you account for these natural differences upfront, something useful happens: the variance in your results drops, because differences between groups stop getting counted as noise.

Think about it this way. If you're testing a new feature, your power users probably behave totally differently from newcomers. Mix them all together in one big pot, and you'll get muddy results. But separate them out? Now you're cooking with gas.
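
Want to see that in numbers? Here's a toy simulation (my own sketch, nothing from the studies below): two user segments with wildly different baseline spend, the same true lift for everyone, and a comparison of how noisy the pooled estimate is versus a segment-weighted one.

```python
import numpy as np

# Toy simulation (a sketch, not anyone's production code): two user segments with
# very different baseline spend, plus a treatment that adds $1 for everyone.
# We compare the spread of the naive lift estimate with a stratified
# (segment-weighted) estimate across many simulated experiments.
rng = np.random.default_rng(0)
n, true_lift, n_runs = 5_000, 1.0, 500

def one_run():
    power_user = rng.random(n) < 0.3                   # 30% power users
    baseline = np.where(power_user, 50.0, 5.0)         # very different baselines
    treated = rng.random(n) < 0.5
    spend = rng.normal(baseline + true_lift * treated, 10.0)

    # Naive: difference in overall means, everyone pooled together.
    naive = spend[treated].mean() - spend[~treated].mean()

    # Stratified: estimate the lift within each segment, then combine the
    # per-segment lifts weighted by segment size.
    lifts, weights = [], []
    for seg in (True, False):
        m = power_user == seg
        lifts.append(spend[m & treated].mean() - spend[m & ~treated].mean())
        weights.append(m.mean())
    return naive, np.average(lifts, weights=weights)

results = np.array([one_run() for _ in range(n_runs)])
print("std of naive estimate:     ", results[:, 0].std().round(3))
print("std of stratified estimate:", results[:, 1].std().round(3))
```

Run it and the stratified estimate's spread comes out noticeably tighter - same users, same true lift, just smarter grouping.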

The colorectal cancer screening study from the Oshima Study Workgroup shows this perfectly. They didn't just throw everyone into the same screening bucket. Instead, they grouped people by age, sex, family history, BMI, and smoking habits. The result? Way better detection rates for advanced colorectal neoplasia. Same screening method, just smarter grouping.

But here's where people mess up - they go stratification crazy. The folks at SWOG Cancer Research Network learned this the hard way. They found that using four or six stratification factors in smaller trials actually made things worse. It's like trying to sort your M&Ms by color, size, shape, and manufacturing date - at some point, you've got more categories than candy.

The sweet spot? One or two really meaningful factors. Pick the characteristics that actually matter for what you're testing. Everything else is just noise dressed up as sophistication.

Implementing stratified sampling to enhance experimental precision

So how do you actually do this? First things first - figure out what characteristics really influence your outcome. Not what you think might matter, but what the data tells you matters.

Let's say you're running an e-commerce experiment. Your stratification factors might be:

  • Purchase frequency (heavy buyers vs. window shoppers)

  • Device type (mobile vs. desktop users behave differently)

  • Geographic region (shipping times affect behavior)

Once you've picked your factors, you allocate your sample proportionally. If 30% of your users are mobile-only, then 30% of both your control and treatment groups should be mobile-only. Simple as that.
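
In code, proportional allocation can be as simple as grouping users by stratum and splitting each group evenly (a rough sketch - the field names and the 50/50 split are just for illustration):

```python
import random
from collections import defaultdict

# Sketch of proportional allocation: group users by stratum, then split each
# stratum evenly between control and treatment so both arms mirror the
# population's mix of strata.
def stratified_assign(users, strata_key, seed=42):
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[strata_key(user)].append(user)

    assignments = {}
    for stratum, members in by_stratum.items():
        rng.shuffle(members)
        half = len(members) // 2
        for user in members[:half]:
            assignments[user["id"]] = "treatment"
        for user in members[half:]:
            assignments[user["id"]] = "control"
    return assignments

# Example: stratify on device type and purchase frequency together.
users = [
    {"id": 1, "device": "mobile", "buyer": "heavy"},
    {"id": 2, "device": "desktop", "buyer": "light"},
    {"id": 3, "device": "mobile", "buyer": "light"},
    {"id": 4, "device": "mobile", "buyer": "heavy"},
]
print(stratified_assign(users, lambda u: (u["device"], u["buyer"])))
```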

The colorectal cancer screening example I mentioned earlier? They combined their risk stratification score with the standard fecal immunochemical test (FIT). By grouping people based on their risk factors first, they caught way more cases of advanced colorectal neoplasia than FIT alone ever could.

Another great example comes from smoking cessation research. The researchers realized that people's likelihood of actually quitting smoking massively affected how they responded to withdrawal treatments. By stratifying based on "abstention potential," they got cleaner data on which interventions actually worked.

But watch out for the pitfalls. The SWOG computational study I mentioned earlier is your cautionary tale. Too many strata means tiny sample sizes in each group. Suddenly you need 10x more participants just to detect the same effect. Not exactly efficient.

Real-world applications of stratification in improving sensitivity

Let's talk about where this stuff really shines - the real world, where messy data is the norm and clean results are the exception.

Healthcare has been all over stratification lately, and for good reason. The Oshima Study Workgroup didn't just improve detection rates - they created a practical screening method that doctors can actually use. By combining risk scores with standard tests, they made the whole process more efficient without adding expensive new procedures.

Clinical trials have been using stratified randomization for years. Kernan and colleagues showed that this approach is especially powerful for smaller trials where every bit of statistical power counts. When you're trying to prove two treatments are equivalent (not just different), stratification can be the difference between a conclusive result and a big question mark.

Emergency rooms are another great example. Tam's team combined ECG readings with high-sensitivity cardiac troponin T tests to better predict which chest pain patients were heading for serious cardiac events. Instead of treating everyone the same, they could quickly identify high-risk patients and allocate resources accordingly. That's stratification saving lives, not just improving p-values.

The cutting edge? Weng and colleagues are using polygenic risk scores combined with lifestyle factors to improve cancer screening. They're basically stratifying at the genetic level, which sounds fancy but boils down to the same principle: group similar people together for better results.

Stratification in A/B testing and online experimentation

Now let's bring this home to where most of us live - A/B testing and online experiments. This is where stratification can turn a mediocre test into a precision instrument.

The basic idea stays the same: divide your users into meaningful groups before you randomize. But online, you've got some unique advantages:

  • User behavior data is plentiful

  • You can stratify in real-time

  • Historical patterns are easy to identify

Here's what typically works well for stratification factors in digital experiments (there's a short sketch after this list showing how to read results out by stratum):

  • User tenure (new vs. returning users)

  • Platform (iOS vs. Android vs. web)

  • Engagement level (daily active vs. occasional users)

  • Geographic factors (especially for features with regional differences)
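
Here's roughly what a post-stratified readout looks like (a sketch on simulated data - the column names and numbers are made up): compute the lift within each stratum, then combine the per-stratum lifts weighted by each stratum's share of traffic.

```python
import numpy as np
import pandas as pd

# Simulated experiment data: one row per user, with a platform stratum,
# a treatment flag, and a conversion outcome.
rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "platform": rng.choice(["ios", "android", "web"], size=n, p=[0.4, 0.4, 0.2]),
    "treatment": rng.integers(0, 2, size=n),
})
base = df["platform"].map({"ios": 0.08, "android": 0.05, "web": 0.03})
df["converted"] = rng.random(n) < base + 0.01 * df["treatment"]

# Lift within each stratum, then a traffic-weighted combination.
per_stratum = (
    df.groupby(["platform", "treatment"])["converted"].mean().unstack("treatment")
)
per_stratum["lift"] = per_stratum[1] - per_stratum[0]
weights = df["platform"].value_counts(normalize=True)
overall_lift = (per_stratum["lift"] * weights).sum()

print(per_stratum)
print(f"weighted overall lift: {overall_lift:.4f}")
```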

But remember that over-stratification warning from earlier? It hits online testing hard. Every additional stratum means more complexity in your analysis. You might need to check for interactions between your strata and treatment, which can turn a simple test into a statistical nightmare.
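
If you do suspect the treatment effect differs by stratum, a quick way to check is a model with an interaction term (again a sketch on simulated data, not a built-in Statsig report):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Quick interaction check: fit a model with a treatment-by-stratum interaction.
# If the interaction coefficients are large and significant, the treatment
# effect genuinely differs by stratum and one pooled lift number is misleading.
rng = np.random.default_rng(2)
n = 10_000
df = pd.DataFrame({
    "platform": rng.choice(["mobile", "desktop"], size=n),
    "treatment": rng.integers(0, 2, size=n),
})
# Simulate a treatment that only helps mobile users.
lift = np.where(df["platform"] == "mobile", 0.03, 0.0)
df["converted"] = (rng.random(n) < 0.05 + lift * df["treatment"]).astype(float)

model = smf.ols("converted ~ treatment * C(platform)", data=df).fit()
print(model.params.filter(like=":"))  # the treatment:platform interaction term
```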

Here's my rule of thumb: if you can't explain why a stratification factor matters in one sentence, it probably doesn't. "Mobile users convert differently than desktop users" - clear winner. "Users who joined on a Tuesday might behave differently" - probably overthinking it.

Statsig makes this particularly easy with their stratified sampling features. You can set up your strata, and the platform handles the proportional allocation automatically. No more spreadsheet gymnastics trying to ensure your groups are balanced.

Closing thoughts

Stratification isn't magic - it's just smart grouping. By acknowledging that not all users (or patients, or trial participants) are created equal, you can dramatically improve the sensitivity of your tests.

The key takeaways? Start simple with one or two meaningful factors. Make sure they actually relate to your outcome. And resist the urge to stratify everything just because you can. Sometimes the best stratification strategy is knowing when not to use it at all.

If you want to dive deeper, check out:

  • Statsig's guide on stratified sampling for practical implementation tips

  • The original papers I linked throughout for the nitty-gritty details

  • Your own historical data - often the best teacher for what stratification factors actually matter in your context

Hope this helps you run cleaner, more sensitive experiments. Your future self (and your data team) will thank you!
