CUPAC: Controlling pre-experiment bias

Mon Jun 23 2025

Ever run an A/B test that showed amazing results, only to realize later that your test group was already performing better before you even started? You're not alone - this happens more often than most experimenters care to admit. It's called pre-experiment bias, and it can completely derail your testing program if you're not careful.

The good news is that smart folks have been working on this problem for years. They've developed techniques like CUPED and CUPAC that can help you spot and correct for these biases before they tank your results. Let's dig into what these biases look like, why traditional fixes fall short, and how newer methods can save your experiments.

Recognizing pre-experiment bias in A/B testing

What is pre-experiment bias?

Pre-experiment bias is basically when your test and control groups aren't starting from the same place. Maybe your test group has more power users, or they're from a different geographic region with different spending habits. The groups look randomly assigned, but they're actually different in ways that matter to your metrics.

Here's the kicker - these differences can make your results look way better (or worse) than they actually are. You might think your new feature drove a 10% revenue increase, but really your test group just had more high-spenders to begin with. This is the kind of thing that gets people fired when the feature rolls out and those gains mysteriously disappear.

Effects of pre-experiment bias

The most obvious problem is that you'll make bad decisions based on bad data. But it gets worse. Pre-experiment bias doesn't just make your results wrong - it makes them wrong in predictable ways that reinforce whatever you already wanted to believe.

If you're lucky, you'll catch the bias early. If you're not, you might:

  • Ship features that actually hurt your metrics

  • Kill features that would have helped

  • Waste months building on false positive results

  • Lose credibility with your team when results don't replicate

The statistical folks call these Type I and Type II errors, but I prefer to think of them as "shipping garbage" and "killing gold." Both hurt, but in different ways.

What makes this particularly frustrating is that pre-experiment bias can hide in plain sight. Your randomization might be working perfectly, but if certain types of users are more likely to be online when you start your test, or if your test happens to catch a seasonal pattern, you're still biased. Teams at companies like Uber and Airbnb have found that techniques like CUPED and CUPAC can help adjust for these biases, but they're not magic bullets.

Limitations of traditional variance reduction methods

Overview of CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) is one of those ideas that seems obvious in hindsight. Microsoft's experimentation team popularized it, and the basic idea is simple: use what you know about users before the experiment to adjust for differences after.

Think of it like handicapping in golf. If you know someone usually shoots 10 over par and they shoot 8 over in your tournament, that's actually a good performance for them. CUPED does something similar with your metrics - it adjusts everyone's results based on their historical baseline.
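To make that concrete, here's a minimal sketch of the CUPED adjustment in Python, assuming you have each user's in-experiment metric and pre-experiment baseline as NumPy arrays (the names y and x are just placeholders, not anything from a specific library):

```python
import numpy as np

def cuped_adjust(y, x):
    """Subtract the part of y that pre-experiment data already explains.

    y : in-experiment metric per user (e.g. revenue during the test)
    x : the same users' pre-experiment baseline for that metric
    """
    # theta is the regression slope of y on x; removing theta * (x - mean(x))
    # strips out the variance that was predictable before the test started.
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Compute theta on treatment and control pooled together, then compare
# group means of the adjusted values: same expected difference, less noise.
```

The golf analogy maps directly: theta * (x - mean(x)) is each user's handicap, and the adjusted value is how they performed relative to it.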

Limitations of CUPED

But here's where CUPED starts to show its age. First off, it only works when a user's pre-experiment behavior actually predicts their in-experiment behavior. That holds up for stable metrics like daily active users, but falls apart for anything seasonal or trending. Gaming companies learned this the hard way when CUPED adjustments went haywire during holiday events.

The bigger issue? CUPED is basically useless for new user metrics. No pre-experiment data means no adjustment. So if you're testing onboarding flows or first-time user experiences, you're out of luck. This is where teams started looking at alternatives like CUPAC.

Then there's the trust problem. When CUPED makes big adjustments, people get nervous. I've seen data scientists spend hours explaining why a 2% observed difference became a 5% adjusted difference. Even when the math is right, it feels wrong to non-technical stakeholders. And when those stakeholders control your roadmap, their comfort level matters.

The folks at Statsig have seen this pattern repeatedly - CUPED works great until it doesn't, and when it doesn't, you need something more sophisticated.

Introducing CUPAC: A proactive approach to controlling bias

CUPAC (Control Using Predictions As Covariates) takes the CUPED idea and cranks it up to 11. Instead of just using historical data, it builds predictive models to figure out what would have happened without your experiment.

How CUPAC works

The Netflix data science team pioneered similar approaches when they realized that simple historical averages weren't cutting it for their recommendation experiments. CUPAC follows their lead by using machine learning to predict outcomes based on whatever data you have available.

Here's how it actually works in practice (a minimal code sketch follows the list):

  1. Gather your predictors: Pull together anything that might predict your outcome - user demographics, device types, past behavior patterns, even time of day

  2. Train a model: Use machine learning (usually something simple like gradient boosting) to predict what each user's metric would be - fit it on pre-experiment or control-only data so the model can't absorb the treatment effect you're trying to measure

  3. Calculate adjustments: Subtract predicted values from actual values to get the "surprise" factor

  4. Run your analysis: Test whether the surprises are different between your groups
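Here's a rough sketch of those four steps, assuming a pandas DataFrame with a group column, a metric column, and some pre-treatment feature columns (all column names are placeholders, and gradient boosting is just one reasonable model choice). Fitting on control data only is a shortcut to keep the treatment effect out of the predictions; in practice you'd often fit on pre-experiment data or use cross-fitting instead:

```python
import pandas as pd
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

def cupac_test(df, feature_cols, metric_col="metric", group_col="group"):
    """Rough CUPAC flow: predict the metric from pre-treatment features,
    then test whether the residuals (the "surprises") differ by group."""
    X, y = df[feature_cols], df[metric_col]
    in_control = df[group_col] == "control"   # assumes exactly two groups

    # Steps 1-2: train the prediction model on data the treatment can't touch.
    model = GradientBoostingRegressor()
    model.fit(X[in_control], y[in_control])

    # Step 3: the adjustment is actual minus predicted.
    surprise = y - model.predict(X)

    # Step 4: compare the surprises between groups instead of the raw metric.
    return stats.ttest_ind(surprise[~in_control], surprise[in_control],
                           equal_var=False)
```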

The beauty is that CUPAC doesn't care where your predictors come from. No historical data for new users? Use their device type and acquisition channel. Seasonal effects messing with your metrics? Include day of week and time of year as features.

By thinking about the problem as prediction rather than simple averaging, CUPAC opens up a whole new world of bias correction. It's like going from a bicycle to a motorcycle - same basic idea, way more power.

Implementing CUPAC for accurate experimental results

Getting CUPAC right requires more finesse than CUPED, but the payoff is worth it. I've seen teams reduce their experiment runtime by 30-40% just by implementing it properly.

Best practices for applying CUPAC

Start with covariate selection - this is where most teams mess up. You want predictors that correlate with your outcome but have nothing to do with your treatment. User tenure? Great. Time since last purchase? Perfect. Anything that happened after randomization? Stay away.

Cross-validation is your friend here. Split your control group data and see if your model predicts well on held-out users. If it doesn't, your adjustments will add noise instead of removing it. I typically aim for at least 0.3 R-squared on holdout data before trusting the model.
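One way to run that check, sticking with the same kind of DataFrame as in the earlier sketch and using scikit-learn's cross-validation helpers (the 0.3 bar below is just the rule of thumb from this paragraph, not a universal threshold):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def covariate_model_looks_ok(df, feature_cols, metric_col="metric",
                             group_col="group", min_r2=0.3):
    """Cross-validate the prediction model on control users and check
    whether the average held-out R-squared clears the bar."""
    control = df[df[group_col] == "control"]
    scores = cross_val_score(GradientBoostingRegressor(),
                             control[feature_cols], control[metric_col],
                             cv=5, scoring="r2")
    return scores.mean(), scores.mean() >= min_r2
```

If the held-out R-squared hovers near zero, skip the adjustment for that experiment; a weak model just adds noise.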

The teams that succeed with CUPAC share a few habits:

  • They start simple (5-10 features max)

  • They validate on multiple past experiments

  • They monitor their models for drift

  • They can explain their adjustments to skeptics

That last point is crucial. Build trust by showing before-and-after distributions, not just summary statistics. When people can see that you're making reasonable adjustments to individual users, not just waving a statistical wand, they're more likely to believe the results.
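One lightweight way to do that, assuming you've added the surprise values from the earlier sketch as a column on df, is to histogram the raw and adjusted metric side by side with matplotlib:

```python
import matplotlib.pyplot as plt

def plot_before_after(df, metric_col="metric", adjusted_col="surprise",
                      group_col="group"):
    """Show each group's raw and adjusted metric so stakeholders can see
    what the adjustment actually changed."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    for ax, col, title in [(axes[0], metric_col, "Raw metric"),
                           (axes[1], adjusted_col, "After adjustment")]:
        for name, grp in df.groupby(group_col):
            ax.hist(grp[col], bins=30, alpha=0.5, label=str(name))
        ax.set_title(title)
        ax.legend()
    plt.show()
```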

Challenges and solutions

CUPAC isn't all sunshine and rainbows. The computational overhead can be real - you're training models for every experiment, possibly every day. Statsig and similar platforms handle this by pre-computing features and using efficient algorithms, but if you're rolling your own, budget for the compute costs.

Interpretability remains a challenge. When your random forest makes different adjustments for similar users, explaining why gets tricky. I've found that simple linear models often work nearly as well and are much easier to debug. Start there before reaching for the fancy stuff.

The covariate selection problem deserves its own post, but here's the short version: automate what you can, but keep humans in the loop. Let your system suggest covariates based on correlation analysis, but have experienced experimenters review the list. Domain knowledge beats statistical significance every time.

One thing that helps - run CUPAC alongside your regular analysis for a few months. Compare the results, see where they differ, and build intuition for when the adjustments matter. Once your team sees CUPAC catching real biases that would have led to bad decisions, adoption becomes much easier.

Closing thoughts

Pre-experiment bias is one of those problems that keeps experienced experimenters up at night. Just when you think you've run a clean test, you discover your groups weren't comparable to begin with. But with techniques like CUPED and especially CUPAC, we finally have tools to fight back.

The key is to start simple and build trust. Run these methods alongside your current approach, validate on historical experiments, and be transparent about what they're doing. Your future self will thank you when you catch that bias before shipping a broken feature.


Hope you find this useful! And remember - a biased experiment caught early is infinitely better than a biased decision made too late.


