CUPED Explained

Thu May 18 2023

Craig Sexauer

Data Scientist, Statsig

CUPED is slowly becoming a common term in online experimentation since its coining by Microsoft in 2013.

Meaning Controlled-experiment Using Pre-Experiment Data, CUPED is frequently cited as, and used as, one of the most powerful algorithmic tools for increasing the speed and accuracy of experimentation programs.

In this article, we'll:

  • Cover the background of CUPED

  • Illustrate the core concepts behind CUPED

  • Show how you can leverage this tool to run faster and less biased experiments

What CUPED solves:

As an experiment matures and hits its target date for readout, it's not uncommon to see a result that falls just short of the threshold for statistical significance. In a frequentist world, this isn't sufficient evidence that your change caused a change in user behavior.

[Figure: nearly significant result]

If there was a real effect, you would have needed a larger sample to increase your chances of getting a statistically significant result. In an experiment, the standard error, or "noise", is inversely proportional to the square root of your sample size, so quadrupling the sample only halves the standard error. However, sample size is an expensive resource, usually proportional to the enrollment window of your experiment.
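To make the square-root relationship concrete, here is a minimal sketch. The function names and the standard deviation of 60 are illustrative assumptions, not values from this article.

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a sample mean: std_dev / sqrt(n)."""
    return std_dev / math.sqrt(n)

def sample_size_multiplier(se_reduction: float) -> float:
    """How many times larger n must be to shrink the standard error by `se_reduction`x."""
    return se_reduction ** 2

# Hypothetical metric with a standard deviation of 60
base = standard_error(std_dev=60.0, n=1_000)
quadrupled = standard_error(std_dev=60.0, n=4_000)

print(base / quadrupled)          # roughly 2: four times the users, half the noise
print(sample_size_multiplier(2))  # 4
```

This is why waiting for more users is such a slow way to tighten confidence intervals, and why reducing the variance itself is attractive.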

Waiting for more samples delays your ability to make an informed decision, and it doesn't guarantee you'll observe a statistically significant result when there is a real effect.

Even at companies with immense scale like Facebook and Amazon, people have to deal with the pain of waiting for experiments to enroll users and mature because they're usually looking for relatively small effects.

Consider this: A 0.1% increase to revenue at Facebook is worth upwards of $100 million per year!

For smaller companies, small effect sizes can become infeasible to measure. It would just take too long to get the sample needed to reliably observe a statistically significant change in their target metric.

Because of this cost, a number of methods have been developed in order to decrease the standard error for the same metric and sample size.

CUPED is an extremely popular implementation that uses pre-experiment data to explain away some of the variance in the result data.

The statistical concept behind CUPED

Like many things in experimentation, the core concept behind CUPED is simple, but its implementation can be tricky (and expensive!).

The guiding principle of CUPED is that not all variance in an experiment is random. In fact, a lot of the differences in user outcomes are based on pre-existing factors that have nothing to do with the experiment.

Let's talk about this for a minute:

Say we want to run a test to see if people run slower with weights attached to them. From a physics perspective, the answer seems pretty obvious. We might record data like this:

[Figure: test group 1]

If we average out our results, we might clearly see the expected effect, but we might not; there's a lot of variance and overlap in the observed mile times. It should be pretty clear, however, that how fast the runners already were might be an underlying factor. What if we asked them to run a mile a week ago to establish a baseline?

[Figure: test group 2]

In the context of their "typical" mile time, this effect should be much clearer! We've implicitly switched from caring about their raw "mile time" to caring about the difference from what we'd expect!

By doing this, we've also "explained" some of the noise and variance in the experiment metric. Before, we saw a difference of 140 seconds between the fastest and slowest runner. Now, we've reduced the range in our metric to 65 seconds; this lower range should mean that the variance we'd use to calculate confidence intervals and p-values will be lower.

This is conceptually very similar to the original implementation of CUPED; we use the pre-experiment data for a metric to normalize the post-experimental values. How much we normalize is based on how well the pre-experiment data predicts the experiment data - we'll dive into this later.

Bias correction

Because experimental groups are randomly assigned, there's a chance that the two groups randomly have different baseline run times. If you're unlucky, that difference could even be statistically significant. This means that even if the weights did nothing, you might conclude that there's a difference between the two groups.

If you have access to that baseline data, it'd be possible to conclude that there was a pre-existing difference and be wary of the results. In the example below, it's pretty obvious that the difference in the groups before the test would make the results extremely skewed:

[Figure: average mile time, weights versus no weights, CUPED data example]

You might note that the weighted runners' times went up, and the unweighted runners' times went down. This relative change does match our expectation. Would it be possible to infer that there is an effect here? Correcting this data with CUPED can help!

Correction

Conceptually, if one group has a faster average baseline, their experiment results will also be faster. When we apply a CUPED correction, the faster group's metric will be adjusted downwards relative to the slower group.

In this example, the post-adjustment averages might move something like this, pushing the weights group's experiment value higher than the control group's. We could follow up with a statistical test to understand if the difference in adjusted values is statistically significant.

[Figure: mile time, weights versus no weights, after CUPED applied]

Stratification

Some variants of CUPED are 'non-parametric' or 'bucketed'. What this usually means is that (in this example) we would split users into groups based on their pre-experiment run times, and measure metrics relative to the average metric value of that group.

For example, consider the data below - this is for the bucket of users who ran between a 6:30 and 6:40 mile in the baseline:

[Figure: test group CUPED data stratification]
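As a rough sketch of the bucketed idea, the snippet below assigns each runner to a 10-second bucket of baseline time and expresses their experiment time relative to that bucket's average. The mile times, the fixed-width bucketing rule, and the function names are all hypothetical choices made for illustration; real implementations may bucket and weight strata differently.

```python
from collections import defaultdict
from statistics import mean

def bucket_of(pre_value: float, width: float) -> int:
    """Assign a pre-experiment value to a fixed-width bucket (an assumed bucketing rule)."""
    return int(pre_value // width)

def stratified_adjust(users, width=10.0):
    """Return each user's metric relative to the mean metric of their baseline bucket.

    `users` is a list of (group, pre_value, metric) tuples. Bucket means are computed
    across both groups, so the adjustment does not depend on group assignment.
    """
    by_bucket = defaultdict(list)
    for _, pre, y in users:
        by_bucket[bucket_of(pre, width)].append(y)
    bucket_mean = {b: mean(ys) for b, ys in by_bucket.items()}
    return [(g, y - bucket_mean[bucket_of(pre, width)]) for g, pre, y in users]

# Hypothetical mile times in seconds: (group, baseline time, experiment time)
users = [
    ("control", 392, 395), ("test", 395, 410),
    ("control", 455, 450), ("test", 458, 470),
]
adjusted = stratified_adjust(users)
for group, value in adjusted:
    print(group, value)
```

In this toy data the raw experiment times span 75 seconds, while the bucket-relative values span only 20 seconds, which is the same variance-reduction intuition as before.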

Other variables

More complex implementations of CUPED don't just rely on a single historical data point for the same metric. They can pull in other information as well, as long as it's independent of the experiment group the user is in.

In the example above, we could add age group as a factor in the experimentation. This has relatively little to do with our experiment, but could be a major factor in people's mile times! By including this as a factor in CUPED, we can reduce even more variance.

[Figure: test group CUPED data, other variables]

Using CUPED in practice

In practice, we can't just subtract out a user's prior values from their experimental values. The reason for this is also conceptually simple: people's past behavior isn't always a perfect predictor of their future behavior.

A mental model for the math we'll use

Before we go further, it's useful to understand the relationship between experimentation and regression (the ordinary-least-squares or "OLS" regression you'd run in Excel).

A T-test for a given metric is mathematically equivalent to running a regression where the dependent variable is your metric and the independent variable is a user's experiment group. To demonstrate this, I generated some data for the example experiment above, where users' paces are based on a randomly-assigned baseline pace and whether they're in the test group.

The population statistics for this are:

[Figure: population statistics, sample data]

Let's compare the outputs of running a T-test and running an OLS where we use the 1-or-0 test flag as the independent variable.

T-test:

[Figure: T-test output, sample data]

OLS:

[Figure: OLS output, sample data]

Comparing these, we notice a lot of similarities:

  • The effect size in our T-test (the delta between test and control) is exactly the same as the "test" variable's coefficient in the OLS regression.

  • The standard error for the coefficient is the same as the standard error for our T-test.

  • The p-value for the "test" variable coefficient is the same as for our T-test!

In short, our standard T-test is basically a regression against a 1-or-0 variable!
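The equivalence can be checked numerically. The sketch below uses simulated paces (the means, spread, and sample sizes are assumptions, not the article's dataset) and only the Python standard library. Note that the match is exact for the equal-variance (pooled) T-test; Welch's unequal-variance version will differ slightly.

```python
import math
import random
from statistics import mean

def pooled_t_test(control, test):
    """Difference in means and its standard error under the equal-variance T-test."""
    n_c, n_t = len(control), len(test)
    m_c, m_t = mean(control), mean(test)
    ss = sum((y - m_c) ** 2 for y in control) + sum((y - m_t) ** 2 for y in test)
    pooled_var = ss / (n_c + n_t - 2)
    return m_t - m_c, math.sqrt(pooled_var * (1 / n_c + 1 / n_t))

def ols_on_flag(flags, ys):
    """Slope and slope standard error for the simple regression y = a + b * flag."""
    n = len(ys)
    f_bar, y_bar = mean(flags), mean(ys)
    sxx = sum((f - f_bar) ** 2 for f in flags)
    slope = sum((f - f_bar) * (y - y_bar) for f, y in zip(flags, ys)) / sxx
    intercept = y_bar - slope * f_bar
    rss = sum((y - intercept - slope * f) ** 2 for f, y in zip(flags, ys))
    return slope, math.sqrt(rss / (n - 2) / sxx)

# Simulated mile paces in seconds (hypothetical parameters)
rng = random.Random(0)
control = [rng.gauss(420, 40) for _ in range(500)]
test = [rng.gauss(425, 40) for _ in range(500)]
flags = [0] * len(control) + [1] * len(test)

delta, se_t = pooled_t_test(control, test)
slope, se_ols = ols_on_flag(flags, control + test)
print(delta, slope)   # same effect estimate
print(se_t, se_ols)   # same standard error
```

Because the estimate and standard error agree, the t-statistic and p-value agree as well.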


When we want to make regressions more accurate, we might add relevant explanatory variables. We can do the same for our test; again, this is the core concept behind CUPED.

Let's include baseline pace as a factor in our regression. We should expect this to change the regression quite a bit, since it's such a powerful explanatory variable, and it does.

[Figure: CUPED regression output]

Let's review:

  • The "test" variable's coefficient (the estimate of the experiment effect) didn't change much. That's expected - unless there was a significant difference between the groups before the experiment, we should get a similar estimate of the experiment effect.

  • The standard error went down from 4.73 to 2.13, and the p-value fell with it. This is because a lot of the noise we previously attributed to our test variable wasn't random: it was coming from users having different baselines, which we're now accounting for!

  • Our p-value goes from 0.116 to a value that rounds to 0.000 because of the decreased standard error. The result, which was previously not statistically significant, is now clearly significant.

Using CUPED with the baseline pace achieves nearly-identical results. To visualize the reduction in Variance/Standard Error, I plotted the distribution of user paces from this sample dataset before and after I applied CUPED:

[Figure: before CUPED vs. after CUPED example]

When we apply CUPED, we see a large reduction in variance and p-value, just like in the regression results. Using the pre-experiment data reduced the variance, p-value, and the data we would need to consistently see this result.

CUPED math and implementation

For more details on this, please refer to the 2013 Microsoft white paper. We've used many formulas that appear in that paper here.

To reduce variance by using other variables, we'll need to make adjustments such that we end up with an unbiased estimator of group means that we'll use in our calculations. An unbiased estimator simply means that the expected value of the estimator is equal to the true value of the parameter we're estimating.

In practice, this means we need to pick an adjustment that is independent of which test group a user is assigned to.

For the original, simplest implementation of CUPED we'll refer to our pre-experiment values as X and our experiment values as Y. We'll adjust Y to get a covariate-adjusted Ycv according to the formula below:

$$\hat{Y}_{cv} = \bar{Y} - \theta \bar{X} + \theta E[X]$$

Here, θ could be any derived constant. What this equation means is that, for any θ, we can take two steps:

  • Multiply the pre-experiment population mean by θ and add it to each user's result

  • Subtract from each user's result θ multiplied by their pre-experiment value

This gives us an unbiased estimator Ycv which factors the covariate into our estimates. We can calculate the variance of the new estimator term:

$$var(\hat{Y}_{cv}) = var(\bar{Y}) + \theta^2 \, var(\bar{X}) - 2\theta \, cov(\bar{Y}, \bar{X})$$

This is the variance of our adjusted estimator for Y. This variance turns out to be the smallest for:

$$\theta = \frac{cov(Y, X)}{var(X)}$$

This is the term we'd use to calculate the slope in an OLS regression! This is also the term we'll end up using in our data transformation - we take all the data in the experiment and calculate this theta. The final variance for our estimator is

$$var(\hat{Y}_{cv}) = var(\bar{Y}) \, (1 - \rho^2)$$

where ρ is the correlation between X and Y. The correlation between the pre-experiment and post-experiment data is directly linked to how much the variance is reduced. Note that since ρ is bounded within [-1, 1], ρ² lies in [0, 1], so this new variance will always be less than or equal to the original variance.
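The relationship between the correlation and the variance reduction can be verified on simulated data. In this sketch the pre-experiment and experiment values are generated with an arbitrary, assumed relationship; `covariance` is a small helper defined here, not a library call.

```python
import random
from statistics import mean

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator); covariance(xs, xs) is the sample variance."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(1)
x = [rng.gauss(420, 40) for _ in range(2_000)]   # pre-experiment values (simulated)
y = [0.8 * xi + rng.gauss(90, 25) for xi in x]   # experiment values correlated with x

var_x, var_y, cov_xy = covariance(x, x), covariance(y, y), covariance(x, y)
theta = cov_xy / var_x                           # the variance-minimizing theta
rho_sq = cov_xy ** 2 / (var_x * var_y)           # squared correlation between X and Y

x_bar = mean(x)
y_cv = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
var_y_cv = covariance(y_cv, y_cv)

print(var_y, var_y_cv)                # adjusted variance is smaller
print(var_y_cv / var_y, 1 - rho_sq)   # the two ratios agree
```

With θ estimated from the same sample, the identity holds exactly up to floating-point error, so the stronger the correlation, the larger the reduction.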

In practice

To create a data pipeline for the basic form of CUPED, you need to carry out the following steps, with X referring to pre-experiment data points and Y referring to experiment data points:

  • Calculate the covariance between Y and X as well as the variance and mean of X. Use this to calculate θ per the formula above.

    • This requires that users without pre- or post-experiment data are included as 0s if they are to be included in the adjustment

  • For each user, calculate the user's individual pre-experiment value. It's common to choose to not apply an adjustment for users who are not eligible for pre-experiment data (for example new users) - this is effectively a one-level stratification.

  • Join the population statistics to the user-level data

  • Calculate each user's adjusted value as Y + θ * (population mean of X) - θ * X

  • Run and interpret your statistical analysis as you normally would, using the adjusted metrics as your inputs
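The steps above can be sketched end to end in a few lines. This is a minimal illustration on simulated data, not a production pipeline: the user tuples, the effect size of 5 seconds, and the helper names (`cuped_adjust`, `diff_and_se`, `split`) are assumptions made for the example.

```python
import math
import random
from statistics import mean, variance

def cuped_adjust(pre, post):
    """Return adjusted values y + theta * mean(x) - theta * x, with theta pooled over all users."""
    x_bar, y_bar = mean(pre), mean(post)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov_xy / variance(pre)
    return [y + theta * x_bar - theta * x for x, y in zip(pre, post)]

def diff_and_se(control, test):
    """Difference in means with an unpooled standard error."""
    se = math.sqrt(variance(control) / len(control) + variance(test) / len(test))
    return mean(test) - mean(control), se

# Simulated users: (is_test, pre-experiment pace, experiment pace)
rng = random.Random(42)
users = []
for _ in range(2_000):
    is_test = rng.random() < 0.5
    pre = rng.gauss(420, 40)
    post = pre + rng.gauss(0, 15) + (5 if is_test else 0)
    users.append((is_test, pre, post))

pre_all = [u[1] for u in users]
post_all = [u[2] for u in users]
adjusted = cuped_adjust(pre_all, post_all)  # statistics come from the whole population

def split(values):
    """Split per-user values into (control, test) lists using each user's flag."""
    control = [v for (flag, _, _), v in zip(users, values) if not flag]
    test = [v for (flag, _, _), v in zip(users, values) if flag]
    return control, test

raw_delta, raw_se = diff_and_se(*split(post_all))
cuped_delta, cuped_se = diff_and_se(*split(adjusted))
print(raw_delta, raw_se)
print(cuped_delta, cuped_se)  # similar delta, noticeably smaller standard error
```

The adjustment leaves the overall population mean unchanged, so the adjusted values can be fed into the same statistical test as the raw ones.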

Implications from the CUPED math (above):

There are many covariates we could use for variance reduction; the main requirement is that each covariate is independent of the experiment group the user is assigned to. Generally, data from before an experiment is safest.

We commonly use the same metric from before the experiment as a covariate because in practice it's usually a very effective predictor, and it makes intuitive sense in most cases.

We should calculate the group statistics for the pre-experiment/post-experiment data across the entire experiment population, not on a per-group basis, because it's possible there's an interaction effect between the treatment and the pre-exposure data. For example, users who run faster may be better equipped to run with weights, and so the correlation between the pre- and post-periods would be different for them than for slower users.

New users won't have pre-experiment data. An experiment with no pre-experiment data won't be able to leverage CUPED. In these cases, the best bet is to use covariates like demographics if possible.

If an experiment has some new users and some established users, you can use CUPED and split the population by another binary covariate: Do they have pre-experiment data or not? Functionally, this means you just apply CUPED only on users with pre-experiment data as discussed above.
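One way to express this split in code is to estimate θ only from users who have pre-experiment data and leave everyone else's value untouched. The sketch below assumes new users are marked with `None` for their pre-experiment value; the function name and the sample rows are hypothetical.

```python
from statistics import mean, variance

def cuped_with_new_users(rows):
    """Adjust only users that have pre-experiment data; new users keep their raw value.

    `rows` is a list of (pre, post) pairs where `pre` is None for new users.
    """
    established = [(x, y) for x, y in rows if x is not None]
    if len(established) < 2:
        return [y for _, y in rows]  # not enough history to estimate theta
    xs = [x for x, _ in established]
    ys = [y for _, y in established]
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in established) / (len(xs) - 1)
    var_x = variance(xs)
    theta = cov_xy / var_x if var_x > 0 else 0.0
    return [y if x is None else y - theta * (x - x_bar) for x, y in rows]

# Hypothetical (pre, post) mile times; None marks a new user
rows = [(400, 410), (None, 455), (450, 452), (430, 436), (None, 390)]
adjusted = cuped_with_new_users(rows)
print(adjusted)
```

New users pass through unchanged, and the established users' average is preserved, so the two sub-populations can still be combined in a single analysis.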

CUPED best practices

  • CUPED is most effective on existing user experiments where you have access to users' historical data. For new user experiments, stratification or other covariates like demographics can be useful, but you won't be able to leverage as rich a covariate.

  • CUPED needs historical data to work; this means that you need to make sure your metric data goes back to before the start of the pre-experiment data window.

  • CUPED's ability to adjust values is based on how correlated a metric is with its past value for the same user. Some metrics will be very stable for the same user and allow for large adjustments; some are noisy over time for the same user, and you won't see much of a difference in the adjusted values.
