Embrace Overlapping A/B Tests and Avoid the Dangers of Isolating Experiments

Fri Oct 08 2021

Tim Chan

Lead Data Scientist, Statsig

At Statsig, I’ve had the pleasure of meeting many experimentalists from different backgrounds and experiences.

How to handle simultaneous experiments frequently comes up. Many experimentalists prefer isolating experiments to avoid collisions and maintain experimental integrity. Yet my experience as a Data Scientist at Facebook (a company that ran 10k+ simultaneous experiments) tells me this worry is overblown: isolating experiments can seriously restrict a company’s pace of experimentation and innovation, while still producing bad decisions.

A lot of people say you should wash your whites and colors separately. But if you want to wash other kinds of laundry, you have to separate the whites, light colors, darks, different fabric types, delicates, and all permutations in between. Alternatively, if you just mix them up and wash them together, you could be done with your laundry quickly. You only have to trust that the detergent works.

I’m here to tell you that for A/B testing and overlapping experiments, the detergent works.

Managing Multiple Experiments

[Flow chart: managing multiple experiments]

There are several strategies for managing multiple experiments:

  1. Sequential testing — Experiments are run sequentially, one after the other. You get the full power of your userbase, but it takes more time. Sometimes you will have to delay the second experiment if the first one hasn’t finished.

  2. Overlapping testing — Experiments are run simultaneously over the same set of users. All combinations of test and control are experienced by different users. This method maintains experimental power, but can reduce accuracy (as I’ll explain later, this is not a major drawback).

  3. Isolated tests — Users are split into segments, and each user is enrolled in only one experiment. Your experimental power is reduced, but the accuracy of results is maintained (a minimal assignment sketch contrasting overlapping and isolated bucketing follows this list).

  4. A/B/n testing — Two (or more) experiments are combined and launched as a single joint experiment with a shared control. Experimental power is only slightly reduced since the control group is reused, but the experiments must be launched and concluded together.

  5. Multivariate testing — This is similar to overlapping testing, but the analysis compares every combination of test and control against each other. This maximizes experimental power and accuracy but makes the analysis more complex. It does not scale well to 3 or more simultaneous tests, since the number of combinations grows exponentially (three tests already means 8 variations).

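To make the overlapping and isolated approaches concrete, here is a minimal assignment sketch in Python. It assumes deterministic bucketing via a salted hash of the user ID; the function names, salts, and bucket counts are illustrative, not Statsig’s actual implementation. Overlapping assignment hashes each user independently per experiment, so every experiment sees the full population; isolated assignment first partitions users into segments and gives each experiment only its own slice.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

# Overlapping: every experiment hashes every user independently,
# so each user participates in all experiments at once.
def overlapping_assignment(user_id: str, experiments: list[str]) -> dict[str, str]:
    return {
        exp: ("test" if bucket(user_id, salt=exp) < 50 else "control")
        for exp in experiments
    }

# Isolated: users are first split into non-overlapping segments, and each
# segment is dedicated to a single experiment.
def isolated_assignment(user_id: str, experiments: list[str]) -> dict[str, str]:
    segment = bucket(user_id, salt="isolation-layer", buckets=len(experiments))
    exp = experiments[segment]
    return {exp: ("test" if bucket(user_id, salt=exp) < 50 else "control")}

experiments = ["button_color", "background_color", "new_ranking"]
print(overlapping_assignment("user_42", experiments))  # one assignment per experiment
print(isolated_assignment("user_42", experiments))     # exactly one assignment
```

Under the overlapping scheme, each of the three experiments gets the entire user base; under the isolated scheme, each gets roughly a third of it.
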
There are tradeoffs between accuracy, experimental power (i.e., speed), and scalability. Isolated and overlapping testing are generally preferred and are quite scalable, allowing teams to run dozens of simultaneous experiments. But between the two, there is a tradeoff between speed and accuracy. The heart of the issue is that overlapping experiments maximize experimental power for every experiment, but can introduce noise as multiple experimental effects push and pull on the metrics. Additionally, the test and control experiences aren’t as concretely defined and it’s less clear what you’re measuring, which makes interpreting the experiment a little more complicated.

Furthermore, as companies scale experimentation, they will quickly encounter coordination challenges. Isolated testing will inevitably require that an experiment finish to free up users before a new test can be allocated and launched. This can introduce a bottleneck where teams are forced to end some experiments prematurely while delaying the launch of others. It’s worth noting that with isolated experiments, the residual effect from a previous experiment can sometimes affect the results of the next (i.e., a hangover effect).

When to Isolate Experiments

Isolating experiments is a useful tool in experimentation. There are situations where it is critical:

  1. Overlap is Technically Impossible: It’s sometimes impossible to overlap multiple experiences for a single user. For example, one cannot test new ranking algorithm A (vs. control) and new ranking algorithm B (vs. control) in overlapping tests, because a ranking algorithm is typically the sole decider of ranking: a user cannot be served by both A and B simultaneously.

  2. Strong Interaction Effects are Expected: Some experimental effects are non-linear; i.e., 1 + 1 ≠ 2. If effect A is +1% and effect B is +1%, combining them can sometimes lead to 0%, or +5%. Such effects can skew the readout if they’re not isolated. But as I’ll address later, isolating experiments to avoid non-linear effects can lead to wrong decisions.

  3. Breaking the User Experience: Some combinations of experiments can break the user experience, and these should be avoided. For example, a test that moves the “Save File” button to a menu and another test that simplifies the UI by hiding the menu can combine to confuse users.

  4. A Precise Measurement is the Goal: Sometimes getting an accurate read on an experimental effect is critical, and it’s not enough to simply know an effect is good or bad. At YouTube, just knowing that ads negatively affect watch time is insufficient; accurately quantifying the tradeoff is vital to understanding the business.

Interaction Effects are Often Overblown

Running overlapping experiments can increase variance and skew results, but from my experience at Facebook, strong interaction effects are rare. It is more common to find smaller non-linear results. While these can skew the final numbers, it’s rare to find two experiments collide to produce outright misleading results. Effects are generally additive which leads to clean “ship or no ship” decisions. Even when numbers are skewed, they are in the same ballpark and result in the same decisions; overlapping experiments can be trusted to generate directionally accurate results.

In all honesty, since most companies are continuously shipping features, it’s not possible to know the true combined effect from independent, isolated tests. Getting an accurate read is better left to long-term holdouts. Furthermore, interaction effects that lead to broken experiences are fairly predictable. Broken experiences or strong interaction effects typically occur at the surface level, e.g., a single web page or a feature within an app. At most companies, these are usually controlled by a single team, and teams are generally aware of what experiments are running and planned. I have found that interaction effects are easily anticipated and can be managed at the team level.

Isolating Experiments Slows You Down

“Our success is a function of how many experiments we do per year, per month, per week, per day.” — Jeff Bezos

It’s well known that companies like Facebook, Google, Microsoft, Netflix, Airbnb and Spotify run thousands of simultaneous experiments. Their success and growth come from their speed of optimization and hypothesis testing. But even with billions of users, isolating 10k simultaneous experiments averages out to only about 100k users per experiment. And as I’ve described previously, these companies are hunting for sub-percentage wins (<1%): 100k users is simply not enough. Suddenly their experimental superpower (pun intended) is dramatically reduced. How do these companies do it?
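
To see why ~100k users falls short, here is a rough back-of-the-envelope power calculation. It is only a sketch: it assumes a 5% baseline conversion rate and the common rule of thumb n ≈ 16·p(1−p)/δ² per arm for 80% power at α = 0.05; the numbers are illustrative, not Facebook’s actual metrics.

```python
# Approximate sample size per arm for a two-proportion test at 80% power
# and alpha = 0.05, using the rule of thumb n ≈ 16 * p(1-p) / delta^2.
def users_per_arm(baseline_rate: float, relative_lift: float) -> int:
    delta = baseline_rate * relative_lift          # absolute effect to detect
    return int(16 * baseline_rate * (1 - baseline_rate) / delta**2)

baseline = 0.05   # assumed 5% conversion rate
lift = 0.01       # hunting for a 1% relative win

n = users_per_arm(baseline, lift)
print(f"~{n:,} users per arm, ~{2 * n:,} total")   # ~3,040,000 per arm, ~6,080,000 total
```

Under these assumptions, an isolated slice of ~100k users is dozens of times too small to detect a 1% relative lift.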

At Facebook, each team independently managed their own experiments and rarely coordinated across the company. Every user, ad, and piece of content was part of thousands of simultaneous experiments. This works because strong interaction effects are quite rare and typically occur at the feature level, where they are usually managed by a single team. This allows teams to increase their experimental power to hundreds of millions of users, ensuring experiments run quicker and features are optimized faster.

Overlapping experiments do sacrifice precision. However, last month’s precise experimental results have probably lost their precision by now. Seasonal effects come into play, and your user base is constantly changing due to external factors. And if you’ve embraced rapid optimization, your product is also changing. In almost all cases, precise and accurate measurements are just an illusion.

For the vast majority of experiments, you only need to determine whether an experiment is “good“ or “not good” for your users and business goals. Ship vs. no-ship decisions are binary, and directional data is sufficient. Getting an accurate read on an experimental effect is secondary. Most non-linear effects either dampen or accentuate your effects, but generally do not affect the direction.

The Fallacy of Controlling Every Effect

The secret power of randomized controlled experiments is that the randomization controls all known and unknown confounding effects. Just as randomization controls for demographic, seasonal, behavioral and external effects, it also controls for the effects of other experiments. Attempting to isolate experiments for the sake of isolation is, in my opinion, a false sense of control. The following quote from Lukas Vermeer sums this up well.

“Consider this: your competitor is probably also running at least one test on his site, and a) his test(s) can have a similar “noise effect” on your test(s) and b) the fact that the competition is not standing still means you need to run at least as many tests as he is just to keep up. The world is full of noise. Stop worrying about it and start running as fast as you can to outpace the rest!” — Lukas Vermeer, Director of Experimentation at Vistaprint

As Lukas pointed out, your competitors are surely running experiments on mutual customers and worrying about these effects will only slow you down.

Embracing Interaction Effects in Rapid Experimentation

[Diagram: example of a bad interaction effect]

I advocate for embracing interaction effects when doing rapid experimentation. Let’s say you have a team running a blue/red button test, and someone else running a blue/grey background test. In this case, there is a clear interaction effect where a blue button on a blue background leads to a broken experience, and both experiments will show blue underperforming. Isolating these experiments can maintain the integrity of results and produce a clean readout. But I would argue that if both teams then decide blue is best and ship blue, you’ll end up with a disastrous result: despite great effort, you haven’t escaped the interaction effect… in fact, you fell right into the trap. Meanwhile, if you simply ran overlapping experiments, you would have ended up with a better result, even if it’s suboptimal (both teams would have avoided blue).
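
Here is a minimal simulation of that scenario. The conversion rates are made up purely for illustration: only the blue-button-on-blue-background cell is “broken”, yet each overlapping experiment still reads blue as the loser, so both teams steer away from the trap.

```python
import random

random.seed(0)

# Assumed conversion rates for each (button, background) combination.
# Only the blue-on-blue cell is "broken"; everything else converts at 10%.
RATES = {
    ("blue", "blue"): 0.02,   # blue button on blue background: broken experience
    ("blue", "grey"): 0.10,
    ("red", "blue"):  0.10,
    ("red", "grey"):  0.10,
}

def simulate(n_users: int = 200_000) -> None:
    results = {"button": {"blue": [], "red": []},
               "background": {"blue": [], "grey": []}}
    for _ in range(n_users):
        button = random.choice(["blue", "red"])        # experiment 1: 50/50 split
        background = random.choice(["blue", "grey"])   # experiment 2: independent 50/50 split
        converted = random.random() < RATES[(button, background)]
        results["button"][button].append(converted)
        results["background"][background].append(converted)
    for exp, groups in results.items():
        readout = {g: f"{sum(v) / len(v):.3f}" for g, v in groups.items()}
        print(exp, readout)

simulate()
# Approximate output:
# button     {'blue': '0.060', 'red': '0.100'}   -> team 1 ships red
# background {'blue': '0.060', 'grey': '0.100'}  -> team 2 ships grey
```

Each readout is biased by the other test (a blue button on a grey background is actually neutral), but the direction is right, and both teams avoid shipping the broken combination.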

Best Practices for Overlapping Experiments

Top companies with a fully integrated experimentation culture generally overlap experiments by default. This unleashes their full experimental power on every experiment, allowing them to rapidly iterate. The tradeoff is some loss of experimental accuracy and a risk of collisions. The following best practices help minimize those risks:

  1. Avoid Severe Experimental Collisions: Some amount of risk tolerance is needed, as over-worrying about experimental collisions severely dampens speed and scale. The truth is that most experimental collisions can be anticipated in advance and managed by small teams. Adding some simple safeguards and team processes can be quite effective at minimizing this risk without compromising speed. Some tools include sharing the experimentation roadmap and having infrastructure that supports isolation when necessary. It can also be helpful to add monitoring that detects interaction effects (a minimal detection sketch follows this list).

  2. Prioritize Directionality over Accuracy: In most cases, experimentation’s primary goal is to reach a ship or no-ship decision, while measurement is purely secondary. It’s far more important to know that an experiment was “good” than whether revenue was lifted by 2.0% or 2.8%. Chasing accuracy can be quite misleading.

  3. Special Strategies When Precision Matters: In cases where precision matters, try alternative strategies. Long-term experiments (e.g., holdbacks) or multivariate experimentation are great at getting precise results while keeping teams moving.

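As a rough illustration of the interaction monitoring mentioned in point 1, here is a sketch that estimates an interaction term from the 2x2 breakdown of two overlapping experiments. The function and the metric means are assumptions for illustration, not a Statsig API: if the two effects were purely additive, the term would be near zero.

```python
def interaction_term(means: dict[tuple[str, str], float]) -> float:
    """Difference-in-differences across two overlapping experiments.

    `means` maps (group_in_exp_A, group_in_exp_B) -> metric mean.
    If the two effects are additive, the returned value is ~0;
    a large magnitude suggests an interaction worth investigating.
    """
    return (
        (means[("test", "test")] - means[("test", "control")])
        - (means[("control", "test")] - means[("control", "control")])
    )

# Hypothetical metric means for the four exposure cells.
means = {
    ("control", "control"): 0.100,
    ("test",    "control"): 0.102,   # experiment A alone: +0.002
    ("control", "test"):    0.101,   # experiment B alone: +0.001
    ("test",    "test"):    0.096,   # together: worse than either alone
}

print(f"interaction term: {interaction_term(means):+.3f}")  # -0.007 -> flag for review
```

In practice you would also want a variance estimate on this term before acting on it, since with independent 50/50 splits each of the four cells only sees about a quarter of the traffic.
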
Suggested Reading:

  1. Microsoft’s Online Controlled Experiments At Scale

  2. How Google Conducts More Experiments

  3. Can You Run Multiple AB Tests At The Same Time?

  4. Large Scale Experimentation at Spotify

  5. Running Multiple A/B Tests at The Same Time: Do’s and Don’ts

Thanks to engin akyurt on Unsplash for the cool laundry photo!
