How to handle simultaneous experiments frequently comes up. Many experimentalists prefer isolating experiments to avoid collisions and maintain experimental integrity. Yet my experience as a Data Scientist at Facebook (a company that ran 10k+ simultaneous experiments) tells me this worry is overblown: isolating experiments can seriously restrict a company’s pace of experimentation and innovation, while still producing bad decisions.
A lot of people say you should wash your whites and colors separately. But if you want to be truly careful with your laundry, you have to separate the whites, light colors, darks, different fabric types, delicates, and every permutation in between. Alternatively, if you just mix everything up and wash it together, you can be done with your laundry quickly. You only have to trust that the detergent works.
I’m here to tell you that for A/B testing and overlapping experiments, the detergent works.
There are several strategies for managing multiple experiments:
Sequential testing — Experiments are run sequentially, one after the other. You get the full power of your userbase, but it takes more time, and sometimes you have to delay the second experiment if the first one hasn’t finished.
Overlapping testing — Experiments are run simultaneously over the same set of users, so different users experience all combinations of test and control. This method maintains experimental power but can reduce accuracy (as I’ll explain later, this is not a major drawback).
Isolated tests — Users are split into segments, and each user is enrolled in only one experiment. Your experimental power is reduced, but the accuracy of results is maintained (see the assignment sketch after this list).
A/B/n testing — Multiple variants are launched together as a single joint experiment against a shared control. Experimental power is only slightly reduced because the control group is reused, but the experiments must be launched and concluded together.
Multivariate testing — This is similar to overlapping testing, but the analysis compares all combinations of test and control against each other. This maximizes experimental power and accuracy but makes the analysis more complex. The method does not scale well to three or more tests (three tests already mean eight variations).
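To make the contrast between isolated and overlapping assignment concrete, here is a minimal Python sketch of how users might be bucketed under each scheme. It’s an illustration with made-up salts and experiment names, not how any particular platform implements assignment.

```python
import hashlib

def bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Deterministically hash a user into one of num_buckets buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def isolated_assignment(user_id: str, experiments: list[str]) -> dict:
    """Isolated: one shared hash splits users into disjoint segments,
    so each user sees at most one experiment."""
    segment = bucket(user_id, salt="isolation-layer", num_buckets=len(experiments))
    chosen = experiments[segment]
    variant = "test" if bucket(user_id, salt=chosen, num_buckets=2) == 0 else "control"
    return {chosen: variant}

def overlapping_assignment(user_id: str, experiments: list[str]) -> dict:
    """Overlapping: each experiment hashes independently with its own salt,
    so every user is in every experiment and all combinations occur."""
    return {
        exp: "test" if bucket(user_id, salt=exp, num_buckets=2) == 0 else "control"
        for exp in experiments
    }

experiments = ["new_ranking", "blue_button", "simplified_menu"]
print(isolated_assignment("user_42", experiments))     # one experiment, one variant
print(overlapping_assignment("user_42", experiments))  # one variant per experiment
```

The key difference is the salt: isolated assignment hashes everyone once into disjoint segments, while overlapping assignment gives each experiment its own independent hash over the full userbase.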
There are tradeoffs between accuracy, experimental power (i.e., speed), and scalability. Isolated and overlapping testing are generally preferred and are quite scalable, allowing teams to run dozens of simultaneous experiments. But between the two, there are tradeoffs between speed and accuracy. The heart of the issue is that overlapping experiments maximize experimental power for every experiment, but can introduce noise as multiple experimental effects push and pull on the metrics. Additionally, the test and control experiences aren’t as concretely defined and it’s less clear what you’re measuring; this makes interpretation of the experiment a little more complicated.
Furthermore, as companies scale experimentation, they quickly encounter coordination challenges. Isolated testing inevitably requires that an experiment finish to free up users before a new test can be allocated and launched. This can introduce a bottleneck where teams are forced to end some experiments prematurely while delaying the launch of others. It’s also worth noting that with isolated experiments, the residual effect from a previous experiment can sometimes affect the results of the next (i.e., a hangover effect).
Isolating experiments is a useful tool in experimentation. There are situations where it is critical:
Overlap is Technically Impossible: It’s sometimes impossible to overlap multiple experiences for a single user. For example, one cannot test new ranking algorithm A (vs. control) and new ranking algorithm B (vs. control) in overlapping tests: a ranking algorithm is typically the sole decider of the ranking, so A and B cannot run simultaneously for the same user.
Strong Interaction Effects are Expected: Some experimental effects are non-linear; i.e., 1 + 1 ≠ 2. If effect A is +1% and effect B is +1%, combining them can sometimes lead to 0%, or +5% (see the worked example after this list). Such effects can skew the readout if they’re not isolated. But as I’ll address later, isolating experiments to avoid non-linear effects can lead to wrong decisions.
Breaking the User Experience: Some combinations of experiments can break the user experience and these should be avoided. For example, a test that moves the “Save File” Button to a menu, and another test that simplifies the UI by hiding the menu can combine to confuse users.
A Precise Measurement is the Goal: Sometimes getting an accurate read on an experimental effect is critical, and it’s not enough to simply know whether an effect is good or bad. At YouTube, just knowing that ads negatively affect watch time is insufficient; accurately quantifying the tradeoff is vital to understanding the business.
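To illustrate the non-linear interaction point above, here is a tiny worked example with made-up lift numbers, showing how the combined effect of two tests can fall well short of the sum of their individual effects.

```python
# Hypothetical 2x2 multivariate readout (made-up lift numbers, in %).
# Keys are (experiment A arm, experiment B arm).
lifts = {
    ("control", "control"): 0.0,
    ("test",    "control"): 1.0,   # A alone: +1%
    ("control", "test"):    1.0,   # B alone: +1%
    ("test",    "test"):    0.2,   # A and B together: only +0.2%
}

additive_prediction = lifts[("test", "control")] + lifts[("control", "test")]
observed_combined = lifts[("test", "test")]
interaction = observed_combined - additive_prediction

print(f"Additive prediction: +{additive_prediction:.1f}%")
print(f"Observed combined:   +{observed_combined:.1f}%")
print(f"Interaction effect:  {interaction:+.1f}%")  # -1.8%: strongly sub-additive
```

If a strong interaction like this is genuinely expected, isolation (or a full multivariate readout) is the way to see it clearly.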
Running overlapping experiments can increase variance and skew results, but from my experience at Facebook, strong interaction effects are rare. It is more common to find smaller non-linear results. While these can skew the final numbers, it’s rare to find two experiments collide to produce outright misleading results. Effects are generally additive which leads to clean “ship or no ship” decisions. Even when numbers are skewed, they are in the same ballpark and result in the same decisions; overlapping experiments can be trusted to generate directionally accurate results.
In all honesty, since most companies are continuously shipping features, independent isolated tests can’t tell you the true combined effect anyway. Getting an accurate read is better left to long-term holdouts. Furthermore, interaction effects that lead to broken experiences are fairly predictable. Broken experiences or strong interaction effects typically occur at the surface level, e.g., a single web page or a feature within an app. At most companies, these are usually controlled by a single team, and teams are generally aware of what experiments are running and planned. I have found that interaction effects are easily anticipated and can be managed at the team level.
“Our success is a function of how many experiments we do per year, per month, per week, per day.” — Jeff Bezos
It’s well known that companies like Facebook, Google, Microsoft, Netflix, AirBnB and Spotify run thousands of simultaneous experiments. Their success and growth come from their speed of optimization and hypothesis testing. But even with billions of users, running 10k isolated experiments only averages 100k users per experiment. And as I’ve described previously, these companies are hunting for sub-percentage wins (<1%): 100k users is simply not enough. Suddenly their experimental superpower (pun intended) is dramatically reduced. How do these companies do it?
At Facebook, each team independently managed their own experiments and rarely coordinated across the company. Every user, ad, and piece of content was part of thousands of simultaneous experiments. This works because strong interaction effects are quite rare and typically occur at the feature-level where it’s often managed by a single team. This allows teams to increase their experimental power to hundreds of millions of users, ensuring experiments run quicker and features are optimized faster.
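To put rough numbers on why 100k users isn’t enough for sub-percentage wins, here is a back-of-the-envelope power calculation. The 10% baseline conversion rate and the standard two-proportion z-test approximation are my own assumptions, not figures from any of these companies.

```python
from scipy.stats import norm

def required_n_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided z-test of proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    delta = p2 - p1
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * pooled_var / delta ** 2

# Hypothetical: a 10% baseline conversion rate and a 1% relative lift target.
n = required_n_per_group(baseline_rate=0.10, relative_lift=0.01)
print(f"~{n:,.0f} users per group")
# Roughly 1.4 million users per group (~2.8M total) -- far more than the
# 100k an isolated slice would provide, but easy with overlapping experiments.
```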
Overlapping experiments do sacrifice precision. However, last month’s precise experimental results have probably lost their precision by now. Seasonal effects come into play, and your users are changing dynamically due to external forces. And if you’ve embraced rapid optimization, your product is also changing. In almost all cases, precise and accurate measurements are just an illusion.
For the vast majority of experiments, you only need to determine whether an experiment is “good” or “not good” for your users and business goals. Ship vs. no-ship decisions are binary, and directional data is sufficient. Getting an accurate read on an experimental effect is secondary. Most non-linear effects either dampen or accentuate your results, but generally do not change the direction.
The secret power of randomized controlled experiments is that the randomization controls all known and unknown confounding effects. Just like how randomization controls for demographic, seasonal, behavioral and external effects, randomization also controls for the effects of other experiments. Attempting to isolate experiments for the sake of isolation is, in my opinion, chasing a false sense of control. The following quote from Lukas Vermeer sums this up well.
“Consider this: your competitor is probably also running at least one test on his site, and a) his test(s) can have a similar “noise effect” on your test(s) and b) the fact that the competition is not standing still means you need to run at least as many tests as he is just to keep up. The world is full of noise. Stop worrying about it and start running as fast as you can to outpace the rest!” — Lukas Vermeer, Director of Experimentation at Vistaprint
As Lukas pointed out, your competitors are surely running experiments on mutual customers and worrying about these effects will only slow you down.
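Here is a small simulation sketch of the randomization point: when two overlapping experiments are assigned independently, the other experiment’s effect lands roughly evenly on both your test and control groups, so it adds noise but doesn’t bias your estimate. The conversion rates and effect sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent random assignment to two overlapping experiments.
in_test_a = rng.random(n) < 0.5
in_test_b = rng.random(n) < 0.5

# Hypothetical outcome: 10% baseline conversion, A adds +1pp, B adds +2pp.
conversion_prob = 0.10 + 0.01 * in_test_a + 0.02 * in_test_b
converted = rng.random(n) < conversion_prob

# Experiment A's readout ignores B entirely.
lift_a = converted[in_test_a].mean() - converted[~in_test_a].mean()
print(f"Measured lift for A: {lift_a:.4f}")
# ~0.01: B's effect averages out because roughly half of A's test group
# and half of A's control group are also in B's test group.
```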
I advocate for embracing interaction effects when doing rapid experimentation. Let’s say you have a team running a blue/red button test, and someone else running a blue/grey background test. In this case, there is a clear interaction effect where a blue button on a blue background leads to a broken experience, and both experiments will show blue underperforming. Isolating these experiments can maintain the integrity of results and produce a clean readout. But I would argue that if both teams decide blue is best and ship blue, you’ll end up with a disastrous result where despite great effort you haven’t escaped the interaction effect… in fact you fell right into the trap. Meanwhile, if you simply ran overlapping experiments, you would have ended up with a better result even if it’s suboptimal (both teams would have avoided blue).
Top companies with a fully integrated experimentation culture generally overlap experiments by default. This unleashes their full experimental power on every experiment, allowing them to rapidly iterate. It comes at the cost of some experimental accuracy and a risk of collisions. The following best practices help minimize those risks:
Avoiding Severe Experimental Collisions: Some amount of risk tolerance is needed, as over-worrying about experimental collisions severely dampens speed and scale. The truth is that most experimental collisions can be anticipated in advance and managed by small teams. Adding some simple safeguards and team processes can minimize this risk without compromising speed. Useful tools include sharing the experimentation roadmap and having infrastructure that supports isolation when necessary. It can also help to add monitoring that detects interaction effects (a minimal sketch of such a check follows this list).
Prioritize Directionality over Accuracy: In most cases, experimentation’s primary goal is to achieve a ship or no ship decision, while measurement is purely secondary. It’s far more important to know that an experiment was “good” rather than whether revenue was lifted by 2.0% or 2.8%. Chasing accuracy can be quite misleading.
Special Strategies when Precision Matters: In cases where precision matters, try alternative strategies. Long-term experiments (e.g., holdbacks) or multivariate experimentation are great at getting precise results while keeping teams moving.
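As a sketch of what interaction-effect monitoring could look like, here is a simple difference-in-differences check across the four cells of two overlapping experiments. The function name and the z-score threshold are hypothetical, and a production monitoring system would be more sophisticated.

```python
import numpy as np

def interaction_check(outcome, in_test_a, in_test_b):
    """Rough difference-in-differences check for an A x B interaction.
    Returns the interaction estimate and an approximate z-score."""
    means, variances = {}, {}
    for a in (False, True):
        for b in (False, True):
            vals = outcome[(in_test_a == a) & (in_test_b == b)]
            means[(a, b)] = vals.mean()
            variances[(a, b)] = vals.var(ddof=1) / len(vals)
    interaction = (means[(True, True)] - means[(True, False)]
                   - means[(False, True)] + means[(False, False)])
    se = np.sqrt(sum(variances.values()))
    return interaction, interaction / se

# Hypothetical usage with per-user arrays of outcomes and assignments:
# inter, z = interaction_check(converted, in_test_a, in_test_b)
# Flag for review if abs(z) is large (e.g. > 3) -- a sign two experiments collide.
```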
Thanks to engin akyurt on Unsplash for the cool laundry photo!