Pitfalls of Multi-arm Experiments

Tue May 04 2021

Tim Chan

Lead Data Scientist, Statsig

Dealing with Significance (⍺) for Multiple Test Groups

As companies become more comfortable with running A/B tests, they often consider testing more than one variation at the same time. This is called a multi-arm experiment, also known as multi-group or A/B/n (ABn) testing (not to be confused with multi-armed bandit experiments, hopefully the topic of a later blog post).

Multi-arm testing involves two or more test groups (e.g., A and B) and a control group (e.g., C). This lets you compare A against C at the same time as B against C, reusing your control group. It also affords a head-to-head comparison between the test groups (A vs. B) to evaluate differences or identify a clear statistical winner.
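To make those comparisons concrete, here's a minimal sketch of an A/B/n analysis in Python. It assumes each group's metric is a per-user numeric outcome and uses Welch's t-test from scipy; the group names and simulated data are purely illustrative, not a prescription for how your analysis should work.

```python
import numpy as np
from scipy import stats

# Simulated per-user metric values for one control and two test groups
# (illustrative numbers only, not real experiment data).
rng = np.random.default_rng(seed=42)
groups = {
    "control": rng.normal(loc=10.0, scale=2.0, size=5_000),
    "test_a": rng.normal(loc=10.1, scale=2.0, size=5_000),
    "test_b": rng.normal(loc=10.2, scale=2.0, size=5_000),
}

# The control group is reused for every test arm, plus a head-to-head comparison.
comparisons = [("test_a", "control"), ("test_b", "control"), ("test_a", "test_b")]
for a, b in comparisons:
    _, p_value = stats.ttest_ind(groups[a], groups[b], equal_var=False)  # Welch's t-test
    print(f"{a} vs {b}: p = {p_value:.4f}")
```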

This is a really powerful tool in an experimentalist's arsenal: it can reduce sample size, cost, and time while testing multiple hypotheses in parallel. However, there is one major pitfall to watch out for.

Significance is Much More Significant

When I worked in drug research, our biologists would present in vitro (test tube) results during our weekly meetings. Their slides would show the results from the 20 or so latest experimental drug molecules, all compared to a control molecule (typically a competitor's drug), complete with p-values and confidence intervals. Every week we got excited over any molecule that showed statistically significant results, but were frequently disappointed when we couldn't reproduce those results the following week. This is called a Type I error (also known as a false positive) and occurs when a significant difference arises from statistical randomness rather than any actual meaningful difference.

[Figure: actual versus experimental result]

We shouldn't have been surprised: we set our significance level (⍺) to 0.05, which means each comparison had a 5% chance of showing significant results purely by chance when no difference actually exists. Testing 20 compounds per week all but guaranteed we would make this error weekly. There are two common solutions to this problem, but I recommend a third, more practical one.
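The arithmetic behind that weekly disappointment is simple: with 20 independent comparisons at ⍺ = 0.05, the chance of at least one false positive is 1 − (1 − ⍺)²⁰, roughly 64%. A quick back-of-the-envelope check:

```python
alpha = 0.05
n_comparisons = 20  # roughly how many molecules were compared each week

# Probability of at least one Type I error across all comparisons,
# assuming the comparisons are independent.
p_at_least_one = 1 - (1 - alpha) ** n_comparisons
print(f"{p_at_least_one:.1%}")  # ~64.2%
```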

Solution #1: Apply a Bonferroni correction

Most statisticians (and textbooks) will suggest you apply a Bonferroni correction to the significance level: divide ⍺ by the number of comparisons (or number of test groups). If you are running two test groups with ⍺ = 0.05, you cut your significance level in half, to 2.5%. If you are running 20 trials, you cut it by a factor of 20, from 5% to 0.25%. This lowers the chance of a false positive on any individual comparison and keeps the overall (family-wise) Type I error rate across the experiment at or below ⍺. In my drug research example above, this means we would have made a Type I error once every 20 weeks, instead of almost every week. This doesn't come for free: applying the Bonferroni correction raises the chance of making a Type II error (a false negative), where a material difference goes undetected.
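As a sketch, the correction amounts to one line of arithmetic. The p-values below are made up for illustration:

```python
alpha = 0.05
p_values = [0.030, 0.012, 0.200]         # one p-value per test group vs. control (made up)
alpha_corrected = alpha / len(p_values)  # Bonferroni: 0.05 / 3 ≈ 0.0167

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < alpha_corrected else "not significant"
    print(f"group {i}: p = {p:.3f} -> {verdict} at corrected alpha = {alpha_corrected:.4f}")
```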

The Bonferroni correction is a risk-averse solution that errs on the side of avoiding Type I errors at the expense of Type II errors.

Solution #2: Repeat your experiment

This is what we did in my drug discovery anecdote: we reran the experiment to confirm results. The chance of making a Type I error on any single run is 5% (⍺); the chance of making the same Type I error twice in a row is 0.25% (⍺²). Repeating the experiment is the best approach when experiments are cheap and quick. But product development experiments take weeks, if not months, and reproducing results may not be possible if you don't have a fresh batch of unexposed users.
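Here's that arithmetic spelled out:

```python
alpha = 0.05
print(f"Chance of a Type I error in one run:            {alpha:.2%}")       # 5.00%
print(f"Chance of the same Type I error in two runs:    {alpha ** 2:.2%}")  # 0.25%
```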

Solution #3: Lower your Type II error rates by accepting a higher Type I error rate (my recommendation)

In product development, it's competitively important to gather directional data to make quick decisions and improve the product for your users. All hypothesis testing involves trading off accuracy for speed, so it's important to be thoughtful about how we set the statistical test parameters (i.e., ⍺). It's also important to understand that lowering Type I errors comes at the cost of raising Type II errors.

The Bonferroni correction (solution #1) purposely trades Type I errors for Type II errors. When you're making drugs that could kill people, taking a cautious stance is absolutely warranted. But deciding whether to go with the red or the blue button, or the new shopping layout versus the old, presents a very different risk. Repeating experiments (solution #2) typically takes too much time, and if your first experiment was ramped up to a large percentage of your user base, it will be a challenge to find untainted users.

My recommendation is to consider whether you are comfortable with increasing your Type I error rate in exchange for not missing an actual difference (a Type II error). Such an approach can be applied to any hypothesis test, but in my experience it is especially relevant to multi-arm experiments. I've typically seen multi-arm experiments deployed when there's evidence or a strong belief that the control/default experience needs to be replaced. Perhaps you want to test a 2.0 of your UI redesign and have three different versions in mind. Or you want to add a new feature to your ML algorithm and need to pick the ideal tuning parameters. In these scenarios, it may not make sense to give the control group an unfair advantage by lowering your significance level (⍺).

Increasing Type I error rates is most suitable when:

  1. You have prior knowledge or data that the control group is suboptimal.

  2. The real objective of the experiment is to determine the best test group.

  3. Your team/company is already committed to making a change.

At the same time, you don't want to be at the mercy of statistical noise. This can thrash your user experience, trigger unknown secondary effects, and/or create extra product work. As a rule of thumb, if you use ⍺=0.05, you should feel comfortable running up to 4 variations. This slightly biases you towards making a change, but keeps the overall Type I error rate at a reasonable level (under 20%). If you want to try more variations, I do suggest raising your bar, or Type I errors will become far too common: 10 comparisons at ⍺=0.05 produce at least one Type I error about 40% of the time (you can approximate this as 10 x 0.05 = 50% for simplicity).
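You can sanity-check this rule of thumb by computing the family-wise error rate, 1 − (1 − ⍺)ᵏ, for k comparisons (treating the comparisons as independent, which is a simplification since arms sharing a control group aren't fully independent):

```python
alpha = 0.05
for k in (1, 2, 3, 4, 5, 10):
    fwer = 1 - (1 - alpha) ** k  # chance of at least one Type I error across k comparisons
    print(f"{k:>2} comparisons: family-wise error rate = {fwer:.1%}")
```

Up to 4 comparisons the rate stays under 20%; at 10 it's roughly 40%.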

Rule of thumb: Up to 4 variations can be run at a significance level of ⍺=0.05; any more and you should probably lower your significance threshold.

Conclusion

I recommend using ⍺=0.05 by default, but there are situations where it's worth changing it. Multi-arm experiments can be such a situation, and it's important to acknowledge and understand the tradeoff between Type I and Type II errors. If you want to be cautious and maintain your Type I error rate at 5%, use a Bonferroni correction, but realize you're increasing your Type II error rate. I suggest maintaining ⍺=0.05 for individual comparisons when running 4 or fewer test groups in a multi-arm experiment, particularly when you don't want to bias the results too strongly toward the control group.

Interested in running multi-arm experiments? Statsig can help. Check us out at https://statsig.com. May all your tests be appropriately significant.

