Don’t be a Holdout holdout

Wed May 04 2022

Vineeth Madhusudanan

Product Manager, Statsig

An opinionated guide on using Holdouts

Feature Level Holdouts

Good teams move fast. They’re trying several ideas at any given time. When they find something that works, they ship it and find the next idea to try. With some features (e.g. adding app badging or notifications), it is useful to measure whether individual wins sustain after prolonged exposure.

With other features (e.g. showing ads), there may be no short-term effect, but you want to understand long-term effects. You do this by creating a holdout: you keep the feature away from 1% of your users and measure the difference between this group and the other 99% after several months. This helps ensure you’re building for the long term and not over-optimizing for short-term wins.
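
Mechanically, a holdout is just a stable assignment that the feature check consults before anything else. Here is a minimal sketch, assuming you bucket users yourself with a hash; the salt, percentage, and feature name are hypothetical, and a feature-flagging platform would normally manage this assignment for you.

```python
import hashlib

HOLDOUT_SALT = "app_badging_holdout_2022h1"  # hypothetical holdout name
HOLDOUT_PCT = 1.0                            # keep the feature from 1% of users

def in_holdout(user_id: str) -> bool:
    """Deterministically place ~1% of users into the holdout group."""
    digest = hashlib.sha256(f"{HOLDOUT_SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000         # stable bucket in 0..9999
    return bucket < HOLDOUT_PCT * 100        # first 1% of buckets

def should_show_badging(user_id: str) -> bool:
    # Holdout users never see the new feature for the life of the holdout.
    return not in_holdout(user_id)
```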

Measuring Cumulative Impact

Another key use case is measuring cumulative impact across many features. If a new shopping app ships 10 features over a quarter, each showing a 2% increase in total revenue, it’s unlikely they’d see a 20% increase at the end of the quarter. There’s often statistical noise in feature-level measurement, plus interaction and cannibalization across features. You may end up with only a 12% total win from the quarter in which you shipped 10 features. Creating a holdout across these features lets you measure the actual impact: keep a small set of users (1–2%) who don’t get any new features during this period, and compare their metrics against users who got everything you chose to ship.
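
The readout itself is just a comparison of the metric between the two groups at the end of the period. A minimal sketch, assuming per-user revenue has already been joined to holdout membership; the Welch’s t-test here is illustrative, not a prescribed analysis.

```python
import numpy as np
from scipy import stats

def cumulative_lift(revenue_rest: np.ndarray, revenue_holdout: np.ndarray):
    """Relative lift of everything shipped this period vs. the holdout group."""
    lift = revenue_rest.mean() / revenue_holdout.mean() - 1.0
    # Welch's t-test, since the two groups are very different sizes (e.g. 98% vs 2%).
    _, p_value = stats.ttest_ind(revenue_rest, revenue_holdout, equal_var=False)
    return lift, p_value
```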

At Facebook, most product teams on the core app calculate the cumulative impact of all features shipped over the last 6 months. This aligns with their goal-setting and performance review process. At the start of every period, they create a small holdout (1–5% of users). At the end of the half, they measure the impact of the features they shipped by comparing metrics against the holdout group. They then release the holdout and start a new one for the next half.

Team or product level holdouts are powerful. You can tease apart the impact of external factors (e.g. your competitor going out of business) and seasonality (atypical events including holidays, unusual news cycles or weather) from the impact driven by your feature launches. You can also measure long-term effects and quantify subtle ecosystem changes.

Costs

1. Engineering overhead. For each feature you ship with a holdout, you’re committing to support an if-then-else fork in your code. For a fast-moving team, having to support multiple code paths makes the test and debug matrix large. Shorter holdouts help keep this manageable. When your legacy code path breaks, if you don’t find and fix it swiftly, your holdout results become untrustworthy.

Typically, once an experiment is shipped (or a feature finishes rollout), you go back and clean up your code to remove the branching logic that checks the experiment or feature gate status (sketched below). When using holdouts, you save this cleanup until the holdout is retired. Many teams will make a focused push with a few engineers, instead of asking each engineer who shipped a feature to clean up the code base.
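
As a rough sketch of what that fork looks like (all names here are hypothetical, and the gate check would wrap whatever holdout assignment you use), cleanup amounts to deleting the legacy branch and the check around it:

```python
def should_show_one_click_checkout(user_id: str) -> bool:
    # Stand-in for the real feature gate / holdout check.
    return not user_id.endswith("0")

def render_checkout(user_id: str) -> str:
    if should_show_one_click_checkout(user_id):
        return "one-click checkout"            # new code path
    # Legacy path: must keep working for the life of the holdout. Cleanup means
    # deleting this branch (and usually the gate check) once the holdout retires.
    return "legacy multi-step checkout"
```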

2. Opportunity cost. When you ship features that increase revenue or retention, a large holdout means leaving those gains on the table. There’s also the dissatisfaction you cause when someone sees a friend with a shiny new feature that they don’t have.

One of the most expensive holdouts Facebook runs is a long-term ads holdout. Yes, there is a set of people who get to use Facebook without advertisements! FB values this because it helps them measure the cost of ads on engagement. It also helps them isolate the impact of ad-specific bugs.

3. Monitoring. Holdouts are typically analyzed in detail only at the end of the holdout period. It’s useful to check in on them at a regular cadence (e.g. monthly) to make sure there isn’t anything unexpected that may taint the holdout. A broken control variant impacting only 1% of your users can make the holdout useless if you only detect it at the end of the holdout period. There’s little point comparing metrics between users with new features and users with a broken experience. The act of checking the performance of the holdout group can spawn investigations to understand unexpected movements.

4. Users, Customer Support & Marketing. For people in the holdout, it is confusing to see friends get a spiffy new feature that they don’t have. It’s important to retire and create new holdouts every cycle, so the same set of users isn’t punished again and again.

Customer support needs to be able to quickly diagnose complaints from users missing a feature that’s publicly available. Marketing splashes about a new feature need care if the holdout is unreasonably large and likely to cause negative sentiment.


Setting up

The previous section outlined some key costs. Holdouts are not cheap. To make sure you get value from your holdout, here are some tips:

1. Have a clear set of questions the holdout is designed to answer. This will guide your design, the holdout duration, and the value you get, and will dictate what costs make sense to incur.

When Facebook shipped game streaming, they ran a test that invited people to join a streamer’s community while watching their video. Four weeks in, the topline results were neutral. More people joined communities, but the business metrics hadn’t moved.

The team was convinced this was the right thing to do for users and shipped the feature with a small holdout. Four months later, the holdout helped measure a double-digit increase in topline metrics from this feature.

Building communities takes time. If you have conviction in your feature, launching it and using a holdout lets you move fast while still validating your intuition over time.

2. Use a power analysis calculator to size the holdout. Holdouts measure over a longer period of time, so make them as small as is reasonable (a sketch of sizing with a power analysis follows below). Keep in mind that the final readout of the holdout will likely aggregate metrics over the last 1–4 weeks rather than the entire holdout period; this captures the final impact of all the shipped features.
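
As an illustration, here is one way to run that power analysis in Python with statsmodels; the baseline numbers and the 2% minimum detectable lift are made up, and a holdout-aware experimentation platform can do this calculation for you.

```python
from statsmodels.stats.power import NormalIndPower

baseline_mean, baseline_std = 5.0, 20.0   # e.g. weekly revenue per user, in dollars
min_detectable_lift = 0.02                # want to detect a 2% cumulative lift
effect_size = baseline_mean * min_detectable_lift / baseline_std  # standardized effect

holdout_size = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # significance level
    power=0.8,     # 80% chance of detecting the lift if it's real
    ratio=99,      # ~99 non-holdout users per holdout user (roughly a 1% holdout)
)
print(f"Need roughly {holdout_size:,.0f} users in the holdout")
```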


When Instagram started adding advertising, they started tentatively. Because they had a small ad load, they sized a large holdout so it was sensitive to small effects. When the ad business grew, they realized the holdout was oversized and way too expensive relative to its value. They ended up shrinking it dramatically. This was a non-trivial task (it took months to validate and launch, with hacky changes across multiple codebases) and quite risky: it could have ended with the main holdout the company used to validate the impact of ads becoming compromised. It’s a good reminder to think through the long-term impact of maintaining a holdout, from engineering cost to actual business cost, and to factor that into decisions about sizing and scoping, and about whether you should even create one.

3. Understand the costs associated with holdouts. Make sure teams that will pay those costs understand holdout goals and buy in.

4. Getting a holdout wrong is very expensive. You write the bad holdout off and have to wait a quarter or a half for your next try. Optimize for simple and reliable over sophisticated and complex. If you’re just getting started with holdouts, re-read this bullet.

When not to

1. Infra changes and bug fixes. These tend to be poor candidates for holdouts. The cost of supporting new and old infra can outweigh the benefits of doing this. Holding back bug fixes knowingly gives users broken experiences.

2. Cross-user features. If your feature requires that others also have it in order to work, a holdout breaks the feature. E.g. if you ship collaborative editing in a business productivity app, you’re better off holding out some organizations from the feature instead of holding back a small percentage of users in every organization and breaking the feature for them and their teammates (a sketch of org-level assignment follows below).
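
If you do want a holdout around a cross-user feature, randomize at the organization (or team) level so everyone who collaborates together sees the same thing. A minimal sketch, with a hypothetical salt and percentage:

```python
import hashlib

def org_in_holdout(org_id: str, pct: float = 2.0,
                   salt: str = "collab_editing_holdout") -> bool:
    """Deterministically hold out ~pct% of organizations."""
    digest = hashlib.sha256(f"{salt}:{org_id}".encode()).hexdigest()
    return int(digest, 16) % 10000 < pct * 100

def collaborative_editing_enabled(org_id: str) -> bool:
    # Everyone in an org gets the same answer, so nobody is left unable to
    # collaborate with teammates who have the feature.
    return not org_in_holdout(org_id)
```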

3. No org commitment. See the section on costs above. Holdouts require commitment, and if the questions your holdout is designed to answer aren’t a priority for your business, you’re better off skipping this.

4. Backtests. There is a set of features where a backtest is a more efficient way to measure impact. A backtest is effectively an after-the-fact holdout: you take a feature back from a small set of users and then compare their metrics to everyone else’s to quantify the impact.

Backtests make sense when you’re happy with the result of a feature but want to make sure it reproduces (or make sure some negative guardrail impact doesn’t reproduce). With these, you’re not as worried about short-term vs. long-term effects diverging.

This works best for infra changes that aren’t user visible, where users won’t see a feature disappear on them.

KISS

In summary: move fast, be inventive, run many experiments. If one of your many experiments goes wrong, you’ve lost only a few weeks of data collection. With holdouts, be measured. A bad holdout can cost you months of data collection before you realize it. Start simple to optimize for success. After you’ve found success with simple holdouts, evolve them to support more ambitious goals.

