Underlying AB testing is the concept of the "randomized controlled trial" (RCT), the gold standard for establishing causality.
Below is the famous hierarchy-of-evidence pyramid. Essentially, the only form of evidence stronger than an RCT is a meta-analysis of RCTs. Presenting an RCT in an argument settles the argument.
There are two technical insights that enable the power of RCTs:
With a large enough sample, randomization balances the two groups – this follows from the law of large numbers. It means we don't need to worry about differences in observable and unobservable variables: with a large sample, randomization takes care of them (see the code sketch after this list).
With randomized assignments, the difference between the treatment group and the control group is caused by the treatment.
"Caused by the treatment" is a super strong statement. In most comparisons – studies without randomization – the difference between two groups is usually a result of selection bias rather than the treatment.
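To make the first insight concrete, here is a minimal simulation sketch in Python (numpy only; the traits, numbers, and variable names are invented for illustration, not taken from any real study). It randomly splits a large population in two and checks that both an observable and an unobservable trait end up balanced across the groups:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000  # a large sample

# A simulated population with one observable and one unobservable trait.
household_income = rng.lognormal(mean=11, sigma=0.5, size=n)  # observable
motivation = rng.normal(loc=0.0, scale=1.0, size=n)           # unobservable

# Random assignment: every person flips a fair coin.
in_treatment = rng.random(n) < 0.5

for name, trait in [("income", household_income), ("motivation", motivation)]:
    t_mean = trait[in_treatment].mean()
    c_mean = trait[~in_treatment].mean()
    print(f"{name}: treatment mean = {t_mean:.3f}, control mean = {c_mean:.3f}")

# With n this large, the two group means come out nearly identical for every
# trait, measured or not -- that is the law of large numbers doing the work.
```

Rerun it with different seeds and the group means stay nearly identical, which is exactly why randomization lets us stop worrying about confounders we never measured.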
Let's walk through a quick example, which also illustrates what "random assignment" means and why it matters.
Suppose I claim that I have a magic pill that costs $100 and can increase the height of high school students by 1 inch over a year. I will show you two true results from my study:
Test group: 1000 students who voluntarily took the pill a year ago. Their average height was 60 inches a year ago and 62 inches this year.
Control group: 1000 students from the same schools and of the same age. Their average height was 60 inches a year ago and 61 inches this year.
Can we conclude that this pill is effective? We all know that such a magic pill doesn’t exist, but what’s the loophole in this study?
The loophole in this study is "selection bias." People self-selected into the treatment group. Those who volunteer for the study may come from wealthier families, since they can afford the pill, or they may be more eager to grow taller and may have tried other things besides the pill. Any such factor destroys the causality in this study.
But if we take 2000 students and assign the pill randomly, we remove the selection bias. By the law of large numbers, the average characteristics (height, wealth, height growth, eagerness to grow, etc.) of the two groups should be the same, so the difference in their height growth is caused by the treatment – the pill.
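Here is a minimal sketch of that loophole in Python (numpy only; the "eagerness" trait, effect sizes, and noise levels are invented for illustration). The pill does nothing, yet the self-selected comparison shows a positive effect while the randomized comparison correctly shows roughly zero:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 2_000

# Hidden trait: eagerness to grow taller. It drives both volunteering for the
# study and natural height growth (better diet, exercise, sleep, ...).
eagerness = rng.normal(0.0, 1.0, size=n)
true_pill_effect = 0.0  # the pill does nothing

def observed_growth(took_pill):
    # ~1 inch of natural growth, more for eager students, plus noise.
    return 1.0 + 0.5 * eagerness + true_pill_effect * took_pill + rng.normal(0.0, 0.3, n)

# Study 1: self-selection -- eager students are more likely to volunteer.
volunteered = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * eagerness))
growth = observed_growth(volunteered)
print("self-selected estimate:",
      round(growth[volunteered].mean() - growth[~volunteered].mean(), 2))

# Study 2: random assignment -- a coin flip decides who gets the pill.
randomized = rng.random(n) < 0.5
growth = observed_growth(randomized)
print("randomized estimate:",
      round(growth[randomized].mean() - growth[~randomized].mean(), 2))

# The self-selected comparison reports a clearly positive "effect" even though
# the pill does nothing; the randomized comparison reports roughly zero.
```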
Taking this example to product development, we can see how easy it is to make the same mistake every day if we don't have the mindset of AB testing. For example:
Selection bias in time series:
Claim: We shipped a feature and metrics increased 10%
Reality: The metrics would have increased 10% without the feature, such as when a Black Friday banner ships right before Black Friday (sketched in code after these examples).
Selection bias in cross sections:
Claim: We shipped a feature, and users who use the feature saw a 10% increase in their metrics
Reality: The users who self-select into using the feature would have seen a 10% increase even without it, such as when a new button is given to power users (ref: why most aha moments are wrong?)
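The cross-section case is essentially the magic-pill simulation above. For the time-series case, here is a small sketch (Python with numpy; the seasonal trend and metric values are invented). A pre/post comparison picks up the Black Friday trend and reports a lift even though the feature does nothing, while a concurrent randomized control measured over the same days sees the same trend in both arms and reports roughly zero:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
days = 14
baseline = 100.0                          # daily metric before any trend
seasonal_lift = np.linspace(0, 10, days)  # demand climbs toward Black Friday
true_feature_effect = 0.0                 # the banner itself does nothing here

# Pre/post comparison: ship the banner on day 7, compare before vs. after.
metric = baseline + seasonal_lift + rng.normal(0, 1, days)
pre, post = metric[:7].mean(), metric[7:].mean()
print(f"pre/post 'lift': {100 * (post / pre - 1):.1f}%")  # apparent lift from the trend alone

# A/B comparison: a concurrent control sees the same trend on the same days,
# so the trend cancels out and only the (zero) feature effect remains.
control = baseline + seasonal_lift + rng.normal(0, 1, days)
treatment = baseline + seasonal_lift + true_feature_effect + rng.normal(0, 1, days)
print(f"A/B lift: {100 * (treatment.mean() / control.mean() - 1):.1f}%")
```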
Beyond causality, AB testing is also a powerful measurement tool. Peter Drucker said, "If you can't measure it, you can't change it." This is especially true in large companies with lots of management friction.
Our customer story with Recroom is a great example. The company shipped a major UI revamp but saw a 30%+ decrease in their key metric. Without AB testing, they wouldn't have noticed it.
Product development is not one-time work. It is a continuous iteration that accumulates small wins. But you can't win if you can't measure wins against losses. Once people start doing AB testing, they find out that 70% - 90% of their ideas don't actually work.
Consequently, people who don’t do AB testing will ship many bad ideas without knowing it.
In short, AB testing is powerful and important because:
Humans are bad at attribution and are subject to lots of biases
Humans are bad at predicting the outcome of their ideas
AB testing provides the necessary measurement and causality and keeps us honest with reality.