Equivalence testing: Proving similarity

Mon Jun 23 2025

You've probably been there before - staring at test results that show no statistically significant difference and wondering if that means your new feature is actually equivalent to the old one. Here's the thing: traditional hypothesis testing is designed to find differences, not prove similarity.

That's where equivalence testing comes in. Instead of asking "are these two things different?", it flips the question to "are these two things similar enough?" It's a game-changer for anyone who needs to prove their new solution performs just as well as the existing one - whether that's a cheaper drug alternative, a faster algorithm, or a redesigned checkout flow.

Understanding equivalence testing

Think of equivalence tests as the statistical equivalent of proving your generic cereal tastes just as good as the name brand. Traditional tests are all about finding differences, but sometimes what you really need to know is whether two things are practically the same. This matters enormously in clinical trials - imagine you've developed a cheaper version of an expensive medication. You don't need to prove it's better; you just need to show it works just as well.

The whole approach shifts the burden of proof. Instead of starting with "these are the same" and trying to prove they're different, you start with "these are different" and work to prove they're similar within a specific range. That range - your equivalence limits - is basically your tolerance for what counts as "close enough." If you're testing whether a new API endpoint is as fast as the old one, maybe anything within 50ms is fine for your users.

Here's why this matters in the real world. Let's say you're running an A/B test at Statsig and your new feature shows a tiny 0.5% decrease in conversion. Is that actually worse, or just noise? With equivalence testing, you can confidently say "this new feature performs equivalently to the old one" if the confidence interval around that 0.5% drop falls entirely within your predefined acceptable range. It's particularly powerful for non-inferiority trials, where you're trying to prove something new (often cheaper or easier to implement) isn't worse than what you already have.

Methods and techniques in equivalence testing

The go-to method for equivalence testing is the Two One-Sided Tests (TOST) procedure. Don't let the name scare you - it's actually pretty straightforward. You pick your equivalence bounds (how much difference you're willing to tolerate), then run two one-sided tests to check if your observed effect falls within those bounds. Think of it like setting up guardrails: if your result stays between them, you've got equivalence.
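To make the mechanics concrete, here's a minimal TOST sketch in Python, applied to the earlier API-latency example with bounds of ±50 ms. The `tost_ind` helper, the simulated samples, and the pooled-variance formula are all illustrative assumptions, not anything prescribed above; libraries such as statsmodels ship ready-made TOST functions if you'd rather not hand-roll it.

```python
# Minimal TOST sketch (illustrative data and bounds, not a prescribed method)
import numpy as np
from scipy import stats

def tost_ind(x, y, low, upp, alpha=0.05):
    """Two one-sided t-tests on the difference in means (x - y).

    Equivalence is declared only if BOTH one-sided tests reject at `alpha`,
    i.e. the difference is credibly inside (low, upp).
    """
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled standard error (a Welch-style version would also work)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2

    p_lower = 1 - stats.t.cdf((diff - low) / se, df)  # H0: diff <= low
    p_upper = stats.t.cdf((diff - upp) / se, df)      # H0: diff >= upp
    p_tost = max(p_lower, p_upper)                    # overall TOST p-value
    return diff, p_tost, p_tost < alpha

# Hypothetical latency samples (ms) for the old and new endpoint
rng = np.random.default_rng(42)
old = rng.normal(220, 30, size=200)
new = rng.normal(230, 30, size=200)

diff, p, equivalent = tost_ind(new, old, low=-50, upp=50)
print(f"mean difference: {diff:.1f} ms, TOST p: {p:.4f}, equivalent: {equivalent}")
```

Both one-sided tests have to reject before you can claim equivalence, which is why the reported p-value is the larger of the two.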

Setting those bounds is where things get interesting - and where a lot of people mess up. You can't just pick numbers out of thin air. The bounds need to reflect what actually matters in your context. If you're testing recipe variations in a food study, maybe a 10% difference in taste scores is acceptable. But if you're validating a new blood pressure monitor, even a 2% difference might be too much.

Confidence intervals offer another angle on equivalence testing. Here's the trick: calculate your confidence interval around the observed difference. If that entire interval fits within your equivalence bounds, you're golden. As researchers in clinical trials have shown, this visual approach helps teams understand not just whether equivalence exists, but how confident they can be about it.
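Here's roughly what that looks like in code, using the same hypothetical latency samples and ±50 ms bounds as the sketch above. One detail worth flagging: a TOST at alpha = 0.05 corresponds to a 90% confidence interval (1 - 2×alpha), not a 95% one.

```python
# Confidence-interval view of the same comparison (illustrative data and bounds)
import numpy as np
from scipy import stats

def ci_for_mean_diff(x, y, alpha=0.05):
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    t_crit = stats.t.ppf(1 - alpha, nx + ny - 2)   # one-sided critical value
    return diff - t_crit * se, diff + t_crit * se  # 90% CI when alpha = 0.05

rng = np.random.default_rng(42)
old = rng.normal(220, 30, size=200)  # hypothetical latency samples (ms)
new = rng.normal(230, 30, size=200)

low, upp = -50, 50
lo_ci, hi_ci = ci_for_mean_diff(new, old)
equivalent = low < lo_ci and hi_ci < upp
print(f"90% CI for the difference: ({lo_ci:.1f}, {hi_ci:.1f}) ms, equivalent: {equivalent}")
```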

The beauty is that equivalence testing works across tons of scenarios:

  • Validating that a new measurement tool gives similar results to the gold standard

  • Proving a simplified algorithm performs as well as a complex one

  • Showing that removing a feature doesn't hurt user experience

  • Confirming that different manufacturing batches produce consistent quality

Just remember: those equivalence bounds are everything. Teams at Minitab discovered that poorly chosen margins can make your test either too lenient (claiming equivalence when there isn't any) or too strict (missing real equivalence). The key is grounding your bounds in practical significance, not statistical convenience.

Practical considerations and challenges

Picking the right equivalence margins is where theory meets reality - and where things often go sideways. You need domain expertise to set meaningful bounds. Statistical forums are full of people asking "how tight should my bounds be?" The answer always depends on context. A 5% margin might be huge for a financial algorithm but trivial for a recommendation system.

Things get messier when you're running multiple equivalence tests. Let's say you're testing whether a new feature performs equivalently across five different user segments. Run five separate tests and you've got a multiplicity problem - your chance of falsely claiming equivalence somewhere goes up. The stats community typically reaches for Bonferroni corrections here, though that can make your tests overly conservative.
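As a rough sketch of what that adjustment looks like in practice, here's a loop over five hypothetical segments using statsmodels' `ttost_ind`, with the per-test alpha divided by the number of segments. The segment names, metric values, and ±5-unit bounds are invented for illustration.

```python
# Bonferroni-adjusted equivalence tests across segments (illustrative data)
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

segments = ["new_users", "power_users", "mobile", "desktop", "international"]
alpha_overall = 0.05
alpha_per_test = alpha_overall / len(segments)  # Bonferroni: 0.01 per segment

rng = np.random.default_rng(7)
for seg in segments:
    old = rng.normal(100, 15, size=500)  # hypothetical metric under the old variant
    new = rng.normal(101, 15, size=500)  # hypothetical metric under the new variant
    p_tost, _, _ = ttost_ind(new, old, low=-5, upp=5)  # equivalence bounds of +/-5 units
    verdict = "equivalent" if p_tost < alpha_per_test else "not shown equivalent"
    print(f"{seg}: TOST p = {p_tost:.4f} -> {verdict} at adjusted alpha = {alpha_per_test}")
```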

Testing more than two groups adds another layer of complexity. The standard TOST approach works great for comparing A to B, but what about A to B to C? Researchers have proposed various extensions, though most involve:

  • Running pairwise comparisons with adjusted alpha levels

  • Using multivariate methods that test all groups simultaneously

  • Defining a reference group and testing all others against it

In regulatory settings, equivalence testing isn't just useful - it's often required. The FDA and similar bodies rely heavily on these tests when approving generic drugs or biosimilars. The stakes are high, which is why setting appropriate margins becomes crucial. Too wide and you risk approving inferior products; too narrow and you block perfectly good alternatives.

The biggest trap? Using margins based on what gives you the result you want rather than what makes practical sense. Teams often discover this the hard way when stakeholders question why a "statistically equivalent" change still caused customer complaints. Your bounds should come from:

  • Historical data on meaningful differences

  • Expert opinion from people who understand the domain

  • Pilot studies that reveal natural variation

  • Regulatory guidelines when they exist

Applying equivalence testing in experimentation

Here's where equivalence testing really shines - in the day-to-day grind of A/B testing and product development. Picture this: you've simplified your checkout flow to reduce code complexity, but you need to prove it doesn't hurt conversion. Traditional significance testing might show "no significant difference," but that's not the same as proving equivalence. With equivalence testing, you can confidently say "the new flow converts within 1% of the original" - exactly what your PM needs to hear.
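A sketch of what backing up that claim might involve: a normal-approximation equivalence check on two conversion rates, asking whether the 90% interval for the difference sits inside a ±1 percentage point margin. The counts and the margin are hypothetical.

```python
# Equivalence check on conversion rates via a normal approximation (made-up counts)
import numpy as np
from scipy import stats

def proportion_equivalence(conv_a, n_a, conv_b, n_b, margin=0.01, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = stats.norm.ppf(1 - alpha)             # one-sided critical value
    ci = (diff - z_crit * se, diff + z_crit * se)  # 90% CI when alpha = 0.05
    return diff, ci, (-margin < ci[0]) and (ci[1] < margin)

# Hypothetical results: original flow (a) vs. simplified flow (b)
diff, ci, equivalent = proportion_equivalence(conv_a=4150, n_a=100_000,
                                              conv_b=4080, n_b=100_000)
print(f"difference: {diff:.4f}, 90% CI: ({ci[0]:.4f}, {ci[1]:.4f}), equivalent: {equivalent}")
```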

Validating new tools and methods is another killer use case. Say you're switching from an expensive third-party analytics service to an in-house solution. You run both in parallel for a month, then use equivalence testing to prove the measurements are practically identical. Now you've got data-backed justification for the switch, not just a gut feeling that "they look pretty similar."
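Because both tools measure the same traffic on the same days, this is naturally a paired comparison. Below is a rough sketch of a paired TOST on the day-by-day relative differences, with made-up numbers and an assumed ±2% margin; statsmodels' `ttost_paired` covers similar ground if you prefer a library call.

```python
# Paired equivalence check on parallel measurements (invented data and margin)
import numpy as np
from scipy import stats

def tost_paired(diffs, low, upp, alpha=0.05):
    n = len(diffs)
    mean_d = np.mean(diffs)
    se = np.std(diffs, ddof=1) / np.sqrt(n)
    df = n - 1
    p_lower = 1 - stats.t.cdf((mean_d - low) / se, df)  # H0: mean diff <= low
    p_upper = stats.t.cdf((mean_d - upp) / se, df)      # H0: mean diff >= upp
    p_tost = max(p_lower, p_upper)
    return mean_d, p_tost, p_tost < alpha

rng = np.random.default_rng(3)
third_party = rng.normal(10_000, 500, size=30)           # daily sessions, vendor tool
in_house = third_party * rng.normal(1.0, 0.01, size=30)  # in-house tool, ~1% measurement noise
rel_diff = (in_house - third_party) / third_party        # relative difference per day

mean_d, p, equivalent = tost_paired(rel_diff, low=-0.02, upp=0.02)
print(f"mean relative difference: {mean_d:.4f}, TOST p: {p:.4f}, equivalent: {equivalent}")
```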

The same principle applies to product updates and migrations. Every time you refactor code, update dependencies, or migrate infrastructure, you're essentially betting that nothing important changes. Equivalence testing lets you validate that bet. Run tests before and after, set reasonable bounds for key metrics, and prove that your changes didn't break anything users care about.

At Statsig, teams use equivalence testing to validate platform changes don't affect experiment results. It's one thing to say "our new stats engine should give the same results" - it's another to prove it with rigorous equivalence tests across thousands of historical experiments.

Setting bounds in practice requires balancing statistical rigor with business reality:

  • Start with stakeholder input: What difference would actually matter to users or the business?

  • Look at historical variation: How much do your metrics naturally fluctuate?

  • Consider the cost of being wrong: Tighter bounds for critical metrics, wider for nice-to-haves

  • Document your reasoning: Future you will thank present you for writing down why you chose those bounds

The actual testing can use TOST or confidence interval approaches - pick based on what your team finds clearer. Just remember that the interpretation matters as much as the math. "Equivalent within 2%" means nothing if stakeholders expected "identical." Set expectations early and often.

Closing thoughts

Equivalence testing fills a critical gap in the experimenter's toolkit. While traditional tests excel at finding differences, sometimes you need to prove that things are similar enough to treat as the same. Whether you're validating a new implementation, running non-inferiority tests, or just trying to show that removing complexity didn't break anything, equivalence testing gives you the statistical backing to make those claims confidently.

The key is getting those equivalence bounds right - they should reflect real-world importance, not mathematical convenience. Take the time to understand what "close enough" actually means in your context, and you'll find equivalence testing becomes an invaluable part of your decision-making process.

Want to dive deeper? Check out Lakens' practical primer on equivalence testing or explore how platforms like Statsig handle equivalence in their experimentation framework. The stats might seem daunting at first, but once you've run your first successful equivalence test, you'll wonder how you ever shipped changes without it.

Hope you find this useful!
