Establishing baseline metrics: The starting point for every A/B test

Mon Jun 23 2025

You know that sinking feeling when your A/B test shows a 5% lift, but you have no idea if that's actually good? That's what happens when you skip baseline metrics.

I learned this the hard way at my last company - we ran tests for months without proper baselines, essentially flying blind. Turns out we were celebrating "wins" that were just normal fluctuations in our data. This guide will help you avoid that mess and actually understand what your test results mean.

Understanding the importance of baseline metrics in A/B testing

Baseline metrics are basically your performance snapshot before you start changing things. Think of them as your "before" photo in a fitness transformation - without them, you can't tell if you're actually improving or just experiencing normal ups and downs.

The Reddit community often debates what constitutes "too low" performance, and honestly, it's a fair question. You can't answer it without baselines. If your conversion rate is 2%, is that terrible? Depends - maybe your industry average is 1.5%, and you're actually doing great. Or maybe your competitors are hitting 5%, and you've got work to do.

Here's what baseline metrics actually do for you:

  • Give you a reality check on current performance

  • Help set realistic goals (no more "let's double revenue!" without data)

  • Show whether changes are meaningful or just noise

  • Keep you from celebrating fake wins

The Harvard Business Review team put it well when they noted that A/B testing transforms decision-making from gut feelings to data. But here's the thing - that transformation only works if you know where you started. Otherwise, you're just comparing one mystery number to another.

When picking your baseline metrics, focus on what actually matters to your business. Don't track 50 things just because you can. Choose metrics that align with real user needs and business goals. If you're an e-commerce site, conversion rate and average order value matter more than page views. If you're a content platform, engagement time might trump everything else.

Methods for establishing baseline metrics before testing

So how do you actually establish these baselines? Start with what you already have - your historical data. Pull up your analytics and look at your key metrics over the past 3-6 months. You're looking for patterns, not just averages. Does conversion rate tank on weekends? Does engagement spike during certain hours? This context matters.
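
If you're working from a raw export rather than a dashboard, here's a minimal pandas sketch of that analysis. It assumes a daily_metrics.csv with date, visitors, and conversions columns - those names are placeholders, so swap in whatever your analytics tool actually gives you:

```python
import pandas as pd

# Minimal baseline sketch from a daily metrics export.
# Assumes columns: date, visitors, conversions (placeholder names).
df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])
df["conversion_rate"] = df["conversions"] / df["visitors"]

# Overall baseline: the average plus how much it moves day to day
print("Baseline conversion rate: {:.2%}".format(df["conversion_rate"].mean()))
print("Day-to-day std dev:       {:.2%}".format(df["conversion_rate"].std()))

# Look for weekly patterns instead of trusting a single average
by_weekday = df.groupby(df["date"].dt.day_name())["conversion_rate"].agg(["mean", "std"])
print(by_weekday)
```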

A/A testing is another solid approach that the Statsig team recommends for validating your setup. Basically, you run a "test" where both variants are identical. Sounds pointless? It's not. This catches measurement issues before they mess up your real tests. If your A/A test shows a significant difference between identical variants, something's broken in your tracking or randomization.
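
Here's a rough sketch of what that sanity check might look like with statsmodels, using made-up counts for the two identical groups:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Both groups saw the identical experience, so a "significant" difference
# points at a tracking or randomization bug. Counts below are illustrative.
conversions = np.array([412, 431])      # conversions in group A1, group A2
visitors = np.array([13_870, 13_902])   # users assigned to each group

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("A/A groups differ significantly - investigate your setup before running real tests.")
else:
    print("No significant difference between identical variants, as expected.")
```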

Don't ignore the human element either. Numbers tell you what's happening, but not always why. Quick user interviews or surveys can reveal context you'd never see in quantitative data. Maybe that conversion dip isn't random - it's because your checkout process confuses people on mobile. A few conversations with actual users can save weeks of head-scratching over spreadsheets.

Here's a practical timeline for establishing baselines:

  1. Week 1-2: Pull and analyze historical data

  2. Week 2-3: Run A/A tests to validate your setup

  3. Week 3-4: Conduct user research to understand the "why"

  4. Week 4: Synthesize findings and document your baselines

Best practices in establishing baseline metrics

The biggest mistake I see? Using tiny sample sizes and calling it good. You need enough data for your baselines to actually mean something. The data science community on Reddit constantly debates sample sizes, and for good reason - get this wrong and your entire test is worthless.

Your sample needs to represent your actual audience. If 60% of your traffic is mobile but your baseline only includes desktop users, you're setting yourself up for confusion later. Same goes for seasonal patterns - don't establish baselines during Black Friday if you're planning to test in February.

Watch out for these common baseline pitfalls:

  • Selection bias: Cherry-picking good weeks for your baseline

  • Too short timeframes: Using one week of data when you have monthly cycles

  • Ignoring external factors: Not accounting for marketing campaigns or holidays

  • Static baselines: Never updating them as your product evolves

Your baselines aren't set in stone. Products change, users change, markets change. That baseline you established last year? Probably outdated. Review them quarterly at minimum, more often if you're in a fast-moving space. Netflix's engineering team talks about this constantly - what worked for their recommendation system five years ago would be laughable today.

The key is consistency in how you measure. Pick your methodology and stick with it. Changing how you calculate baselines mid-experiment is like switching thermometers halfway through taking someone's temperature - the numbers become meaningless.

Leveraging baseline metrics to improve A/B test outcomes

Once you have solid baselines, they become your secret weapon for running better tests. First up: sample size calculations. Without baselines, you're guessing how many users you need. With them, you can calculate exactly how long to run your test to detect meaningful changes.

Let's say your baseline conversion rate is 3% and you want to detect a 10% relative improvement (to 3.3%). For a binary metric like conversion, the variance follows directly from the rate itself, so plug those numbers into a sample size calculator with the usual defaults - 5% significance, 80% power - and boom: you need on the order of 50,000 users per variant. No more ending tests early because you got impatient.
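
If you'd rather script it than use an online calculator, here's a sketch of that same calculation with statsmodels, assuming the 5% significance and 80% power defaults mentioned above:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Sample size per variant to detect a 10% relative lift on a 3% baseline,
# two-sided test at 5% significance and 80% power (assumed defaults).
baseline_rate = 0.03
target_rate = 0.033  # 10% relative improvement

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Users needed per variant: {n_per_variant:,.0f}")  # roughly 53,000 at these settings
```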

Baselines also help you spot when something's actually different versus just random noise. If your test shows a 2% lift but your baseline naturally fluctuates by 5%, that "win" is meaningless. This saves you from implementing changes that don't actually help - and trust me, reverting features is way harder than not shipping them in the first place.
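
A quick gut-check along those lines - not a replacement for a proper significance test - might look like this, with illustrative numbers:

```python
# Compare an observed lift to how much the metric moves on its own.
# All values here are made up for illustration.
baseline_rate = 0.030       # from your baseline period
baseline_daily_sd = 0.0015  # day-to-day std dev, i.e. roughly a 5% relative swing
observed_rate = 0.0306      # test variant, a 2% relative "lift"

lift = observed_rate - baseline_rate
if abs(lift) < 2 * baseline_daily_sd:
    print("Lift is within normal fluctuation - don't celebrate yet.")
else:
    print("Lift exceeds typical noise; worth a proper statistical read.")
```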

Here's how baselines improve your testing process:

  • Calculate required sample sizes accurately

  • Set meaningful success criteria

  • Identify true winners vs. statistical noise

  • Prioritize high-impact tests

  • Learn faster by understanding variance

The product management community sometimes asks whether you need to test every single change. The answer? No, and baselines help you decide what's worth testing. If a change can't possibly move your metric beyond normal variance, skip the test. Focus your testing resources on changes with real potential impact.

As you run more tests, your baselines become increasingly valuable. They help you ask better questions. Instead of "Does this button color matter?" you can ask "Given our baseline 3% CTR with 0.2% daily variance, will this design change drive at least a 15% improvement?" That's a testable hypothesis, not a fishing expedition.

Closing thoughts

Baseline metrics aren't the sexiest part of A/B testing, but they're the foundation everything else builds on. Without them, you're essentially gambling with your product decisions. With them, you can make confident, data-driven choices that actually move the needle.

Start simple - pick 3-5 key metrics, gather a few months of historical data, and document what "normal" looks like for your product. Run an A/A test to make sure your tools work properly. Talk to some users to understand the story behind the numbers. This upfront investment pays dividends in every future test you run.

Want to dive deeper? Check out Statsig's guide on running effective A/B tests, or explore how companies like Google and Netflix approach experimentation at scale. The key is to start somewhere - even imperfect baselines beat no baselines.

Hope you find this useful!
