Effect size: Practical vs statistical significance

Mon Jun 23 2025

Ever had that moment where your A/B test comes back "statistically significant" but the actual improvement is so tiny you can barely see it? You're not alone. I've watched teams celebrate p-values under 0.05 while ignoring that their "winning" variant only moved the needle by 0.1%.

This disconnect between what statistics tell us and what actually matters in the real world is one of the biggest traps in experimentation. Let's talk about why you need both statistical and practical significance to make decisions that actually move your business forward.

Introduction to statistical and practical significance

Here's the thing about p-values - they only tell you how surprising your result would be if there were no real difference. A p-value under 0.05 means that, if your variants truly performed identically, you'd see a difference at least this large less than 5% of the time. But that says nothing about whether the difference actually matters.

That's where effect size comes in. While statistical significance answers "is this real?", practical significance (measured by effect size) answers "does this matter?" It's the difference between finding out that your new homepage technically converts better versus finding out it converts 15% better. One gets you a pat on the back at standup; the other gets you promoted.

The researchers at Scribbr put it well: effect size quantifies the actual magnitude of the difference between groups. It's the context that makes your p-values meaningful. Without it, you're essentially flying blind - you know something changed, but not whether it's worth acting on.

I learned this lesson the hard way when I spent three months optimizing a checkout flow. Every test was statistically significant (we had tons of traffic), but the cumulative impact? A whopping 0.8% improvement. Had I looked at effect sizes from the start, I could have focused on bigger opportunities.

The key insight is this: p-values depend heavily on sample size, while effect sizes don't. Run any test long enough with enough users, and you'll eventually get statistical significance. But that doesn't mean you should reorganize your entire product based on the results.

The limitations of relying solely on statistical significance

Let me paint you a picture. You're running an experiment with ten million users, and your test comes back significant at p < 0.001. The team's excited. Champagne pops. But then you look closer - the actual difference in conversion rate is 20.1% versus 20.0%.

This is the dark side of large sample sizes: they can make meaningless differences look important. Reddit's statistics community sees this confusion constantly. Statistical significance just tells you the effect probably isn't zero. It doesn't tell you if the effect is worth caring about.
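
If you want to sanity-check those numbers, here's a minimal sketch in Python (numpy + scipy) of a pooled two-proportion z-test on the hypothetical figures above. The data is made up for illustration - it's not output from any real experiment:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical numbers from the example above: ten million users, split evenly
n_per_arm = 5_000_000
p_control, p_variant = 0.200, 0.201

# Pooled two-proportion z-test
p_pooled = (p_control + p_variant) / 2
se = np.sqrt(p_pooled * (1 - p_pooled) * (2 / n_per_arm))
z = (p_variant - p_control) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

# Cohen's h: a standard effect-size measure for two proportions
h = 2 * np.arcsin(np.sqrt(p_variant)) - 2 * np.arcsin(np.sqrt(p_control))

print(f"z = {z:.2f}, p = {p_value:.5f}")  # p comes out well under 0.001
print(f"Cohen's h = {h:.4f}")             # ~0.003 - nowhere near even a 'small' effect
```

Same barely-there 0.1-point lift, rock-solid p-value - which is exactly the trap.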

Here's what typically happens:

  • You run a test with thousands of users

  • The p-value comes back significant

  • You implement the change across the board

  • Three months later, nobody can see any real impact

The University of North Texas's research methods guide shares a perfect example: a study might find that a new teaching method improves test scores by 0.5 points on a 100-point scale. Statistically significant? Sure. Worth overhauling your entire curriculum? Probably not.

The solution is deceptively simple: always calculate effect sizes. Measures like Cohen's d tell you how big the difference actually is, independent of sample size. Once you start thinking in terms of effect sizes, you'll make better decisions about what to test, what to implement, and what to ignore.

Understanding and calculating effect sizes

Effect sizes sound intimidating, but they're actually pretty straightforward. Cohen's d, the most common measure, is basically asking: "How many standard deviations apart are my two groups?" You take the difference in means and divide it by the pooled standard deviation. If your control group averages 100 conversions with a standard deviation of 10, and your variant averages 105 with a similar spread, that's a Cohen's d of 0.5 - what researchers typically call a "medium" effect.
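
Here's what that looks like in code - a minimal sketch using simulated per-user values that match the example (means of 100 and 105, a shared standard deviation of 10). The pooled-standard-deviation formula is the textbook version, not any particular tool's implementation:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=10, size=5_000)  # mean 100, sd 10
variant = rng.normal(loc=105, scale=10, size=5_000)  # mean 105, sd 10

print(f"Cohen's d = {cohens_d(control, variant):.2f}")  # ~0.5: a 'medium' effect
```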

But here's where it gets interesting. Cohen's original guidelines suggest:

  • Small effect: d = 0.2 (barely noticeable)

  • Medium effect: d = 0.5 (visible to the naked eye)

  • Large effect: d = 0.8 (you'd have to be blind to miss it)

The catch? These benchmarks aren't universal. What counts as a "large" effect in psychology research might be tiny in your product metrics. A 0.2% improvement in click-through rate could be huge for Google's search results but meaningless for your startup's landing page.

Pearson's r works differently - it measures correlation strength rather than group differences. Perfect for understanding relationships like "do users who engage with feature X tend to convert more?" Values range from -1 to 1, where 0 means no linear relationship at all.
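
scipy gives you both r and its p-value in one call. A quick sketch on simulated data standing in for "engagement with feature X" versus a conversion metric - the variable names and numbers are placeholders, not anything from a real dataset:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
engagement = rng.poisson(lam=5, size=2_000)                    # e.g. sessions touching feature X
conversions = 0.3 * engagement + rng.normal(0, 2, size=2_000)  # loosely related outcome

r, p = pearsonr(engagement, conversions)
print(f"r = {r:.2f}")  # strength and direction of the relationship
print(f"p = {p:.3g}")  # whether that relationship is distinguishable from zero
```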

Research published on ScienceDirect has found that most researchers still don't report effect sizes, which is wild considering how crucial they are for meta-analyses and power calculations. If you want your experiments to build on each other, you need effect sizes to compare results across different sample sizes and time periods.

Integrating statistical and practical significance in decision-making

So how do you actually use both metrics without getting analysis paralysis? Start by setting your thresholds before you run the test. Decide what effect size would make implementation worthwhile - this forces you to think about practical impact upfront.

Here's my framework:

  1. Calculate your minimum detectable effect - what's the smallest change that would justify the effort? (There's a quick sample-size sketch after this list.)

  2. Run your test until you hit statistical significance - but don't stop there

  3. Check the effect size and confidence intervals - is the impact big enough to care about?

  4. Consider the context - a small effect on a high-traffic feature might beat a large effect on something rarely used
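
For step 1, you can back into the traffic you'd need for a given minimum detectable effect. This is a rough sketch using the standard normal-approximation formula - the d = 0.3 and d = 0.05 thresholds below are just example inputs, not recommendations:

```python
import math
from scipy.stats import norm

def users_per_arm(min_effect_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm to detect a Cohen's d of min_effect_d."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_power = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / min_effect_d ** 2)

print(users_per_arm(0.3))   # ~175 per arm for a smallish effect
print(users_per_arm(0.05))  # ~6,280 per arm - tiny effects need far more traffic
```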

The biomedical researchers publishing in PMC have this down to a science. They report effect sizes, confidence intervals, and p-values together because each tells part of the story. The p-value says "this is probably real," the effect size says "this is how big it is," and the confidence interval says "it's probably somewhere in this range."
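
Here's roughly what reporting all three together can look like. A sketch on simulated per-user values; the 1.96 multiplier is the usual normal approximation for a 95% interval, and none of this is meant to mirror any specific journal's or platform's exact method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(100, 10, 2_000)  # simulated per-user metric values
variant = rng.normal(102, 10, 2_000)

t_stat, p_value = stats.ttest_ind(variant, control)

diff = variant.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + variant.var(ddof=1)) / 2)
d = diff / pooled_sd

se_diff = np.sqrt(control.var(ddof=1) / len(control) + variant.var(ddof=1) / len(variant))
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"p-value: {p_value:.4f}")                              # is it probably real?
print(f"Cohen's d: {d:.2f}")                                  # how big is it?
print(f"95% CI for the lift: [{ci_low:.2f}, {ci_high:.2f}]")  # where does it probably sit?
```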

The best teams I've worked with treat effect size as the primary metric and statistical significance as a quality check. They'll say things like "we're looking for at least a 5% improvement (d = 0.3), and we need to be 95% confident it's not a fluke." This clarity transforms endless debates into straightforward decisions.

At Statsig, we see this play out daily - teams that focus solely on p-values tend to ship lots of tiny "improvements" that add up to nothing. Teams that balance both metrics ship fewer changes, but each one actually moves the needle. One approach fills your roadmap; the other fills your revenue targets.

Closing thoughts

Next time someone shows you a "statistically significant" result, ask them about the effect size. It's the difference between knowing something changed and knowing whether that change matters. Your users won't notice a 0.1% improvement, no matter how many stars your p-value has.

Want to dive deeper? The Open Science Framework has excellent resources on effect size reporting, and if you're using Statsig for your experiments, the platform automatically calculates both statistical and practical significance metrics for every test.

Hope you find this useful! Remember - in the real world, practical beats statistical every time.
