Ever run an A/B test that showed "statistical significance" but left you wondering if the change actually mattered? You're not alone - this happens all the time when teams focus solely on p-values without considering effect sizes.
That's where Cohen's d comes in. It's a simple way to measure how big your experimental effect actually is, helping you figure out whether that 0.5% conversion lift is worth celebrating or if it's just statistical noise.
Effect size tells you how much something matters, not just whether it exists. While p-values answer "is there a difference?", effect sizes answer "how big is the difference?" This distinction becomes crucial when you're trying to decide whether to ship a feature or allocate engineering resources.
Cohen's d specifically measures the difference between two groups in terms of standard deviations. Think of it this way: if your control group averages 100 conversions with a standard deviation of 20, and your treatment group averages 110 conversions, Cohen's d would be 0.5 (the 10-point difference divided by the standard deviation). This standardization lets you compare effects across different metrics - you can finally answer whether a 2% increase in signups is more impactful than a 5% decrease in churn.
The beauty of Cohen's d is its simplicity. You don't need fancy statistical software or a PhD to use it effectively. Jacob Cohen, who developed this measure, even gave us handy benchmarks: 0.2 for small effects, 0.5 for medium, and 0.8 for large. But here's the thing - these are just starting points. What counts as "large" in pharmaceutical trials might be "trivial" in user experience testing.
The calculation itself is straightforward: take the difference between your two group means and divide by the pooled standard deviation. That pooled standard deviation is just the square root of a sample-size-weighted average of the two groups' variances - nothing magical about it.
Here's what you need:
Mean of group 1
Mean of group 2
Standard deviation of each group
Sample size of each group (for the pooled calculation)
The formula looks like this: d = (M₁ - M₂) / sₚ, where sₚ is that pooled standard deviation. Most analytics platforms will calculate this for you, but understanding what's happening under the hood helps you spot when something looks off.
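If you want to sanity-check the numbers yourself, here's a quick Python sketch. The cohens_d helper and the simulated conversion data are just for illustration - they're not tied to any particular platform - but the pooled-standard-deviation math is the same formula described above.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled SD = square root of the sample-size-weighted average of the variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Simulated data matching the earlier example: control ~100 (SD 20), treatment ~110
control = np.random.default_rng(0).normal(100, 20, size=500)
treatment = np.random.default_rng(1).normal(110, 20, size=500)

print(round(cohens_d(treatment, control), 2))  # should land near 0.5, sampling noise aside
```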
Those Cohen benchmarks I mentioned? They're useful but not gospel. A "small" effect of 0.2 might be huge if you're Netflix and even tiny improvements affect millions of users. Conversely, a "large" effect of 0.8 might not be trustworthy if your sample size is only 20 users - the uncertainty around it will be enormous.
The team at Statistics by Jim points out that context is everything. In education research, a Cohen's d of 0.4 might represent a year's worth of learning progress. In tech, that same 0.4 might be the difference between a feature that barely moves the needle and one that transforms your product.
What really matters is developing intuition for your specific domain. Start tracking Cohen's d values for all your experiments. After a few months, you'll know what "normal" looks like for your metrics and can spot truly exceptional results.
Cohen's d shines when you need to compare apples to oranges - or more realistically, when you need to compare conversion rates to page load times. By standardizing everything into effect sizes, you can make meaningful comparisons across different types of metrics.
Let's say you're running multiple experiments:
Experiment A: 3% lift in purchases (Cohen's d = 0.15)
Experiment B: 8% reduction in support tickets (Cohen's d = 0.35)
Experiment C: 12% increase in session duration (Cohen's d = 0.10)
Without Cohen's d, you might think Experiment C is the winner because 12% sounds impressive. But the effect size tells a different story - Experiment B has the strongest actual impact, even though the percentage change is smaller.
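If all you have are per-experiment summary statistics, standardizing them is a few lines of code. The sketch below is purely illustrative: the cohens_d_from_stats helper and the specific means, standard deviations, and sample sizes are made-up numbers chosen so the outputs line up with the effect sizes quoted above (Experiment B comes out negative because tickets went down, so we rank by absolute value).

```python
from math import sqrt

def cohens_d_from_stats(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen's d from summary statistics, using the pooled standard deviation."""
    pooled_sd = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Hypothetical summary stats standing in for the three experiments above
experiments = {
    "A: purchases":        cohens_d_from_stats(10.3, 10.0, 2.00, 2.00, 4000, 4000),
    "B: support tickets":  cohens_d_from_stats(4.6, 5.0, 1.15, 1.15, 2000, 2000),
    "C: session duration": cohens_d_from_stats(8.4, 7.5, 9.00, 9.00, 4000, 4000),
}

# Rank by magnitude of effect, regardless of which metric it came from
for name, d in sorted(experiments.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: d = {d:.2f}")
```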
This becomes especially valuable when you're prioritizing what to build next. As discussed on Reddit's statistics community, teams often get caught up in statistical significance while ignoring whether the effect is worth pursuing. Cohen's d keeps you honest about what actually moves the needle.
When you're looking at results across multiple experiments or trying to learn from other companies' tests, Cohen's d becomes essential. It's the universal translator of experimental results.
Say you want to understand the impact of reducing page load time. You find:
Amazon's study: Every 100ms delay cost 1% in sales
Google's research: 500ms delay caused 20% drop in traffic
Your own test: 200ms improvement increased conversions by 5%
These results use different metrics and scales. But convert them all to Cohen's d, and suddenly you can see which interventions had the biggest real impact. This is exactly how teams at Statsig aggregate learnings across thousands of experiments - standardizing everything into comparable effect sizes.
Don't let Cohen's d become another vanity metric. It's tempting to chase big effect sizes, but sometimes small, consistent improvements compound into major wins. The key is balancing effect size with other factors:
Sample size matters: A tiny effect can be valuable if it applies to millions of users
Cost considerations: A medium effect might not be worth it if implementation is expensive
User experience: Some changes improve metrics but frustrate users (classic dark pattern territory)
The folks over at r/AskStatistics often debate what constitutes a "trivial" effect size. The answer? It depends entirely on your context. For a startup fighting for every user, a Cohen's d of 0.1 might be worth celebrating. For an established product, you might ignore anything under 0.3.
Here's my advice: establish your own benchmarks. Track Cohen's d for all your experiments for a quarter. Calculate the median and quartiles. Now you have real baselines for what "small," "medium," and "large" mean for your specific product and users.
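Turning a quarter's worth of logged effect sizes into benchmarks is a one-liner with percentiles. The historical values below are hypothetical - the point is the pattern, not the numbers:

```python
import numpy as np

# Hypothetical Cohen's d values logged over a quarter of experiments
past_effect_sizes = [0.04, 0.07, 0.09, 0.11, 0.12, 0.15, 0.18, 0.22, 0.28, 0.35, 0.41, 0.60]

q1, median, q3 = np.percentile(np.abs(past_effect_sizes), [25, 50, 75])
print(f"small for us   (below 25th percentile): d < {q1:.2f}")
print(f"typical for us (around the median):     d ~ {median:.2f}")
print(f"large for us   (above 75th percentile): d > {q3:.2f}")
```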
One final gotcha: Cohen's d assumes your data is roughly normal. If you're dealing with heavily skewed metrics (like revenue per user, where a few whales dominate), consider alternatives like odds ratios or log transformations first.
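To see why skew matters, here's a rough sketch using simulated revenue-per-user data (lognormal is just a stand-in for a whale-heavy distribution). On the raw metric, a handful of extreme values dominate both the mean and the standard deviation, so d bounces around; applying log1p first compresses the tail before computing the same pooled-SD formula:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical revenue-per-user data: heavily right-skewed, a few whales dominate
control_revenue = rng.lognormal(mean=2.0, sigma=1.2, size=5000)
treatment_revenue = rng.lognormal(mean=2.1, sigma=1.2, size=5000)

def cohens_d(a, b):
    # Same pooled-SD calculation sketched earlier
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print("d on raw revenue:", round(cohens_d(treatment_revenue, control_revenue), 3))
print("d on log revenue:", round(cohens_d(np.log1p(treatment_revenue), np.log1p(control_revenue)), 3))
```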
Cohen's d isn't magic - it's just a tool that helps you think more clearly about your experimental results. Start simple: calculate it for your next A/B test and see what insights emerge. You might be surprised to find that some of your "winning" experiments have tiny effects, while some "failures" were actually moving things in the right direction.
If you want to dive deeper, check out Statsig's guide to statistical significance or explore how different fields approach effect size interpretation. The more you work with effect sizes, the better your experimental intuition becomes.
Hope you find this useful!