Difference-in-differences: Causal product inference

Mon Jun 23 2025

Ever launched a product feature and wondered if that spike in engagement was actually because of your change - or just coincidence? You're not alone. Most product teams struggle with this exact question, relying on A/B tests that sometimes miss the bigger picture.

Here's where difference-in-differences (DiD) comes in. It's a clever way to figure out what really caused that change in your metrics when running a traditional experiment isn't possible. Think of it as detective work for data scientists.

The need for causal inference in product development

Let's be honest - we've all been there. You ship a new feature, see metrics go up, and pat yourself on the back. But deep down, you know correlation isn't causation. Maybe those users were just more engaged to begin with.

Standard A/B tests are great when you can randomize everything perfectly. But real life is messy. What if you can't randomly assign users? What if external factors are at play? This is where DiD shines.

Here's how it works: instead of just comparing treatment and control groups, you compare how each group changed over time. It's like watching two runners - one gets new shoes, one doesn't. You don't just check who's faster after; you check how much each improved from their baseline.

The magic happens in that double difference. First, you calculate the change for your treatment group. Then you calculate the change for your control group. The difference between these differences? That's your estimate of the causal effect.
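Here's that double difference as a tiny snippet. The numbers are invented, purely to show the arithmetic:

```python
# Made-up before/after averages for the two groups
treat_before, treat_after = 0.40, 0.52      # group that got the change
control_before, control_after = 0.38, 0.44  # group that didn't

treat_change = treat_after - treat_before        # first difference: 0.12
control_change = control_after - control_before  # first difference: 0.06

did_estimate = treat_change - control_change     # difference of the differences
print(f"DiD estimate: {did_estimate:.2f}")       # 0.06
```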

Matheus Facure puts it well in his causality handbook - the key is assuming both groups would have followed similar paths without the intervention. It's called the parallel trends assumption, and it's basically saying "if we hadn't changed anything, both groups would have kept moving in the same direction."

Demystifying the difference-in-differences methodology

The coolest origin story in statistics might belong to DiD. Back in the 1850s, John Snow (not the Game of Thrones guy) used a primitive version to prove cholera came from contaminated water. He compared death rates between neighborhoods with different water suppliers - basically inventing modern epidemiology while solving a public health crisis.

So what makes DiD tick? Three big assumptions:

  • Parallel trends: Without treatment, both groups would've moved similarly

  • No other shocks: Nothing else happened to just one group during your study

  • Stable composition: The groups themselves didn't change dramatically

Getting these wrong is where most analyses fall apart. I've seen teams claim victory over engagement metrics, only to realize later that their "control" group had a completely different user mix by the end of the study.

The actual implementation isn't rocket science. Pick your groups, define your time periods, then run the analysis. Most folks use OLS regression or panel data methods. The tricky part is handling the gotchas - selection bias, measurement errors, all the fun stuff that keeps data scientists up at night.
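If you want to see what that looks like, here's a minimal sketch of the classic two-group, two-period regression using statsmodels. The data is synthetic, and the column names (metric, treated, post) are just placeholders for whatever your own panel contains:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 2,000 observations with a built-in treatment effect of 0.06
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # 1 = in the treatment group
    "post": rng.integers(0, 2, n),     # 1 = observed after the launch
})
df["metric"] = (
    0.40 + 0.04 * df["post"] + 0.02 * df["treated"]
    + 0.06 * df["treated"] * df["post"]
    + rng.normal(0, 0.05, n)
)

# The coefficient on the interaction term is the DiD estimate
model = smf.ols("metric ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])  # should land near 0.06
```

The interaction coefficient is the whole point: it's the extra movement in the treatment group beyond what the control group and the shared time trend explain.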

Card and Krueger's minimum wage study remains the gold standard example. They compared fast-food employment in New Jersey (which raised its minimum wage) to neighboring Pennsylvania (which didn't). Their finding - that employment didn't fall, and if anything ticked up - flipped conventional wisdom on its head. That's the power of good causal inference.

Applying difference-in-differences in product experimentation

Here's where it gets practical for product teams. Say you're rolling out a new recommendation algorithm to users in California but not Texas. You can't randomize by state, but you can use DiD to measure the real impact.

The process looks like this:

  1. Track your key metrics for both states before launch

  2. Roll out the feature to California

  3. Keep tracking both states after launch

  4. Calculate the difference-in-differences (see the sketch just below)
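Step 4 in pandas, assuming you've already aggregated your metric into pre/post averages per state. The numbers and column names here are made up:

```python
import pandas as pd

# Toy pre/post averages for each state (invented values)
events = pd.DataFrame({
    "state":      ["CA", "CA", "TX", "TX"],
    "period":     ["pre", "post", "pre", "post"],
    "engagement": [0.41, 0.49, 0.39, 0.42],
})

cells = events.pivot_table(index="state", columns="period", values="engagement")
ca_change = cells.loc["CA", "post"] - cells.loc["CA", "pre"]  # 0.08
tx_change = cells.loc["TX", "post"] - cells.loc["TX", "pre"]  # 0.03
print(f"DiD estimate: {ca_change - tx_change:.2f}")           # 0.05
```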

But watch out for the common pitfalls. Selection bias is the big one - maybe California users are just different. That's why smart teams use propensity score matching to find comparable users across groups.
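Here's a rough sketch of what that matching step can look like with scikit-learn. The covariates (tenure, sessions per week) are placeholders - in practice you'd use whatever pre-launch attributes plausibly differ between your groups:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic user table; replace with your real pre-launch covariates
rng = np.random.default_rng(7)
n = 1000
users = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "tenure_days": rng.integers(1, 1000, n),
    "sessions_per_week": rng.poisson(5, n),
})

# Step 1: model the probability of being in the treatment group
features = ["tenure_days", "sessions_per_week"]
ps_model = LogisticRegression(max_iter=1000).fit(users[features], users["treated"])
users["propensity"] = ps_model.predict_proba(users[features])[:, 1]

treated = users[users["treated"] == 1]
control = users[users["treated"] == 0]

# Step 2: for each treated user, grab the control user with the closest
# propensity score (matching with replacement, for simplicity)
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = control.iloc[idx.ravel()]
# treated + matched_control now form a more comparable panel for the DiD analysis
```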

Verifying parallel trends isn't just academic hand-waving; it's critical. Plot those pre-treatment trends. If California and Texas users were already diverging before your launch, your whole analysis is suspect. Run placebo tests using fake treatment dates to double-check.
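A placebo test is simple to sketch: rerun the same regression on pre-launch data only, with a fake launch date. If the interaction term comes out "significant," your parallel trends story is in trouble. The weekly panel below is synthetic, just to make the sketch runnable:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic pre-launch weekly panel with a shared time trend and no real effect
rng = np.random.default_rng(0)
weeks, users_per_group = 12, 200
rows = []
for treated in (0, 1):
    for week in range(weeks):  # all of these weeks are *before* the real launch
        metric = 0.40 + 0.02 * treated + 0.005 * week + rng.normal(0, 0.03, users_per_group)
        rows.append(pd.DataFrame({"treated": treated, "week": week, "metric": metric}))
panel = pd.concat(rows, ignore_index=True)

# Pretend the launch happened at week 6
FAKE_LAUNCH_WEEK = 6
panel["fake_post"] = (panel["week"] >= FAKE_LAUNCH_WEEK).astype(int)

placebo = smf.ols("metric ~ treated + fake_post + treated:fake_post", data=panel).fit()
# The interaction should be near zero and insignificant; anything else is a red flag
print(placebo.params["treated:fake_post"], placebo.pvalues["treated:fake_post"])
```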

Best practices I've learned the hard way:

  • Define your groups clearly and upfront (no cherry-picking later)

  • Pick time windows that make sense (not just "whatever gives good results")

  • Use robust standard errors to handle correlation over time (see the sketch after this list)

  • Always, always run sensitivity checks
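On that robust standard errors point: when the same user or market shows up in multiple periods, naive standard errors look more precise than they really are. One common fix is clustering by unit, which statsmodels supports directly. The data below is synthetic and the column names are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic user panel: each user is observed once before and once after launch
rng = np.random.default_rng(1)
n_users = 500
users = pd.DataFrame({
    "unit_id": np.arange(n_users),
    "treated": rng.integers(0, 2, n_users),
    "user_level": rng.normal(0, 0.05, n_users),  # persistent per-user noise
})
panel = pd.concat([users.assign(post=0), users.assign(post=1)], ignore_index=True)
panel["metric"] = (
    0.40 + 0.03 * panel["post"] + 0.02 * panel["treated"]
    + 0.05 * panel["treated"] * panel["post"]
    + panel["user_level"] + rng.normal(0, 0.03, len(panel))
)

# Cluster standard errors by user so repeated observations of the same person
# don't inflate your confidence in the estimate
model = smf.ols("metric ~ treated + post + treated:post", data=panel)
result = model.fit(cov_type="cluster", cov_kwds={"groups": panel["unit_id"]})
print(result.params["treated:post"], result.bse["treated:post"])
```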

One team I worked with used DiD to evaluate a pricing change rolled out by market. They discovered the initial "20% revenue lift" was actually just 5% after accounting for seasonal trends. Not as exciting, but way more honest.

Advancements and innovations in difference-in-differences analysis

The field isn't standing still. The universal DiD approach is probably the biggest game-changer recently. It basically says "what if parallel trends is too strict?" and loosens things up using fancy math like generalized linear models and propensity scores.

The Reddit data science community has been buzzing about combining DiD with regression discontinuity design. It's like having a backup plan when parallel trends gets shaky. You get the best of both worlds - the time dimension from DiD and the sharp cutoff from RDD.

Machine learning is creeping in too. Teams are using it for better propensity score matching, finding control groups that actually make sense. At Statsig, we're seeing companies get creative with these hybrid approaches, especially when dealing with messy real-world data.

What does this mean for your product team? A few things:

  • You can tackle harder questions (even with imperfect data)

  • Results are more robust to assumption violations

  • You can combine multiple methods for stronger conclusions

The integration possibilities are endless. Imagine using DiD to validate your ML model's predictions, or combining it with bandit algorithms for smarter feature rollouts. The future of product analytics is causal, not just correlational.

Closing thoughts

Difference-in-differences isn't just another statistical technique - it's a different way of thinking about cause and effect in your product. When A/B tests aren't possible or miss important context, DiD helps you understand what really drives user behavior.

Start simple. Pick one feature rollout where you couldn't randomize perfectly. Plot those trends. Calculate that double difference. You might be surprised what you find.

Want to dive deeper? Check out:

  • Matheus Facure's Python Causality Handbook for hands-on examples

  • Card and Krueger's original paper for inspiration

  • The Statsig blog for more practical experimentation tips

Hope you find this useful! Now go forth and establish some causality.


