Propensity score matching: Balanced groups

Mon Jun 23 2025

You've probably been in this situation before: you're looking at observational data, trying to figure out if a new feature actually moved the needle, but there's a nagging doubt. What if the users who got the feature were just fundamentally different from those who didn't?

This is where propensity score matching comes in - it's basically a statistical way to create apples-to-apples comparisons when you can't run a proper A/B test. Let's dig into how it works and when you should (and shouldn't) use it.

Introduction to propensity score matching

Propensity score matching (PSM) is one of those techniques that sounds way more complicated than it actually is. At its core, it's about finding similar people in your treatment and control groups so you can make fair comparisons. Think of it like this: if you want to know whether premium users churn less, you can't just compare all premium users to all free users - premium users might be power users who would stick around anyway.

The way PSM works is pretty clever. First, you calculate each person's likelihood of being in the treatment group based on their characteristics - that's the propensity score. As folks on Reddit's epidemiology forum discuss, it's essentially trying to mimic what randomization does in experiments. Then you match people with similar scores across groups.

Here's the key insight: PSM tackles selection bias head-on. When your treatment and control groups look different from the start, any difference in outcomes might just be because of those pre-existing differences, not your treatment. By matching on observed characteristics, you're trying to isolate the actual treatment effect.

The most common way to estimate these scores is through logistic regression - you're basically predicting who would get the treatment based on what you know about them. Once you have scores, you've got options:

  • Nearest neighbor matching: pair each treated person with their closest match

  • Caliper matching: only match if the scores are within a certain distance

  • One-to-many matching: match one treated person to multiple controls (or vice versa)
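To make this concrete, here's a minimal sketch of those first two steps - estimating scores with logistic regression, then greedy 1:1 nearest neighbor matching. The DataFrame, column names, and covariates are hypothetical stand-ins, and the data is simulated purely so the snippet runs; treat it as an illustration rather than a production implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated observational data: `treated` marks feature exposure,
# `age` and `sessions_per_week` are the observed covariates.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(35, 10, n),
    "sessions_per_week": rng.poisson(5, n),
})
# Older users are more likely to be treated, so the raw groups aren't comparable
df["treated"] = (rng.random(n) < 1 / (1 + np.exp(-(df["age"] - 35) / 10))).astype(int)

# Step 1: propensity scores from logistic regression
covariates = ["age", "sessions_per_week"]
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

# Step 2: greedy 1:1 nearest neighbor matching, without replacement
treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0].copy()
pairs = []
for idx, row in treated.iterrows():
    if controls.empty:
        break
    gaps = (controls["pscore"] - row["pscore"]).abs()
    best = gaps.idxmin()           # closest remaining control
    pairs.append((idx, best))
    controls = controls.drop(best)

matched = df.loc[[i for pair in pairs for i in pair]]
```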

But here's where it gets tricky - after matching, you need to check if it actually worked. The go-to metric is the standardized mean difference (SMD). If any covariate's SMD is above 0.1, your groups still aren't balanced enough. Sometimes you'll need to go back to the drawing board with your model, or try double-adjustment to clean up any remaining imbalances.
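That check is easy to automate. Here's a small sketch that computes the SMD for each covariate on the matched sample and flags anything above 0.1; it reuses the hypothetical `matched` DataFrame and `covariates` list from the snippet above.

```python
def standardized_mean_difference(x_treated, x_control):
    """SMD: difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

for col in covariates:
    smd = standardized_mean_difference(
        matched.loc[matched["treated"] == 1, col],
        matched.loc[matched["treated"] == 0, col],
    )
    flag = "balanced" if abs(smd) < 0.1 else "still imbalanced"
    print(f"{col}: SMD = {smd:.3f} ({flag})")
```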

Methods of estimating and applying propensity scores

Let's get practical about actually doing this. Most people start with logistic regression because it's straightforward - you throw in your covariates and out pops a probability. But here's something interesting: machine learning approaches like random forests or gradient boosting can sometimes do a better job, especially when you have complex interactions between variables.
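If you want to go the machine learning route, swapping the estimator is about all it takes. Here's a hedged sketch using scikit-learn's gradient boosting on the same hypothetical `df` and `covariates` from earlier; just be aware that a very flexible model can push scores toward 0 and 1 and shrink your overlap.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting can capture interactions and non-linearities that a
# plain logistic regression would miss.
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
gbm.fit(df[covariates], df["treated"])
df["pscore_gbm"] = gbm.predict_proba(df[covariates])[:, 1]
```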

The matching part is where you have to make some decisions. Nearest neighbor matching is the most intuitive - find the control person with the closest score to each treated person. But sometimes that "closest" match isn't very close at all. That's where caliper matching saves the day by setting a maximum distance. I've seen people get burned by being too strict with their caliper though - you might end up throwing away half your data.
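In code, caliper matching is a small change to the greedy loop: skip any pair whose score gap exceeds the caliper. A common rule of thumb is 0.2 times the standard deviation of the logit of the propensity score; the sketch below uses a simplified version on the raw score, again with the hypothetical `df` from earlier.

```python
# Caliper matching: only accept a match if the score gap is small enough
caliper = 0.2 * df["pscore"].std()   # simplified rule of thumb on the raw score

treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0].copy()
caliper_pairs = []
for idx, row in treated.iterrows():
    if controls.empty:
        break
    gaps = (controls["pscore"] - row["pscore"]).abs()
    best = gaps.idxmin()
    if gaps.loc[best] <= caliper:          # within the caliper: keep the match
        caliper_pairs.append((idx, best))
        controls = controls.drop(best)
    # otherwise this treated unit goes unmatched and is dropped

print(f"Matched {len(caliper_pairs)} of {len(treated)} treated units")
```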

Once you've matched, it's time for the moment of truth: checking your balance. Balance diagnostics tell you whether your matching actually worked. Here's what you're looking for:

  • Standardized mean differences below 0.1 for all covariates

  • Similar variance ratios between groups

  • Overlapping distributions when you plot the propensity scores
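The last two checks are easy to eyeball in code. This sketch, still assuming the hypothetical `matched` DataFrame from earlier, prints variance ratios and plots the propensity score distributions for the two groups.

```python
import matplotlib.pyplot as plt

t = matched[matched["treated"] == 1]
c = matched[matched["treated"] == 0]

# Variance ratios close to 1 suggest a similar spread in each covariate
for col in covariates:
    print(f"{col}: variance ratio = {t[col].var(ddof=1) / c[col].var(ddof=1):.2f}")

# Overlapping score distributions indicate common support
plt.hist(t["pscore"], bins=30, alpha=0.5, density=True, label="treated")
plt.hist(c["pscore"], bins=30, alpha=0.5, density=True, label="control")
plt.xlabel("Propensity score")
plt.legend()
plt.show()
```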

If things still look wonky after matching, double-adjustment methods can help. You basically run another regression on your matched sample to mop up any remaining confounding. Just remember - PSM only handles what you can see. Those unobserved confounders? They're still lurking in the shadows.
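A double-adjustment pass can be as simple as an outcome regression on the matched sample with the treatment indicator plus the covariates, so any leftover imbalance gets absorbed by the covariate terms. The sketch below uses statsmodels and fabricates an `outcome` column purely so it runs; in practice you'd use your real outcome.

```python
import statsmodels.api as sm

# Simulated outcome, for illustration only
matched = matched.assign(outcome=rng.normal(0, 1, len(matched)) + 0.3 * matched["treated"])

# Regress the outcome on treatment plus covariates within the matched sample;
# the coefficient on `treated` is the double-adjusted effect estimate.
X = sm.add_constant(matched[["treated"] + covariates])
fit = sm.OLS(matched["outcome"], X).fit()
print(fit.summary().tables[1])
```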

Challenges and limitations in propensity score matching

Let's be honest about where PSM falls short. The biggest limitation is that it can only balance on variables you actually have. Got some unmeasured confounder that's driving both treatment assignment and outcomes? PSM can't help you there. This is why people sometimes joke that PSM gives you "causal inference with fingers crossed."

You also need decent sample sizes to make this work. I've seen people try PSM with a few hundred observations and wonder why their matches are terrible. The PMC literature suggests you need enough overlap in your propensity scores - if your treatment and control groups are too different, you're out of luck.

Another headache: choosing your matching algorithm matters more than you think. The caliper width, the matching ratio, whether you match with or without replacement - these aren't just technical details. As discussed on Reddit's statistics forum, using 3-to-1 matching might make sense when your groups are unbalanced, but it changes your interpretation.

Sometimes you'll match your heart out and still have imbalanced covariates. This often happens when your groups are fundamentally different - like trying to match startup employees with Fortune 500 workers on salary data. When this happens, you might need double-adjustment or honestly, a different approach entirely.

The bottom line: PSM is a tool, not magic. It works best when you have rich covariate data, reasonable sample sizes, and groups that aren't wildly different to begin with.

Best practices and applications of propensity score matching

Let me walk you through a real example using the Titanic dataset - it's a classic for a reason. Say we want to know if traveling in third class actually hurt your survival chances, controlling for things like age and gender. This is perfect for PSM because we can't exactly randomize people into cabin classes after the fact.

Here's the process:

  1. Calculate propensity scores using logistic regression with age, gender, and other characteristics

  2. Match third-class passengers to similar first/second-class passengers

  3. Check your balance with standardized mean differences

  4. Estimate the treatment effect on your matched sample
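Here's a rough end-to-end sketch of those four steps in Python, using the Titanic data that ships with seaborn. The column names (`pclass`, `age`, `sex`, `survived`) come from that dataset, but the rest of the pipeline - covariate choice, greedy matching, a simple difference in survival rates - is a simplified illustration, not a definitive analysis.

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression

titanic = sns.load_dataset("titanic").dropna(subset=["age"])
titanic["third_class"] = (titanic["pclass"] == 3).astype(int)
titanic["is_male"] = (titanic["sex"] == "male").astype(int)
covariates = ["age", "is_male"]

# 1. Propensity of traveling third class, given age and gender
ps_model = LogisticRegression(max_iter=1000).fit(titanic[covariates], titanic["third_class"])
titanic["pscore"] = ps_model.predict_proba(titanic[covariates])[:, 1]

# 2. Greedy 1:1 matching of third-class passengers to first/second-class passengers
treated = titanic[titanic["third_class"] == 1]
controls = titanic[titanic["third_class"] == 0].copy()
pairs = []
for idx, row in treated.iterrows():
    if controls.empty:
        break
    gaps = (controls["pscore"] - row["pscore"]).abs()
    best = gaps.idxmin()
    pairs.append((idx, best))
    controls = controls.drop(best)
matched = titanic.loc[[i for pair in pairs for i in pair]]

# 3. Balance check: standardized mean differences should be below 0.1
for col in covariates:
    t = matched.loc[matched["third_class"] == 1, col]
    c = matched.loc[matched["third_class"] == 0, col]
    smd = (t.mean() - c.mean()) / np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
    print(f"{col}: SMD = {smd:.3f}")

# 4. Treatment effect: difference in survival rates on the matched sample
effect = (matched.loc[matched["third_class"] == 1, "survived"].mean()
          - matched.loc[matched["third_class"] == 0, "survived"].mean())
print(f"Estimated effect of third class on survival: {effect:+.3f}")
```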

The reporting piece is crucial and often bungled. According to a PMC review, tons of published studies skip balance diagnostics entirely. That's like claiming you baked a cake without checking if it's actually cooked. At minimum, you should report SMDs for all covariates, show propensity score distributions, and be transparent about how many observations you dropped.

The Statalist forums make a great point: only include true confounders in your propensity score model. Throwing in every variable you have just adds noise and makes matching harder. Think carefully about what actually affects both treatment assignment and your outcome.

When it comes to matching ratios, flexibility helps. This Reddit discussion highlights that 3-to-1 matching can work well when you have way more controls than treated units. But always check for common support first - as another Reddit thread notes, if your propensity score distributions don't overlap, you're essentially comparing apples to oranges no matter how fancy your matching algorithm is.

At Statsig, we've seen teams successfully use PSM to evaluate feature rollouts when randomization wasn't possible - like analyzing the impact of a premium tier that users self-selected into. The key is being rigorous about your assumptions and honest about the limitations.

Closing thoughts

Propensity score matching is like a good kitchen knife - incredibly useful when used properly, but it won't solve every problem. It shines when you have observational data, good covariates, and need to estimate causal effects. Just remember that it only accounts for what you can measure, requires decent sample sizes, and needs careful implementation.

If you're looking to dive deeper, I'd recommend starting with Stuart's 2010 paper on matching methods, or if you're more hands-on, grab a dataset and try it yourself. The Titanic example I mentioned earlier is freely available and perfect for practice.

And if you're dealing with product experiments where traditional A/B testing hits its limits, tools like Statsig can help you navigate these causal inference challenges with more sophisticated approaches.

Hope you find this useful!


