MethodAtlas
Guide

Matching vs. IPW vs. Doubly Robust

A practical comparison of selection-on-observables estimators: matching (exact, propensity score, CEM), inverse probability weighting, and doubly robust/AIPW methods. Covers assumptions, tradeoffs, and when to use each approach.

Three Approaches to the Same Problem

When your identification strategy relies on selection on observables — the assumption that, conditional on measured covariates, treatment assignment is independent of potential outcomes — you have several estimators to choose from. The three major families are:

  1. Matching: find untreated units that look like treated units and compare outcomes directly.
  2. Inverse Probability Weighting (IPW): reweight the sample so that treatment is independent of covariates.
  3. Doubly Robust (DR) / AIPW: combine an outcome model with a propensity score model for double protection against misspecification.

All three families target the same causal parameter (typically the ATT or ATE) under the same core assumption. The difference lies in how each family uses the covariate information and what happens when the models are wrong.

The Common Foundation: Conditional Independence

Every method in this guide requires the conditional independence assumption (CIA), also known as unconfoundedness or ignorability:

\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X

where Y(0) and Y(1) are potential outcomes, D is treatment, and X is the vector of observed covariates. In words: once you condition on X, treatment assignment carries no additional information about potential outcomes.

CIA is untestable. You can never verify from data alone that you have observed all relevant confounders. Institutional knowledge, rich covariates, and sensitivity analysis are how you build a case for the assumption's plausibility.

In addition to CIA, all three families require overlap (also called common support or positivity):

0 < P(D = 1 \mid X = x) < 1 \quad \text{for all } x \text{ in the support of } X

Every covariate value observed in the treated group must also appear in the control group, and vice versa. Overlap violations cause all three estimators to break down — but the estimators break down in different ways.
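
Unlike CIA, overlap is checkable from the data. As a rough illustration (the function name and the 0.05/0.95 trimming band are conventions assumed here, not fixed rules), a diagnostic might summarize estimated propensity scores by treatment group:

```python
import numpy as np

def overlap_report(ps, d, lo=0.05, hi=0.95):
    """Summarize estimated propensity scores by treatment group.

    ps: estimated propensity scores; d: 0/1 treatment indicator.
    The [lo, hi] band is an illustrative convention, not a rule.
    """
    ps, d = np.asarray(ps, float), np.asarray(d, int)
    return {
        "treated_range": (ps[d == 1].min(), ps[d == 1].max()),
        "control_range": (ps[d == 0].min(), ps[d == 0].max()),
        "share_outside_band": float(np.mean((ps < lo) | (ps > hi))),
    }
```

A large share of scores outside the band, or group ranges that barely intersect, signals the kind of overlap failure that all three estimator families inherit.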

Matching Methods

Matching constructs the counterfactual by directly pairing treated units with similar control units. The treatment effect is the average difference in outcomes between matched pairs.

Exact Matching

The simplest form: match treated and control units that have identical covariate values. Exact matching is unbiased when feasible but suffers from the curse of dimensionality — with more than a few covariates, the probability of finding exact matches drops to zero. Exact matching works best with discrete covariates and small covariate sets.
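
The mechanics fit in a few lines. This is an illustrative ATT estimator (the helper name exact_match_att is made up here): each treated unit is compared to the mean outcome of controls with identical covariates, and treated units with no exact match are dropped and counted:

```python
from collections import defaultdict

def exact_match_att(y, d, x_rows):
    """ATT by exact matching on discrete covariates (illustrative sketch).

    y: outcomes; d: 0/1 treatment; x_rows: covariate tuples per unit.
    Treated units with no exact control match are dropped and counted.
    """
    controls = defaultdict(list)
    for yi, di, xi in zip(y, d, x_rows):
        if di == 0:
            controls[xi].append(yi)
    diffs, unmatched = [], 0
    for yi, di, xi in zip(y, d, x_rows):
        if di == 1:
            if controls[xi]:
                diffs.append(yi - sum(controls[xi]) / len(controls[xi]))
            else:
                unmatched += 1
    att = sum(diffs) / len(diffs) if diffs else float("nan")
    return att, unmatched
```

The unmatched count makes the curse of dimensionality visible: as covariates are added, it grows toward the full treated sample.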

Propensity Score Matching (PSM)

Rather than matching on all covariates directly, PSM matches on the propensity score e(X) = P(D = 1 | X), which collapses the covariate vector into a single scalar. The Rosenbaum-Rubin theorem shows that conditioning on the propensity score is sufficient for removing confounding from observed covariates.

PSM implementation choices:

  • Nearest-neighbor matching (with or without caliper): match each treated unit to the closest control unit(s) on the propensity score.
  • Kernel matching: weight all control units by a kernel function of the propensity score distance.
  • Radius matching: match each treated unit to all control units within a caliper radius.
  • With or without replacement: matching with replacement allows control units to be used multiple times, reducing bias but increasing variance.
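
A minimal sketch of the first variant, assuming propensity scores have already been estimated (the function name and caliper default are illustrative; a common rule of thumb instead sets the caliper to 0.2 standard deviations of the logit of the score):

```python
import numpy as np

def psm_att_nn(y, d, ps, caliper=0.05):
    """ATT via 1:1 nearest-neighbor matching on the propensity score,
    with replacement and a caliper (illustrative sketch)."""
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    yt, pt = y[d == 1], ps[d == 1]   # treated outcomes and scores
    yc, pc = y[d == 0], ps[d == 0]   # control outcomes and scores
    diffs = []
    for yi, pi in zip(yt, pt):
        j = np.argmin(np.abs(pc - pi))      # nearest control on the score
        if abs(pc[j] - pi) <= caliper:      # enforce the caliper
            diffs.append(yi - yc[j])        # treated units past it are dropped
    return float(np.mean(diffs)) if diffs else float("nan")
```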

Coarsened Exact Matching (CEM)

CEM coarsens continuous covariates into bins and then performs exact matching on the coarsened values. CEM avoids model dependence (no propensity score model needed) and bounds the maximum imbalance by the coarsening level. The tradeoff: coarser bins yield more matches but more within-bin heterogeneity; finer bins yield fewer matches but better balance.
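
A stripped-down version of the coarsening step for a single continuous covariate with equal-width bins (real CEM implementations coarsen every covariate and cross the bins; the names here are hypothetical):

```python
import numpy as np

def cem_strata(x_cont, d, n_bins=5):
    """Coarsen one continuous covariate into equal-width bins and keep
    only strata containing both treated and control units (sketch).

    Returns a boolean mask of retained units and the stratum ids.
    """
    x, d = np.asarray(x_cont, float), np.asarray(d, int)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    strata = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    keep = np.zeros(len(x), dtype=bool)
    for s in np.unique(strata):
        in_s = strata == s
        if d[in_s].any() and (1 - d[in_s]).any():  # both groups present
            keep |= in_s
    return keep, strata
```

The n_bins knob is exactly the coarsening tradeoff from the text: fewer, wider bins retain more units at the cost of within-bin heterogeneity.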

Strengths of Matching

  • Nonparametric: makes no assumptions about the functional form of the outcome model.
  • Transparent: you can inspect the matched sample and verify covariate balance directly.
  • Forces attention to overlap: units without good matches are naturally excluded, preventing extrapolation.

Weaknesses of Matching

  • Discards data: unmatched control units are thrown away, reducing precision.
  • Matching estimator choice sensitivity: results can depend on the distance metric, caliper, and number of matches.
  • PSM model dependence: propensity score matching is only as good as the propensity score model. A misspecified propensity score model produces bad matches.
  • No double protection: if the propensity score model is wrong, the estimate is biased — full stop.

Concept Check

You estimate propensity scores and find that 15% of treated units have propensity scores above 0.95 (very high). What does this pattern indicate and what should you do?

Inverse Probability Weighting (IPW)

IPW takes a different approach: rather than selecting a matched subsample, IPW reweights all observations so that the distribution of covariates is balanced between treated and control groups. The weight for each unit is the inverse of the probability of receiving the treatment the unit actually received.

For the ATE, the Horvitz-Thompson weights are:

w_i = \frac{D_i}{e(X_i)} + \frac{1 - D_i}{1 - e(X_i)}

For the ATT, treated units receive weight 1 and control units receive weight e(X_i) / (1 - e(X_i)).
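
A minimal Hajek (weight-normalized) IPW estimator of the ATE, assuming the propensity scores are already estimated (function name invented for illustration):

```python
import numpy as np

def ipw_ate_hajek(y, d, ps):
    """Hajek (normalized) IPW estimate of the ATE.

    Normalizing the weights within each arm tames some of the raw
    Horvitz-Thompson estimator's instability under extreme scores.
    """
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    w1 = d / ps                  # treated weights: 1 / e(X)
    w0 = (1 - d) / (1 - ps)      # control weights: 1 / (1 - e(X))
    mu1 = np.sum(w1 * y) / np.sum(w1)
    mu0 = np.sum(w0 * y) / np.sum(w0)
    return float(mu1 - mu0)
```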

Strengths of IPW

  • Uses all data: no observations are discarded (unlike matching), which can improve precision.
  • Simple to implement: estimate a propensity score model, compute weights, run a weighted regression or weighted mean comparison.
  • Semiparametric efficiency: the Hajek (normalized) IPW estimator can achieve the semiparametric efficiency bound under correct specification.

Weaknesses of IPW

  • Extreme weight sensitivity: when propensity scores are near 0 or 1, the inverse weights become very large, causing high variance and instability. A single unit with a propensity score of 0.01 receives a weight of 100, dominating the estimate.
  • Model dependence: like PSM, IPW relies entirely on a correctly specified propensity score model. If the model is wrong, the reweighting does not balance covariates, and the estimate is biased.
  • No outcome modeling: IPW does not model the relationship between covariates and outcomes. If the propensity score model is slightly wrong, there is no backup.

Doubly Robust Estimation (AIPW)

Doubly robust estimation, also called Augmented IPW (AIPW), combines an outcome model (predicting Y from X) with a propensity score model (predicting D from X). The estimator augments the IPW estimator with a bias correction term from the outcome model.

The AIPW estimator for the ATE is:

\hat{\tau}_{AIPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} + \hat{\mu}_1(X_i) \right] - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1-\hat{e}(X_i)} + \hat{\mu}_0(X_i) \right]

where \hat{\mu}_d(X) is the estimated conditional mean outcome under treatment d and \hat{e}(X) is the estimated propensity score.
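
The formula translates directly into code. A sketch, assuming the nuisance functions have already been fit and evaluated for every unit (names are illustrative):

```python
import numpy as np

def aipw_ate(y, d, e_hat, mu1_hat, mu0_hat):
    """AIPW / doubly robust ATE estimate from fitted nuisances.

    e_hat: estimated propensity scores; mu1_hat, mu0_hat: predicted
    outcomes under treatment and control for every unit.
    """
    y, d, e, m1, m0 = (np.asarray(a, float)
                       for a in (y, d, e_hat, mu1_hat, mu0_hat))
    psi1 = d * (y - m1) / e + m1                # augmented treated mean
    psi0 = (1 - d) * (y - m0) / (1 - e) + m0    # augmented control mean
    return float(np.mean(psi1 - psi0))
```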

The Double Robustness Property

The estimator is consistent if either model is correctly specified — you do not need both to be right. If the propensity score model is correct but the outcome model is wrong, the IPW component produces unbiased estimates. If the outcome model is correct but the propensity score model is wrong, the regression adjustment component produces unbiased estimates. Only when both models are wrong does the estimator break down.
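
A small simulation can make the property concrete. Here the propensity model is deliberately wrong (a constant 0.5, while true assignment depends on X), but the outcome models are correctly specified linear regressions, and the AIPW estimate still lands near the true ATE of 2 (the data-generating process and helper names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))   # true P(D=1|X) depends on x
y = 1 + 2 * d + x + rng.normal(size=n)      # true ATE is 2

def ols_predict(xa, ya, xq):
    """Fit y ~ 1 + x by least squares, then predict at xq."""
    X = np.column_stack([np.ones_like(xa), xa])
    beta, *_ = np.linalg.lstsq(X, ya, rcond=None)
    return beta[0] + beta[1] * xq

# Correct outcome models, fit separately within each arm.
m1 = ols_predict(x[d == 1], y[d == 1], x)
m0 = ols_predict(x[d == 0], y[d == 0], x)

e_wrong = np.full(n, 0.5)                   # badly misspecified propensity model
tau = (np.mean(d * (y - m1) / e_wrong + m1)
       - np.mean((1 - d) * (y - m0) / (1 - e_wrong) + m0))
print(tau)                                  # close to 2 despite the wrong e-hat
```

Swapping in the true propensity scores with a deliberately wrong outcome model (say, constant means) recovers the effect the other way around, which is the symmetric half of the double robustness claim.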

Strengths of DR/AIPW

  • Double protection: consistent under strictly weaker conditions than either matching/IPW or regression adjustment alone.
  • Semiparametric efficiency: AIPW achieves the semiparametric efficiency bound when both models are correctly specified, meaning no regular estimator can do better asymptotically.
  • Natural pairing with ML: the doubly robust structure is the foundation of Double/Debiased Machine Learning (DML), which uses flexible ML models for the nuisance parameters while maintaining valid inference.
  • Bias reduction: even when neither model is exactly right, AIPW typically has smaller bias than IPW or regression alone because the errors in the two models partially offset each other.

Weaknesses of DR/AIPW

  • Two models to specify: you must specify and estimate both an outcome model and a propensity score model, increasing the analyst's burden.
  • Extreme weights still matter: when propensity scores are extreme, the augmentation helps but does not fully solve the instability.
  • Not magic: "doubly robust" does not mean "always right." If both models are badly misspecified, AIPW can be severely biased.

Concept Check

Your propensity score model is a simple logit with linear terms, and the true treatment assignment mechanism involves complex interactions. Your outcome model is a flexible random forest. Under doubly robust estimation, is the treatment effect estimate consistent?

Head-to-Head Comparison

| Feature | Matching | IPW | Doubly Robust (AIPW) |
| --- | --- | --- | --- |
| Models needed | Propensity score (or distance metric) | Propensity score | Propensity score + outcome |
| Uses all data? | No (unmatched units dropped) | Yes | Yes |
| Robust to PS misspecification? | No | No | Yes (if outcome model correct) |
| Robust to outcome misspecification? | Yes (nonparametric) | N/A (no outcome model) | Yes (if PS model correct) |
| Efficiency | Lower (data discarded) | Moderate (can be high-variance) | Highest (semiparametric bound) |
| Extreme weight sensitivity | Low (trimmed by matching) | High | Moderate (augmentation helps) |
| Transparency | High (inspect matched sample) | Moderate (inspect weights) | Lower (two models interact) |

When to Use Each Method

Use Matching When:

  • You want maximal transparency and simplicity. Matching produces a subsample where treated and control units are comparable, and you can verify balance directly.
  • The research audience values nonparametric credibility. In fields where regression is viewed skeptically, matched comparisons carry persuasive weight.
  • Overlap is limited and you want the estimator to naturally restrict to the common support region.
  • You have a small number of discrete covariates where exact or CEM matching is feasible.

Use IPW When:

  • You want to use all data without discarding observations.
  • The propensity score model is well-specified and overlap is good (no extreme weights).
  • You are estimating the ATE (population-level effect) rather than the ATT. IPW naturally targets the ATE with Horvitz-Thompson weights.
  • The setting calls for survey-style reweighting, which IPW generalizes.

Use Doubly Robust (AIPW) When:

  • You want protection against single-model misspecification and can specify both an outcome model and a propensity score model. In settings with adequate overlap and reasonable nuisance estimation quality, AIPW's double robustness provides a meaningful safeguard.
  • You are working with high-dimensional covariates and plan to use flexible models (random forests, LASSO, neural networks) for the nuisance parameters. AIPW paired with cross-fitting is the foundation of DML.
  • You want to achieve the semiparametric efficiency bound, extracting maximal precision from your data when both models are well-specified.
  • You want a method that tends to be more resilient to mild misspecification than pure matching or pure IPW, though AIPW is not immune to finite-sample instability when propensity scores are extreme or nuisance models are poorly estimated.

Practical Workflow

Regardless of which estimator you choose, follow the same workflow:

  1. Specify and estimate the propensity score. Check balance. Examine the overlap region. Diagnose extreme scores.
  2. Estimate the treatment effect using your primary estimator (matching, IPW, or AIPW).
  3. Check covariate balance in the matched/reweighted sample. Standardized mean differences should be small (< 0.1 is a common threshold).
  4. Run robustness checks. Use all three estimators if feasible. Agreement across methods is reassuring.
  5. Conduct sensitivity analysis. Report Oster bounds, Cinelli-Hazlett robustness values, or E-values to assess how much unobserved confounding would be needed to explain your result.
  6. Report the overlap region. Show the propensity score distribution by treatment group. Report how many units fall outside common support and how you handled those units.
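
Step 3's balance check can be sketched as a weighted standardized mean difference for one covariate (the pooled-variance denominator used here is one common convention; others divide by the treated-group SD):

```python
import numpy as np

def standardized_mean_diff(x, d, w=None):
    """Weighted standardized mean difference for one covariate.

    x: covariate values; d: 0/1 treatment; w: optional matching or
    IPW weights (defaults to unweighted).
    """
    x, d = np.asarray(x, float), np.asarray(d, int)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    m1 = np.average(x[d == 1], weights=w[d == 1])
    m0 = np.average(x[d == 0], weights=w[d == 0])
    # Pool the unweighted group variances in the denominator.
    s = np.sqrt((x[d == 1].var(ddof=1) + x[d == 0].var(ddof=1)) / 2)
    return float((m1 - m0) / s)
```

Computed per covariate on the matched or reweighted sample, values under the 0.1 threshold from step 3 indicate acceptable balance.
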

Concept Check

You estimate the ATT using propensity score matching and get a coefficient of 0.15. You then estimate the ATT using AIPW and get 0.08. What is the most productive next step?