MethodAtlas
Guide

Matching vs. IPW vs. Doubly Robust

A practical comparison of selection-on-observables estimators: matching (exact, propensity score, CEM), inverse probability weighting, and doubly robust/AIPW methods. Covers assumptions, tradeoffs, and when to use each approach.

Three Approaches to the Same Problem

When your identification strategy relies on selection on observables — the assumption that, conditional on measured covariates, treatment assignment is independent of potential outcomes — you have several estimators to choose from. The three major families are:

  1. Matching: find untreated units that look like treated units and compare outcomes directly.
  2. Inverse Probability Weighting (IPW): reweight the sample so that treatment is independent of covariates.
  3. Doubly Robust (DR) / AIPW: combine an outcome model with a propensity score model for double protection against misspecification.

All three families target the same causal parameter (typically the ATT or ATE) under the same core assumption. The difference lies in how each family uses the covariate information and what happens when the models are wrong.

The Common Foundation: Conditional Independence

Every method in this guide requires the conditional independence assumption (CIA), also known as unconfoundedness or ignorability:

\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X

where Y(0) and Y(1) are potential outcomes, D is treatment, and X is the vector of observed covariates. In words: once you condition on X, treatment assignment carries no additional information about potential outcomes.

CIA is untestable. You can never verify from data alone that you have observed all relevant confounders. Institutional knowledge, rich covariates, and sensitivity analysis are how you build a case for the assumption's plausibility.

In addition to CIA, all three families require overlap (also called common support or positivity):

0 < P(D = 1 \mid X = x) < 1 \quad \text{for all } x \text{ in the support of } X

Every covariate value observed in the treated group must also appear in the control group, and vice versa. Overlap violations cause all three estimators to break down — but the estimators break down in different ways.
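
Unlike CIA, overlap is checkable from the data. As a rough illustration (the function name and the 0.05/0.95 trimming band are conventions assumed here, not fixed rules), a diagnostic might summarize estimated propensity scores by treatment group:

```python
import numpy as np

def overlap_report(ps, d, lo=0.05, hi=0.95):
    """Summarize estimated propensity scores by treatment group.

    ps: estimated propensity scores; d: 0/1 treatment indicator.
    The [lo, hi] band is an illustrative convention, not a rule.
    """
    ps, d = np.asarray(ps, float), np.asarray(d, int)
    return {
        "treated_range": (ps[d == 1].min(), ps[d == 1].max()),
        "control_range": (ps[d == 0].min(), ps[d == 0].max()),
        "share_outside_band": float(np.mean((ps < lo) | (ps > hi))),
    }
```

A large share of scores outside the band, or group ranges that barely intersect, signals the kind of overlap failure that all three estimator families inherit.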

Matching Methods

Matching constructs the counterfactual by directly pairing treated units with similar control units. The treatment effect is the average difference in outcomes between matched pairs.

Exact Matching

The simplest form: match treated and control units that have identical covariate values. Exact matching is unbiased when feasible but suffers from the curse of dimensionality — with more than a few covariates, the probability of finding exact matches drops to zero. Exact matching works best with discrete covariates and small covariate sets.
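
The mechanics fit in a few lines. This is an illustrative ATT estimator (the helper name exact_match_att is made up here): each treated unit is compared to the mean outcome of controls with identical covariates, and treated units with no exact match are dropped and counted:

```python
from collections import defaultdict

def exact_match_att(y, d, x_rows):
    """ATT by exact matching on discrete covariates (illustrative sketch).

    y: outcomes; d: 0/1 treatment; x_rows: covariate tuples per unit.
    Treated units with no exact control match are dropped and counted.
    """
    controls = defaultdict(list)
    for yi, di, xi in zip(y, d, x_rows):
        if di == 0:
            controls[xi].append(yi)
    diffs, unmatched = [], 0
    for yi, di, xi in zip(y, d, x_rows):
        if di == 1:
            if controls[xi]:
                diffs.append(yi - sum(controls[xi]) / len(controls[xi]))
            else:
                unmatched += 1
    att = sum(diffs) / len(diffs) if diffs else float("nan")
    return att, unmatched
```

The unmatched count makes the curse of dimensionality visible: as covariates are added, it grows toward the full treated sample.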

Propensity Score Matching (PSM)

Rather than matching on all covariates directly, PSM matches on the propensity score e(X) = P(D = 1 | X), which collapses the covariate vector into a single scalar. The Rosenbaum-Rubin theorem shows that conditioning on the propensity score is sufficient for removing confounding from observed covariates.

PSM implementation choices:

  • Nearest-neighbor matching (with or without caliper): match each treated unit to the closest control unit(s) on the propensity score.
  • Kernel matching: weight all control units by a kernel function of the propensity score distance.
  • Radius matching: match each treated unit to all control units within a caliper radius.
  • With or without replacement: matching with replacement allows control units to be used multiple times, reducing bias but increasing variance.
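
A minimal sketch of the first variant, assuming propensity scores have already been estimated (the function name and caliper default are illustrative; a common rule of thumb instead sets the caliper to 0.2 standard deviations of the logit of the score):

```python
import numpy as np

def psm_att_nn(y, d, ps, caliper=0.05):
    """ATT via 1:1 nearest-neighbor matching on the propensity score,
    with replacement and a caliper (illustrative sketch)."""
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    yt, pt = y[d == 1], ps[d == 1]   # treated outcomes and scores
    yc, pc = y[d == 0], ps[d == 0]   # control outcomes and scores
    diffs = []
    for yi, pi in zip(yt, pt):
        j = np.argmin(np.abs(pc - pi))      # nearest control on the score
        if abs(pc[j] - pi) <= caliper:      # enforce the caliper
            diffs.append(yi - yc[j])        # treated units past it are dropped
    return float(np.mean(diffs)) if diffs else float("nan")
```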

Coarsened Exact Matching (CEM)

CEM coarsens continuous covariates into bins and then performs exact matching on the coarsened values. CEM avoids model dependence (no propensity score model needed) and bounds the maximum imbalance by the coarsening level. The tradeoff: coarser bins yield more matches but more within-bin heterogeneity; finer bins yield fewer matches but better balance.
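
A stripped-down version of the coarsening step for a single continuous covariate with equal-width bins (real CEM implementations coarsen every covariate and cross the bins; the names here are hypothetical):

```python
import numpy as np

def cem_strata(x_cont, d, n_bins=5):
    """Coarsen one continuous covariate into equal-width bins and keep
    only strata containing both treated and control units (sketch).

    Returns a boolean mask of retained units and the stratum ids.
    """
    x, d = np.asarray(x_cont, float), np.asarray(d, int)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    strata = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    keep = np.zeros(len(x), dtype=bool)
    for s in np.unique(strata):
        in_s = strata == s
        if d[in_s].any() and (1 - d[in_s]).any():  # both groups present
            keep |= in_s
    return keep, strata
```

The n_bins knob is exactly the coarsening tradeoff from the text: fewer, wider bins retain more units at the cost of within-bin heterogeneity.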

Strengths of Matching

  • Nonparametric: makes no assumptions about the functional form of the outcome model.
  • Transparent: you can inspect the matched sample and verify covariate balance directly.
  • Forces attention to overlap: units without good matches are naturally excluded, preventing extrapolation.

Weaknesses of Matching

  • Discards data: unmatched control units are thrown away, reducing precision.
  • Matching estimator choice sensitivity: results can depend on the distance metric, caliper, and number of matches.
  • PSM model dependence: propensity score matching is only as good as the propensity score model. A misspecified propensity score model produces bad matches.
  • No double protection: if the propensity score model is wrong, the estimate is biased — full stop.

Concept Check

You estimate propensity scores and find that 15% of treated units have propensity scores above 0.95 (very high). What does this pattern indicate and what should you do?

Inverse Probability Weighting (IPW)

IPW takes a different approach: rather than selecting a matched subsample, IPW reweights all observations so that the distribution of covariates is balanced between treated and control groups. The weight for each unit is the inverse of the probability of receiving the treatment the unit actually received.

For the ATE, the Horvitz-Thompson weights are:

w_i = \frac{D_i}{e(X_i)} + \frac{1 - D_i}{1 - e(X_i)}

For the ATT, treated units receive weight 1 and control units receive weight e(X_i) / (1 - e(X_i)).
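
A minimal Hajek (weight-normalized) IPW estimator of the ATE, assuming the propensity scores are already estimated (function name invented for illustration):

```python
import numpy as np

def ipw_ate_hajek(y, d, ps):
    """Hajek (normalized) IPW estimate of the ATE.

    Normalizing the weights within each arm tames some of the raw
    Horvitz-Thompson estimator's instability under extreme scores.
    """
    y, d, ps = (np.asarray(a, float) for a in (y, d, ps))
    w1 = d / ps                  # treated weights: 1 / e(X)
    w0 = (1 - d) / (1 - ps)      # control weights: 1 / (1 - e(X))
    mu1 = np.sum(w1 * y) / np.sum(w1)
    mu0 = np.sum(w0 * y) / np.sum(w0)
    return float(mu1 - mu0)
```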

Strengths of IPW

  • Uses all data: no observations are discarded (unlike matching), which can improve precision.
  • Simple to implement: estimate a propensity score model, compute weights, run a weighted regression or weighted mean comparison.
  • Semiparametric efficiency: the Hajek (normalized) IPW estimator can achieve the semiparametric efficiency bound under correct specification.

Weaknesses of IPW

  • Extreme weight sensitivity: when propensity scores are near 0 or 1, the inverse weights become very large, causing high variance and instability. A single unit with a propensity score of 0.01 receives a weight of 100, dominating the estimate.
  • Model dependence: like PSM, IPW relies entirely on a correctly specified propensity score model. If the model is wrong, the reweighting does not balance covariates, and the estimate is biased.
  • No outcome modeling: IPW does not model the relationship between covariates and outcomes. If the propensity score model is slightly wrong, there is no backup.

Doubly Robust Estimation (AIPW)

Doubly robust estimation, also called Augmented IPW (AIPW), combines an outcome model (predicting Y from X) with a propensity score model (predicting D from X). The estimator augments the IPW estimator with a bias correction term from the outcome model.

The AIPW estimator for the ATE is:

\hat{\tau}_{AIPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} + \hat{\mu}_1(X_i) \right] - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1-\hat{e}(X_i)} + \hat{\mu}_0(X_i) \right]

where \hat{\mu}_d(X) is the estimated conditional mean outcome under treatment d and \hat{e}(X) is the estimated propensity score.
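
The formula translates directly into code. A sketch, assuming the nuisance functions have already been fit and evaluated for every unit (names are illustrative):

```python
import numpy as np

def aipw_ate(y, d, e_hat, mu1_hat, mu0_hat):
    """AIPW / doubly robust ATE estimate from fitted nuisances.

    e_hat: estimated propensity scores; mu1_hat, mu0_hat: predicted
    outcomes under treatment and control for every unit.
    """
    y, d, e, m1, m0 = (np.asarray(a, float)
                       for a in (y, d, e_hat, mu1_hat, mu0_hat))
    psi1 = d * (y - m1) / e + m1                # augmented treated mean
    psi0 = (1 - d) * (y - m0) / (1 - e) + m0    # augmented control mean
    return float(np.mean(psi1 - psi0))
```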

The Double Robustness Property

The estimator is consistent if either model is correctly specified — you do not need both to be right. If the propensity score model is correct but the outcome model is wrong, the IPW component produces unbiased estimates. If the outcome model is correct but the propensity score model is wrong, the regression adjustment component produces unbiased estimates. Only when both models are wrong does the estimator break down.
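
A small simulation can make the property concrete. Here the propensity model is deliberately wrong (a constant 0.5, while true assignment depends on X), but the outcome models are correctly specified linear regressions, and the AIPW estimate still lands near the true ATE of 2 (the data-generating process and helper names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))   # true P(D=1|X) depends on x
y = 1 + 2 * d + x + rng.normal(size=n)      # true ATE is 2

def ols_predict(xa, ya, xq):
    """Fit y ~ 1 + x by least squares, then predict at xq."""
    X = np.column_stack([np.ones_like(xa), xa])
    beta, *_ = np.linalg.lstsq(X, ya, rcond=None)
    return beta[0] + beta[1] * xq

# Correct outcome models, fit separately within each arm.
m1 = ols_predict(x[d == 1], y[d == 1], x)
m0 = ols_predict(x[d == 0], y[d == 0], x)

e_wrong = np.full(n, 0.5)                   # badly misspecified propensity model
tau = (np.mean(d * (y - m1) / e_wrong + m1)
       - np.mean((1 - d) * (y - m0) / (1 - e_wrong) + m0))
print(tau)                                  # close to 2 despite the wrong e-hat
```

Swapping in the true propensity scores with a deliberately wrong outcome model (say, constant means) recovers the effect the other way around, which is the symmetric half of the double robustness claim.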

Strengths of DR/AIPW

  • Double protection: consistent under strictly weaker conditions than either matching/IPW or regression adjustment alone.
  • Semiparametric efficiency: AIPW achieves the semiparametric efficiency bound when both models are correctly specified, meaning no regular estimator can do better asymptotically.
  • Natural pairing with ML: the doubly robust structure is the foundation of Double/Debiased Machine Learning (DML), which uses flexible ML models for the nuisance parameters while maintaining valid inference.
  • Bias reduction: even when neither model is exactly right, AIPW typically has smaller bias than IPW or regression alone because the errors in the two models partially offset each other.

Weaknesses of DR/AIPW

  • Two models to specify: you must specify and estimate both an outcome model and a propensity score model, increasing the analyst's burden.
  • Extreme weights still matter: when propensity scores are extreme, the augmentation helps but does not fully solve the instability.
  • Not magic: "doubly robust" does not mean "always right." If both models are badly misspecified, AIPW can be severely biased.

Concept Check

Your propensity score model is a simple logit with linear terms, and the true treatment assignment mechanism involves complex interactions. Your outcome model is a flexible random forest. Under doubly robust estimation, is the treatment effect estimate consistent?

Head-to-Head Comparison

| Feature | Matching | IPW | Doubly Robust (AIPW) |
| --- | --- | --- | --- |
| Models needed | Propensity score (or distance metric) | Propensity score | Propensity score + outcome |
| Uses all data? | No (unmatched units dropped) | Yes | Yes |
| Robust to PS misspecification? | No | No | Yes (if outcome model correct) |
| Robust to outcome misspecification? | Yes (nonparametric) | N/A (no outcome model) | Yes (if PS model correct) |
| Efficiency | Lower (data discarded) | Moderate (can be high-variance) | Highest (semiparametric bound) |
| Extreme weight sensitivity | Low (trimmed by matching) | High | Moderate (augmentation helps) |
| Transparency | High (inspect matched sample) | Moderate (inspect weights) | Lower (two models interact) |

When to Use Each Method

Use Matching When:

  • You want maximal transparency and simplicity. Matching produces a subsample where treated and control units are comparable, and you can verify balance directly.
  • The research audience values nonparametric credibility. In fields where regression is viewed skeptically, matched comparisons carry persuasive weight.
  • Overlap is limited and you want the estimator to naturally restrict to the common support region.
  • You have a small number of discrete covariates where exact or CEM matching is feasible.

Use IPW When:

  • You want to use all data without discarding observations.
  • The propensity score model is well-specified and overlap is good (no extreme weights).
  • You are estimating the ATE (population-level effect) rather than the ATT. IPW naturally targets the ATE with Horvitz-Thompson weights.
  • The setting calls for survey-style reweighting, which IPW generalizes.

Use Doubly Robust (AIPW) When:

  • You want protection against single-model misspecification and can specify both an outcome model and a propensity score model. In settings with adequate overlap and reasonable nuisance estimation quality, AIPW's double robustness provides a meaningful safeguard.
  • You are working with high-dimensional covariates and plan to use flexible models (random forests, LASSO, neural networks) for the nuisance parameters. AIPW paired with cross-fitting is the foundation of DML.
  • You want to achieve the semiparametric efficiency bound, extracting maximal precision from your data when both models are well-specified.
  • You want a method that tends to be more resilient to mild misspecification than pure matching or pure IPW, though AIPW is not immune to finite-sample instability when propensity scores are extreme or nuisance models are poorly estimated.

Practical Workflow

Regardless of which estimator you choose, follow the same workflow:

  1. Specify and estimate the propensity score. Check balance. Examine the overlap region. Diagnose extreme scores.
  2. Estimate the treatment effect using your primary estimator (matching, IPW, or AIPW).
  3. Check covariate balance in the matched/reweighted sample. Standardized mean differences should be small (< 0.1 is a common threshold).
  4. Run robustness checks. Use all three estimators if feasible. Agreement across methods is reassuring.
  5. Conduct sensitivity analysis. Report Oster bounds, Cinelli-Hazlett robustness values, or E-values to assess how much unobserved confounding would be needed to explain your result.
  6. Report the overlap region. Show the propensity score distribution by treatment group. Report how many units fall outside common support and how you handled those units.
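
Step 3's balance check can be sketched as a weighted standardized mean difference for one covariate (the pooled-variance denominator used here is one common convention; others divide by the treated-group SD):

```python
import numpy as np

def standardized_mean_diff(x, d, w=None):
    """Weighted standardized mean difference for one covariate.

    x: covariate values; d: 0/1 treatment; w: optional matching or
    IPW weights (defaults to unweighted).
    """
    x, d = np.asarray(x, float), np.asarray(d, int)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    m1 = np.average(x[d == 1], weights=w[d == 1])
    m0 = np.average(x[d == 0], weights=w[d == 0])
    # Pool the unweighted group variances in the denominator.
    s = np.sqrt((x[d == 1].var(ddof=1) + x[d == 0].var(ddof=1)) / 2)
    return float((m1 - m0) / s)
```

Computed per covariate on the matched or reweighted sample, values under the 0.1 threshold from step 3 indicate acceptable balance.
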

Concept Check

You estimate the ATT using propensity score matching and get a coefficient of 0.15. You then estimate the ATT using AIPW and get 0.08. What is the most productive next step?