Matching (PSM, CEM, NN, Weighting)
Reduces selection bias by comparing treated units to similar control units based on observed characteristics.
Quick Reference
- When to Use
- When selection into treatment depends only on observed covariates (selection on observables / conditional independence), and you want a transparent, nonparametric comparison of treated and control units.
- Key Assumption
- Conditional independence (unconfoundedness): conditional on observed pre-treatment covariates, treatment assignment is independent of potential outcomes. Also requires overlap (common support) — for every covariate profile, both treated and control units must exist.
- Common Mistake
- Matching on post-treatment variables (which introduces collider bias), or failing to assess and report balance after matching. Checking standardized mean differences is a standard diagnostic step.
- Estimated Time
- 3 hours
One-Line Implementation
Stata: `teffects psmatch (outcome) (treatment x1 x2), atet`
R: `matchit(treatment ~ x1 + x2, data = df, method = "nearest", distance = "glm")`
Python: `CausalModel(Y, D, X).est_via_matching()  # causalinference package`
Motivating Example: Evaluating a Job Training Program
In the 1970s, the National Supported Work (NSW) Demonstration randomly assigned disadvantaged workers to a job training program. The experimental data showed the program raised earnings by about $1,800 per year.
But what if the experiment had never been run? Could you recover the treatment effect using observational data alone? Dehejia and Wahba (1999) took the treated group from the NSW experiment and matched them to comparison groups drawn from large survey datasets (the CPS and PSID). Using propensity score matching on pre-treatment covariates (age, education, earnings history, race, marital status), they recovered estimates remarkably close to the experimental benchmark.
This paper (Dehejia & Wahba, 1999) became both a prominent advertisement and a cautionary tale for matching. It showed matching can work, but only when you have the right covariates and the comparison group overlaps well with the treated group.
LaLonde (1986) had previously shown that naive observational methods (including OLS) failed badly at recovering the experimental benchmark. Matching was an improvement, but not a magic bullet.
A. Overview: The Idea Behind Matching
The Core Problem
In observational studies, treated and control units differ systematically. Comparing their average outcomes confounds the treatment effect with selection bias. Matching addresses this by finding, for each treated unit, one or more control units that are "similar" on observed characteristics.
The Selection-on-Observables Assumption
All matching methods rest on the conditional independence assumption:

$(Y_i(0), Y_i(1)) \perp D_i \mid X_i$

This condition says: conditional on observed covariates $X_i$, treatment assignment $D_i$ is independent of the potential outcomes $(Y_i(0), Y_i(1))$. In words, once you condition on the right set of observables, treated and control units are comparable, as if treatment were randomly assigned within strata of $X_i$. Conditional independence is a stronger assumption than what OLS needs (OLS requires only a weaker mean-independence condition), but matching gives you nonparametric identification of the treatment effect without relying on a specific functional form.
This assumption is also called conditional independence, unconfoundedness, or ignorability.
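The assumption can be made concrete with a tiny simulation. The sketch below (plain NumPy; the DGP and all parameter values are invented for illustration) builds in selection on a single observed binary confounder, then recovers the true effect by comparing treated and control units within strata of X, which is exact matching in its simplest form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy DGP: binary confounder X drives both treatment take-up and the outcome
X = rng.binomial(1, 0.5, n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))      # selection on X
Y = 2.0 * D + 3.0 * X + rng.normal(0, 1, n)          # true effect = 2

naive = Y[D == 1].mean() - Y[D == 0].mean()          # confounded, biased upward

# Within each stratum of X, treatment is as-good-as-random (CIA by construction)
strata_effects = [Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
                  for x in (0, 1)]
# Weight strata by their share among the TREATED to target the ATT
weights = [((D == 1) & (X == x)).sum() / (D == 1).sum() for x in (0, 1)]
att = sum(w * e for w, e in zip(weights, strata_effects))
print(f"naive = {naive:.2f}, stratified ATT = {att:.2f}")
```

The naive contrast absorbs the effect of X (it comes out near 3.8 here), while the stratified comparison recovers roughly 2.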
The Four Main Matching Approaches
| Method | How It Matches | Key Feature |
|---|---|---|
| Propensity Score Matching (PSM) | Match on estimated probability of treatment: $e(X_i) = P(D_i = 1 \mid X_i)$ | Reduces many covariates to one dimension; the most widely used approach |
| Coarsened Exact Matching (CEM) | Coarsen covariates into bins, then exact-match within bins | Avoids propensity score estimation; transparent |
| Nearest-Neighbor (NN) | Match each treated unit to the closest control unit(s) in covariate space | Simple and intuitive; can match on Mahalanobis distance |
| Inverse Probability Weighting (IPW) | Weight observations by inverse of propensity score | Uses all observations (no dropping); semi-parametric |
Common Confusions
A frequent misconception is that matching guarantees better balance. Propensity score matching in particular can increase imbalance and model dependence relative to other matching methods (King & Nielsen, 2019).
B. Identification
Propensity Score Theorem
Rosenbaum and Rubin (1983) proved a remarkable result: if conditional independence holds given $X_i$, then it also holds given only the propensity score $e(X_i)$:

$(Y_i(0), Y_i(1)) \perp D_i \mid e(X_i)$

This theorem (Rosenbaum & Rubin, 1983) reduces the matching problem from a high-dimensional covariate space to a single dimension. Instead of finding units identical on age, education, income, race, etc., you match on a single number: the estimated probability of being treated.
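As a sketch of the first step, the propensity score logit can be fit by Newton-Raphson in a few lines of NumPy. The `fit_logit` helper and the simulated data are illustrative, not a library API; in practice you would use a standard logit routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
true_beta = np.array([0.8, -0.5])                    # assumed selection coefficients
p = 1 / (1 + np.exp(-(X @ true_beta)))
D = rng.binomial(1, p)

def fit_logit(X, D, iters=25):
    """Maximum-likelihood logit via Newton-Raphson (hypothetical helper)."""
    Xc = np.column_stack([np.ones(len(X)), X])       # add intercept
    b = np.zeros(Xc.shape[1])
    for _ in range(iters):
        e = 1 / (1 + np.exp(-(Xc @ b)))              # fitted propensity scores
        W = e * (1 - e)
        H = Xc.T @ (Xc * W[:, None])                 # Hessian of the log-likelihood
        b += np.linalg.solve(H, Xc.T @ (D - e))      # Newton step
    return b, 1 / (1 + np.exp(-(Xc @ b)))

beta_hat, e_hat = fit_logit(X, D)
```

`e_hat` is the single number each unit is matched on; `beta_hat` should recover the assumed coefficients up to sampling error.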
Common Support
In addition to conditional independence, you need common support (or overlap):

$0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i$

For every value of the covariates, there must be both treated and control units. If treated units have no comparable controls (e.g., all high-income individuals are treated), the treatment effect is not identified in that region.
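A common operationalization of the overlap check is the min-max rule: keep only units whose estimated score falls inside the range observed in both groups. A sketch with simulated scores (the Beta parameters are arbitrary stand-ins for estimated propensity scores):

```python
import numpy as np

rng = np.random.default_rng(2)
e_treated = rng.beta(4, 2, 500)     # treated scores skew high (by assumption)
e_control = rng.beta(2, 4, 500)     # control scores skew low

# Min-max rule: the common-support region is the intersection of the two ranges
lo = max(e_treated.min(), e_control.min())
hi = min(e_treated.max(), e_control.max())
keep_treated = (e_treated >= lo) & (e_treated <= hi)
keep_control = (e_control >= lo) & (e_control <= hi)
print(f"overlap region [{lo:.2f}, {hi:.2f}], "
      f"treated kept: {keep_treated.mean():.0%}, controls kept: {keep_control.mean():.0%}")
```

If this rule drops many units, the estimand silently changes to the effect within the region of overlap, which is worth reporting explicitly.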
The Estimand: ATT vs. ATE
- ATT ($E[Y_i(1) - Y_i(0) \mid D_i = 1]$): For each treated unit, find a matched control and compute the difference. Most common with matching.
- ATE ($E[Y_i(1) - Y_i(0)]$): Match in both directions: find controls for treated units AND treated units for controls. Requires stronger overlap.
C. Visual Intuition
Imagine a two-dimensional scatterplot. The horizontal axis is age, the vertical axis is income. Red dots are treated individuals; blue dots are controls. Without matching, the red and blue dots occupy different regions — treated units are younger and lower-income.
Matching finds, for each red dot, the nearest blue dot (or a set of nearby blue dots). After matching, the remaining comparison group (the matched blues) has the same age/income distribution as the treated group. The distribution of covariates is balanced.
With propensity score matching, you collapse both dimensions into one score, align treated and control units on that score, and compare. The balance is not on any single covariate but on the overall probability of treatment.
Propensity Score Overlap
As selection into treatment grows stronger, the propensity score distributions for treated and control units separate, reducing common support. The matched estimate improves over the naive estimate only when overlap is sufficient.
Computed Results
- Common Support (% of treated)
- 82.0
- Naive (unadjusted) Estimate
- 3.20
- Matched Estimate
- 2.15
Why Matching?
DGP: D depends on X via logistic selection (strength = 1.0); Y = 2.0·D + 2·X + ε. N = 200 (95 treated, 105 control). 42 matched pairs formed.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| Naive diff. in means | 6.730 | 0.430 | [5.89, 7.57] | +4.730 |
| OLS controlling for X | 2.130 | 0.139 | [1.86, 2.40] | +0.130 |
| NN matching | 2.338 | 0.190 | [1.97, 2.71] | +0.338 |
| True β | 2.000 | — | — | — |
Why the difference?
The naive difference in means is biased (+4.73) because treatment assignment depends on X (selection strength = 1.0). Treated units have systematically different X values, and since X directly affects Y, the simple comparison confounds the treatment effect with the effect of X. OLS controlling for X removes most of the bias (β̂ = 2.130) by linearly adjusting for the confounding. Nearest-neighbor matching achieves a similar correction (β̂ = 2.338) by comparing each treated unit only to a control unit with a similar X value—ensuring apples-to-apples comparisons without imposing a linear functional form. Note: some treated units could not be matched, indicating limited overlap (common support). Matching drops these units, while OLS extrapolates.
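Under the stated DGP (logistic selection on X with strength 1.0 and Y = 2·D + 2·X + ε), a minimal NumPy sketch reproduces the qualitative pattern in the table: the naive difference is badly biased, while 1:1 nearest-neighbor matching on X lands near the true effect of 2. The sample size and the use of matching with replacement are my own choices for the sketch, so the exact numbers differ from the table.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-1.0 * X))                 # logistic selection, strength 1.0
D = rng.binomial(1, e)
Y = 2.0 * D + 2.0 * X + rng.normal(size=n)     # true treatment effect = 2

naive = Y[D == 1].mean() - Y[D == 0].mean()    # confounded comparison

# 1:1 nearest-neighbor matching on X, with replacement
Xt, Yt = X[D == 1], Y[D == 1]
Xc, Yc = X[D == 0], Y[D == 0]
nn = np.abs(Xt[:, None] - Xc[None, :]).argmin(axis=1)  # closest control per treated
att = (Yt - Yc[nn]).mean()
print(f"naive = {naive:.2f}, matched ATT = {att:.2f}")
```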
Common Support Detective
Matching and weighting estimators require common support: for each treated unit, there should be comparable control units (and vice versa). Explore how selection strength affects the overlap of propensity score distributions and what happens to the ATT estimate when you trim to the region of common support.
Common Support
| Overlap region | [0.24, 0.88] |
| Units in support | 142 T + 138 C |
| Overlap fraction | 93% |
ATT Estimates
| Method | ATT |
|---|---|
| Naive (full sample) | 5.059 |
| Naive (common support) | 4.637 |
| True ATT | 3.445 |
| Change from trimming | 0.422 |
Good overlap. 93% of units are in common support. Trimming has a small effect on the estimate, suggesting the comparison is well-supported across the propensity score distribution.
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: For each treated unit, find one or more controls with similar propensity scores and take the average difference. This removes bias from observed confounders.
Propensity score estimation (first step):
Estimate $e(X_i) = P(D_i = 1 \mid X_i)$ using a logit:

$\hat{e}(X_i) = \frac{\exp(X_i'\hat{\gamma})}{1 + \exp(X_i'\hat{\gamma})}$
Matching (second step):
For each treated unit $i$ (with $D_i = 1$), define the matched set $\mathcal{M}(i)$ as the control unit(s) closest in propensity score:

$\mathcal{M}(i) = \{ j : D_j = 0,\ |\hat{e}(X_i) - \hat{e}(X_j)| \text{ minimal} \}$
ATT estimation:

$\hat{\tau}_{ATT} = \frac{1}{N_1} \sum_{i: D_i = 1} \left( Y_i - \frac{1}{|\mathcal{M}(i)|} \sum_{j \in \mathcal{M}(i)} Y_j \right)$
IPW alternative (for the ATT), in its normalized form:

$\hat{\tau}_{ATT}^{IPW} = \frac{1}{N_1} \sum_{i: D_i = 1} Y_i \;-\; \frac{\sum_{i: D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)} Y_i}{\sum_{i: D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}}$
The IPW estimator reweights the control group to look like the treated group, using the propensity score. The weights tilt the control distribution toward the treated covariate profile.
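A minimal sketch of the normalized IPW-ATT estimator. The toy DGP is invented, and for simplicity it plugs in the true propensity score; in practice you would use the estimated $\hat{e}(X_i)$ from the first step.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                       # true score, known here by assumption
D = rng.binomial(1, e)
Y = 2.0 * D + 2.0 * X + rng.normal(size=n)     # true effect = 2

# ATT weights: treated units get weight 1; controls get e/(1-e),
# which tilts the control distribution toward the treated covariate profile
w_control = e[D == 0] / (1 - e[D == 0])
att_ipw = Y[D == 1].mean() - np.average(Y[D == 0], weights=w_control)
print(f"IPW ATT = {att_ipw:.2f}")
```

Note that every control observation is used (none are dropped), in contrast to nearest-neighbor matching.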
Standard errors: Because the propensity score is estimated, standard errors must account for this estimation uncertainty. Abadie and Imbens (2008) showed that the bootstrap is invalid for nearest-neighbor matching; use the analytical formula from Abadie and Imbens (2006) instead. For IPW, bootstrap or robust SEs work.
E. Implementation
library(MatchIt)
library(cobalt)
# Propensity score matching (nearest neighbor, 1:1)
m_out <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "nearest", distance = "logit",
replace = FALSE, ratio = 1)
# Check balance
summary(m_out)
love.plot(m_out, thresholds = c(m = .1)) # Standardized mean difference threshold
# Extract matched data and estimate treatment effect
m_data <- match.data(m_out)
library(lmtest)
library(sandwich)
fit <- lm(earnings ~ training, data = m_data, weights = weights)
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
# CEM
m_cem <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "cem")
summary(m_cem)
# IPW
library(WeightIt)
w_out <- weightit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "ps", estimand = "ATT")
bal.tab(w_out, thresholds = c(m = .1))

F. Diagnostics: Balance Is Everything
Standardized Differences
The most widely recommended metric for assessing balance. For each covariate:

$d = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{(s_1^2 + s_0^2)/2}}$

where $\bar{X}_1, \bar{X}_0$ are the treated and control means and $s_1^2, s_0^2$ the corresponding sample variances. A common rule of thumb flags $|d| > 0.1$ as meaningful imbalance.
Common Support Diagnostics
- Plot the distribution of propensity scores for treated and control groups. They should overlap substantially.
- Trim observations outside the region of common support (propensity scores below the minimum or above the maximum of the other group).
- If trimming drops many observations, your comparison is fragile and results may not generalize.
Variance Ratios
Beyond means, check that the variance of each covariate is similar across groups after matching. A variance ratio (treated/control) far from 1.0 indicates remaining imbalance in the tails of the distribution.
Distribution Checks
For critical covariates, compare the full distributions (not just means) using QQ plots or Kolmogorov-Smirnov tests.
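The three diagnostics above are straightforward to compute directly. A small Python sketch, where the function names and the example "age" data are hypothetical:

```python
import numpy as np

def smd(x_t, x_c):
    """Standardized mean difference with pooled SD."""
    s = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / s

def variance_ratio(x_t, x_c):
    """Treated/control variance ratio; values far from 1 flag tail imbalance."""
    return x_t.var(ddof=1) / x_c.var(ddof=1)

def ks_stat(x_t, x_c):
    """Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    grid = np.sort(np.concatenate([x_t, x_c]))
    F_t = np.searchsorted(np.sort(x_t), grid, side="right") / len(x_t)
    F_c = np.searchsorted(np.sort(x_c), grid, side="right") / len(x_c)
    return np.abs(F_t - F_c).max()

rng = np.random.default_rng(5)
age_t = rng.normal(30, 5, 400)       # hypothetical matched treated ages
age_c = rng.normal(31, 5, 400)       # hypothetical matched control ages
print(f"SMD = {smd(age_t, age_c):.2f}, "
      f"VR = {variance_ratio(age_t, age_c):.2f}, "
      f"KS = {ks_stat(age_t, age_c):.2f}")
```

In R, `cobalt::bal.tab()` reports the same quantities for a fitted `matchit` or `weightit` object.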
Interpreting Results
- Report either the ATT (effect on the treated) or the ATE, and be explicit about which one you estimate.
- Matching estimates are only as credible as the conditional independence assumption. If you cannot convincingly argue that you have captured all confounders, say so.
- Report both the matched and unmatched estimates. The difference tells you how much bias matching removes (at least for observed confounders).
- Consider sensitivity analysis (Rosenbaum bounds) to assess how much unobserved confounding would be needed to overturn your results.
G. What Can Go Wrong
| Problem | What It Does | How to Fix It |
|---|---|---|
| Unobserved confounders | Matching cannot fix what it cannot see — estimates are biased | Sensitivity analysis (Rosenbaum bounds); combine with IV or DiD |
| Matching on post-treatment variables | Introduces collider bias | Match only on pre-treatment covariates |
| Poor common support | Comparisons are based on extrapolation | Trim, restrict sample, report trimmed and full results |
| Over-reliance on PSM | PSM can worsen balance (King & Nielsen, 2019) | Consider CEM, entropy balancing, or doubly robust methods |
| Wrong standard errors | Bootstrap is invalid for NN matching | Use Abadie-Imbens analytical SEs; or use IPW with robust SEs |
| Too many matching variables | Curse of dimensionality; poor matches | Use propensity score or dimension-reducing methods |
| Caliper too wide | Matches are poor quality | Tighten the caliper; report balance |
Matching on a Post-Treatment Variable (Collider Bias)
- Fix: Match job training participants to non-participants using only pre-treatment covariates: age, education, prior earnings (re74, re75), race, marital status.
- Result: ATT estimate = $1,794 (close to the NSW experimental benchmark of $1,800). Balance is good on all pre-treatment covariates.
Poor Common Support (Extrapolation)
- Fix: Verify that treated and control groups have substantial overlap in propensity score distributions, and trim units outside common support.
- Result: After trimming the 5% of treated units outside common support, ATT = $1,650 (SE = $520). 95% of treated units have well-matched controls. Results are stable to trimming thresholds.
Using Bootstrap Standard Errors with Nearest-Neighbor Matching
- Fix: Use Abadie-Imbens analytical standard errors for nearest-neighbor matching.
- Result: ATT = $1,794, Abadie-Imbens SE = $680, 95% CI: [$461, $3,127]. Coverage is correct.
H. Practice
After propensity score matching, you find that the standardized difference for pre-treatment earnings (re75) is 0.03, but the standardized difference for age is 0.22. What should you do?
A researcher matches treated and control units on propensity scores and finds excellent balance: all standardized differences are below 0.05. She concludes that the matching estimate is now as credible as an experimental estimate. What is wrong with this claim?
You estimate the ATT of a job training program using propensity score matching. After matching, you find that 40% of treated units are dropped because they fall outside the region of common support. What should you be most concerned about?
A researcher matches charter school students to public school students using pre-treatment test scores, parental income, and race. She also includes 'current school satisfaction' as a matching variable. What is the problem?
After nearest-neighbor propensity score matching (1:1 without replacement), a researcher uses bootstrap standard errors with 500 replications. A reviewer objects. Why?
Propensity Score Matching: Job Training and Earnings
A policy analyst wants to estimate the effect of a voluntary job training program on participants' earnings two years later. Participants self-selected into training, so treated and control workers may differ in age, education, and prior work history. She plans to use propensity score matching to create a comparable control group.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
A management study examines whether firms that adopt corporate social responsibility (CSR) practices have better financial performance. The authors use propensity score matching on firm size, industry, profitability, and leverage. They match CSR-adopting firms to non-adopting firms and find a positive effect of CSR on Tobin's Q. They conclude that CSR causes better firm performance.
Key Table
| Variable | Pre-match SD | Post-match SD |
|---|---|---|
| Firm size (log) | 0.45 | 0.04 |
| Industry | 0.12 | 0.02 |
| Profitability | 0.38 | 0.07 |
| Leverage | 0.22 | 0.09 |
| N (treated) | 500 | 480 |
| N (control) | 3,200 | 480 |
Authors' Identification Claim
After propensity score matching, treated and control firms are balanced on all observed characteristics. Therefore, the difference in Tobin's Q reflects the causal effect of CSR adoption.
I. Swap-In: When to Use Something Else
- Difference-in-differences: When a policy change creates a natural experiment with temporal variation — DiD does not require selection on observables alone.
- IV / 2SLS: When an instrument is available to address selection on unobservables that matching cannot handle.
- Doubly robust estimation: When you want robustness to misspecification of either the outcome model or the propensity score model — combines matching/weighting logic with regression adjustment.
- Double/debiased machine learning (DML): When the covariate space is high-dimensional and linear specifications may miss important nonlinearities. DML uses machine learning for nuisance estimation with valid post-selection inference.
- Entropy balancing: When exact moment-balance on covariates is desired without discarding observations — Hainmueller (2012) provides a reweighting approach that guarantees balance by construction.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (9)
Rosenbaum, P. R., & Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects.
This paper introduced propensity score matching. Rosenbaum and Rubin showed that instead of matching on many covariates simultaneously, you can match on a single number—the propensity score (predicted probability of treatment)—and this is sufficient to remove selection bias under the assumption of no unobserved confounders.
Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.
Heckman, Ichimura, and Todd developed the econometric theory behind matching estimators, including conditions for identification and the importance of common support. They applied these methods to evaluate job training programs and showed when matching works well and when it does not.
Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching.
This paper introduced Coarsened Exact Matching (CEM), which coarsens covariates into bins and then performs exact matching within those bins. CEM avoids many pitfalls of propensity score matching, such as the need to check balance iteratively, and gives the researcher direct control over the matching quality.
Abadie, A., & Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects.
Abadie and Imbens derived the large-sample properties of nearest-neighbor matching estimators and showed that the standard bootstrap is not valid for inference with matching. They proposed a bias-corrected estimator and proper variance formula that have become standard in practice.
Abadie, A., & Imbens, G. W. (2011). Bias-Corrected Matching Estimators for Average Treatment Effects.
Abadie and Imbens developed bias-corrected matching estimators that adjust for the finite-sample bias inherent in nearest-neighbor matching when matching is not exact. Their bias correction uses a regression adjustment within matched pairs and has become a standard recommendation for applied researchers using matching methods.
Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of Multivalued Treatment Effects Under Conditional Independence.
Cattaneo, Drukker, and Holland extended matching and inverse probability weighting methods to settings with multi-valued (rather than binary) treatments, developing estimators for dose-response functions under conditional independence. Their accompanying Stata implementation made these methods readily accessible to applied researchers.
LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data.
LaLonde compared econometric estimates of a job training program's effect with experimental benchmarks from a randomized trial, finding that non-experimental methods often failed to replicate the experimental results. This paper established the standard test bed for evaluating matching and other observational causal methods.
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.
Hainmueller introduced entropy balancing, a reweighting scheme that directly targets covariate balance by finding weights that satisfy pre-specified balance constraints while remaining as close to uniform as possible. Entropy balancing has become a popular alternative to propensity score matching because it achieves exact balance on specified moments by construction.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.
Argues that matching should be used as a preprocessing step before parametric modeling, reducing model dependence and improving robustness of causal estimates. This influential paper reframed matching not as a standalone estimator but as a way to make subsequent parametric analyses less sensitive to specification choices.
Application (6)
Dehejia, R. H., & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.
Dehejia and Wahba showed that propensity score matching could replicate experimental estimates of a job training program using observational data. This influential paper demonstrated the practical value of matching and made propensity score methods mainstream in applied social science.
Villalonga, B., & Amit, R. (2006). How Do Family Ownership, Control and Management Affect Firm Value?.
This paper studied how different forms of family involvement in firms affect value, using matching and regression methods to compare family and non-family firms. It illustrates how matching can help address selection issues in corporate governance research.
Azoulay, P., Graff Zivin, J. S., & Wang, J. (2010). Superstar Extinction.
Azoulay and coauthors used propensity score matching to construct a control group of scientists who did not experience the unexpected death of a 'superstar' collaborator. They found that the death of a superstar leads to a lasting decline in the productivity of their collaborators. This study is an elegant application of matching in the economics of science and innovation.
Kaul, A., Klossner, S., Pfeifer, G., & Schieler, M. (2022). Standard Synthetic Control Methods: The Case of Using a False Predictor.
While focused on synthetic control (a form of matching for aggregate units), this paper highlights pitfalls when matching on pre-treatment outcomes and is relevant for understanding matching assumptions more broadly. [UNVERIFIED - publication year/details may differ from working paper version]
Arrfelt, M., Wiseman, R. M., & Hult, G. T. M. (2013). Looking Backward Instead of Forward: Aspiration-Driven Influences on the Efficiency of the Capital Allocation Process.
This paper used propensity score matching alongside other methods to study how performance relative to aspirations affects capital allocation in diversified firms. Published in AMJ, it is an example of how matching methods have been adopted in top management journals to address selection concerns.
Imbens, G. W. (2015). Matching Methods in Practice: Three Examples.
Imbens demonstrated how to implement matching methods in practice through three detailed empirical examples, covering propensity score estimation, covariate balance assessment, and sensitivity analysis. This paper is an invaluable practical guide that bridges the gap between matching theory and applied research.
Survey (6)
King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching.
King and Nielsen argue that propensity score matching can increase imbalance, model dependence, and bias relative to other matching methods. This provocative paper has influenced a shift toward alternatives like CEM and Mahalanobis distance matching in applied research.
Shipman, J. E., Swanquist, Q. T., & Whited, R. L. (2017). Propensity Score Matching in Accounting Research.
This paper reviews how propensity score matching has been used (and sometimes misused) in accounting research. It provides practical guidelines on common pitfalls such as matching on post-treatment variables, inadequate balance checks, and ignoring the unconfoundedness assumption.
Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.
Imbens provided a comprehensive review of nonparametric methods for estimating average treatment effects under the unconfoundedness assumption, covering matching, weighting, and subclassification estimators. This survey unified the theoretical foundations of matching methods and clarified the connections between different estimators used in program evaluation.
Rosenbaum, P. R. (2002). Observational Studies.
The definitive textbook on observational study design, covering matching, sensitivity analysis, and design principles for drawing causal inferences from non-experimental data. Rosenbaum's framework for sensitivity analysis (Rosenbaum bounds) is the standard tool for assessing how much unobserved confounding would be needed to overturn a matching-based finding.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
A comprehensive textbook grounding causal inference in the potential outcomes framework, with detailed treatment of matching, propensity scores, and subclassification. Provides rigorous foundations for selection-on-observables methods.
Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward.
A comprehensive review of matching methods including propensity score matching, Mahalanobis distance matching, and coarsened exact matching, with practical guidance on implementation. Provides an accessible overview of when and how to use different matching approaches.