Matching (PSM, CEM, NN, Weighting)
Reduces selection bias by comparing treated units to similar control units based on observed characteristics.
One-Line Implementation
R: matchit(treatment ~ x1 + x2, data = df, method = "nearest", distance = "glm")
Stata: teffects psmatch (outcome) (treatment x1 x2), atet
Python: CausalModel(Y, D, X).est_via_matching()  # causalinference
Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: Evaluating a Job Training Program
In the 1970s, the National Supported Work (NSW) Demonstration randomly assigned disadvantaged workers to a job training program. The experimental data showed the program raised earnings by about $1,800 per year.
But what if the experiment had never been run? Could you recover the treatment effect using observational data alone? Dehejia and Wahba (1999) took the treated group from the NSW experiment and matched them to comparison groups drawn from large survey datasets (the CPS and PSID). Using matching on pre-treatment covariates (age, education, earnings history, race, marital status), they recovered estimates remarkably close to the experimental benchmark.
This paper became both a prominent advertisement and a cautionary tale for matching. It showed matching can work — but only when you have the right covariates and the comparison group overlaps well with the treated group. Matching has since become a standard tool across disciplines — for example, Azoulay et al. (2014) employed matching to test whether the Matthew effect in science reflects genuine cumulative advantage or selection.
LaLonde (1986) had previously shown that naive observational methods (including OLS) failed badly at recovering the experimental benchmark. Matching was an improvement, but not a magic bullet.
Overview
The Core Problem
In observational studies, treated and control units differ systematically. Comparing their average outcomes confounds the treatment effect with selection bias. Matching addresses this by finding, for each treated unit, one or more control units that are "similar" on observed characteristics.
The Selection-on-Observables Assumption
All matching methods rest on the conditional independence assumption:

$$(Y_i(0), Y_i(1)) \perp D_i \mid X_i$$

This condition says: conditional on observed covariates $X_i$, treatment assignment $D_i$ is independent of the potential outcomes $(Y_i(0), Y_i(1))$. In words, once you condition on the right set of observables, treated and control units are comparable, as if treatment were randomly assigned within strata of $X$. This is the same identifying assumption needed for any selection-on-observables approach, including OLS when used for causal inference. The advantage of matching is that it achieves nonparametric identification of the treatment effect without relying on a specific functional form for the outcome equation.

This assumption is also called selection on observables, unconfoundedness, or ignorability.
The Four Main Matching Approaches
| Method | How It Matches | Key Feature |
|---|---|---|
| Propensity Score Matching (PSM) | Match on estimated probability of treatment: $e(X_i) = P(D_i = 1 \mid X_i)$ | Collapses many covariates into a single scalar score |
| Coarsened Exact Matching (CEM) | Coarsen covariates into bins, then exact-match within bins | Avoids propensity score estimation; transparent |
| Nearest-Neighbor (NN) | Match each treated unit to the closest control unit(s) in covariate space | Simple and intuitive; can match on Mahalanobis distance |
| Inverse Probability Weighting (IPW) | Weight observations by inverse of propensity score | Uses all observations (no dropping); semi-parametric |
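The CEM row can be made concrete with a minimal numpy sketch (toy data and bin widths of my choosing; the function name `cem_att` is illustrative, and real analyses should use MatchIt's method = "cem"):

```python
import numpy as np

def cem_att(X_bins, D, Y):
    """Toy CEM: group units by their coarsened covariate signature, keep only
    strata containing both treated and control units, and average the
    within-stratum treated-minus-control mean differences, weighting each
    stratum by its number of treated units (an ATT)."""
    strata = {}
    for i, key in enumerate(map(tuple, X_bins)):
        strata.setdefault(key, []).append(i)
    diffs, n_treated = [], []
    for idx in strata.values():
        idx = np.array(idx)
        t, c = idx[D[idx] == 1], idx[D[idx] == 0]
        if len(t) and len(c):                     # require overlap in the stratum
            diffs.append(Y[t].mean() - Y[c].mean())
            n_treated.append(len(t))
    return np.average(diffs, weights=n_treated)

# Hypothetical data: coarsen age into decades, prior earnings into $5k bins
age  = np.array([23, 25, 34, 36, 45, 24, 33, 46, 27, 35])
re75 = np.array([4000, 6000, 9000, 11000, 14000, 4500, 9500, 14500, 6500, 11500])
D    = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
Y    = np.array([9000, 11000, 13000, 15000, 17000, 7000, 11000, 15000, 9000, 13000])
X_bins = np.column_stack([age // 10, re75 // 5000])
att = cem_att(X_bins, D, Y)
```

Unmatched strata are simply dropped, which is why CEM changes the estimand when many treated units fall in strata with no controls.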
Common Confusions
Identification
Propensity Score Theorem
Rosenbaum and Rubin (1983) proved a remarkable result: if conditional independence holds given $X_i$, then it also holds given only the propensity score $e(X_i)$:

$$(Y_i(0), Y_i(1)) \perp D_i \mid e(X_i)$$
This theorem reduces the matching problem from high-dimensional covariate space to a single dimension. Instead of finding units identical on age, education, income, race, etc., you match on a single number — the estimated probability of being treated.
Common Support
In addition to conditional independence, you need common support (or overlap):

$$0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i$$
For every value of the covariates, there must be both treated and control units. If treated units have no comparable controls (e.g., all high-income individuals are treated), the treatment effect is not identified in that region.
SUTVA (Stable Unit Treatment Value Assumption)
No interference between units: one unit's treatment assignment does not affect another unit's outcomes. In matching, this means that being matched (or unmatched) does not change anyone's potential outcomes, and there are no hidden variations of treatment: each treated unit receives the same version of the intervention. If the job training program displaced untrained workers from jobs, the control group's outcomes would be contaminated by interference.
The Estimand: ATT vs. ATE
- ATT ($E[Y_i(1) - Y_i(0) \mid D_i = 1]$): For each treated unit, find a matched control and compute the difference. Most common with matching.
- ATE ($E[Y_i(1) - Y_i(0)]$): Match in both directions: find controls for treated units AND find treated units for controls. Requires stronger overlap.
Visual Intuition
Imagine a two-dimensional scatterplot. The horizontal axis is age, the vertical axis is income. Red dots are treated individuals; blue dots are controls. Without matching, the red and blue dots occupy different regions — treated units are younger and lower-income.
Matching finds, for each red dot, the nearest blue dot (or a set of nearby blue dots). After matching, the remaining comparison group (the matched blues) has the same age/income distribution as the treated group. The distribution of covariates is balanced.
With propensity score matching, you collapse both dimensions into one score, align treated and control units on that score, and compare. The balance is not on any single covariate but on the overall probability of treatment.
Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: For each treated unit, find one or more controls with similar propensity scores and take the average difference. This procedure removes bias from observed confounders.
Propensity score estimation (first step):
Estimate $e(X_i) = P(D_i = 1 \mid X_i)$ using logit:

$$\hat{e}(X_i) = \frac{\exp(X_i'\hat{\beta})}{1 + \exp(X_i'\hat{\beta})}$$
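As a language-agnostic illustration of this first step, here is a minimal Newton-Raphson logit fit in numpy (toy data; `fit_logit` is a hypothetical helper, not a library function; in practice R's glm() or Python's statsmodels does this):

```python
import numpy as np

def fit_logit(X, D, iters=10):
    """Hypothetical helper: logit propensity model fit by Newton-Raphson.
    X: (n, k) covariate matrix (intercept added here); D: 0/1 treatment."""
    Xc = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xc.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xc @ beta))   # fitted P(D=1|X) at current beta
        grad = Xc.T @ (D - p)                  # score of the log-likelihood
        hess = Xc.T @ (Xc * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)    # Newton step
    return 1.0 / (1.0 + np.exp(-Xc @ beta))   # estimated propensity scores

# Toy data: treatment becomes more likely at higher x
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
D = np.array([0, 0, 1, 0, 1, 1])
e_hat = fit_logit(x, D)   # scores lie strictly in (0, 1) and increase with x
```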
Matching (second step):
For each treated unit $i$ with $D_i = 1$, define the matched set $\mathcal{M}(i)$ as the control unit(s) closest in propensity score:

$$\mathcal{M}(i) = \underset{j:\, D_j = 0}{\arg\min}\; \big| \hat{e}(X_i) - \hat{e}(X_j) \big|$$
ATT estimation:

$$\hat{\tau}_{\text{ATT}} = \frac{1}{N_1} \sum_{i:\, D_i = 1} \Big( Y_i - \frac{1}{|\mathcal{M}(i)|} \sum_{j \in \mathcal{M}(i)} Y_j \Big)$$
IPW alternative (for ATT):

$$\hat{\tau}_{\text{ATT}}^{\text{IPW}} = \frac{1}{N_1} \sum_{i:\, D_i = 1} Y_i \;-\; \frac{\sum_{i:\, D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}\, Y_i}{\sum_{i:\, D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}}$$
The IPW estimator reweights the control group to look like the treated group, using the propensity score. The weights tilt the control distribution toward the treated covariate profile.
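Both estimators can be sketched directly in numpy (hypothetical scores and outcomes; function names are illustrative, and in practice the scores come from the first-step logit):

```python
import numpy as np

def att_nn(e, D, Y):
    """1:1 nearest-neighbor matching on the propensity score, with replacement:
    each treated unit is compared to the control whose score is closest."""
    treated  = np.where(D == 1)[0]
    controls = np.where(D == 0)[0]
    diffs = [Y[i] - Y[controls[np.argmin(np.abs(e[controls] - e[i]))]]
             for i in treated]
    return float(np.mean(diffs))

def att_ipw(e, D, Y):
    """IPW for the ATT: controls are reweighted by e/(1-e) so their
    distribution tilts toward the treated covariate profile."""
    w = e[D == 0] / (1.0 - e[D == 0])
    return float(Y[D == 1].mean() - np.average(Y[D == 0], weights=w))

# Hypothetical estimated scores and outcomes
e = np.array([0.80, 0.60, 0.40, 0.75, 0.55, 0.35])
D = np.array([1, 1, 1, 0, 0, 0])
Y = np.array([10.0, 8.0, 6.0, 7.0, 5.0, 3.0])
att_match  = att_nn(e, D, Y)   # each treated unit pairs with its nearest control
att_weight = att_ipw(e, D, Y)  # same estimand, different use of the controls
```

The two estimators target the same ATT but generally differ in finite samples, since matching uses only the nearest controls while IPW uses all of them.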
Standard errors: Because the propensity score is estimated, standard errors must account for this estimation uncertainty. Abadie and Imbens (2008) showed that the standard nonparametric bootstrap is inconsistent for nearest-neighbor matching with replacement; use the analytical formula from Abadie and Imbens (2006) instead. For IPW, bootstrap or robust SEs work.
Implementation
# Requires: MatchIt, cobalt, lmtest, sandwich, WeightIt
library(MatchIt)
library(cobalt)
# --- Step 1: Propensity Score Matching ---
# matchit() estimates a logit model for P(treatment|X), then matches
# treated units to their nearest control by propensity score distance.
# method="nearest" = greedy nearest-neighbor; replace=FALSE = without replacement
# ratio=1 = one control per treated unit (1:1 matching)
m_out <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "nearest", distance = "glm",
replace = FALSE, ratio = 1)
# --- Step 2: Check Covariate Balance ---
# Balance diagnostics are essential: matching only works if it creates
# comparable groups. summary() shows standardized mean differences (SMDs).
summary(m_out)
# love.plot() visualizes balance; threshold of 0.1 SMD is a common standard.
# All covariates should fall below the threshold after matching.
love.plot(m_out, thresholds = c(m = .1))
# --- Step 3: Estimate Treatment Effect on Matched Data ---
# match.data() extracts the matched sample with matching weights.
# Regress the outcome on treatment using matched data with weights.
m_data <- match.data(m_out)
library(lmtest)
library(sandwich)
fit <- lm(earnings ~ training, data = m_data, weights = weights)
# HC1 robust SEs account for heteroskedasticity in the matched sample.
# The coefficient on "training" estimates the ATT.
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
# --- Step 4: Coarsened Exact Matching (CEM) ---
# CEM coarsens continuous covariates into bins, then matches exactly on bins.
# Guarantees balance within strata; unmatched units are discarded.
m_cem <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "cem")
summary(m_cem)
# --- Step 5: Inverse Probability Weighting (IPW) ---
# IPW reweights observations by 1/P(treatment|X) to create a pseudo-population
# where treatment is independent of covariates. estimand="ATT" targets
# the average treatment effect on the treated.
library(WeightIt)
w_out <- weightit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "glm", estimand = "ATT")
# Check balance after weighting; threshold of 0.1 SMD
bal.tab(w_out, thresholds = c(m = .1))
Diagnostics
Standardized Differences
A widely recommended metric for assessing balance. For each covariate:

$$d = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{(s_1^2 + s_0^2)/2}}$$

where $\bar{X}_1, \bar{X}_0$ are the treated and control means and $s_1^2, s_0^2$ the sample variances.
Common Support Diagnostics
- Plot the distribution of propensity scores for treated and control groups. They should overlap substantially.
- Trim observations outside the region of common support (propensity scores below the minimum or above the maximum of the other group).
- If trimming drops many observations, your comparison is fragile and results may not generalize.
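The trimming rule in the second bullet can be sketched in a few lines of numpy (illustrative names; `common_support_mask` is a hypothetical helper, assuming estimated scores `e` and treatment indicator `D`):

```python
import numpy as np

def common_support_mask(e, D):
    """Keep units whose propensity score lies in the overlap region:
    [max of the two group minima, min of the two group maxima]."""
    lo = max(e[D == 1].min(), e[D == 0].min())
    hi = min(e[D == 1].max(), e[D == 0].max())
    return (e >= lo) & (e <= hi)

# Hypothetical scores: one treated unit (0.95) has no comparable control
e = np.array([0.05, 0.20, 0.50, 0.70, 0.95, 0.10, 0.30, 0.60, 0.80])
D = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0])
keep = common_support_mask(e, D)  # drops the 0.95 treated unit and low-score controls
```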
Variance Ratios
Beyond means, check that the variance of each covariate is similar across groups after matching. A variance ratio (treated/control) far from 1.0 indicates remaining imbalance in the tails of the distribution.
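Both diagnostics take only a few lines of numpy (toy data chosen so the means balance but the variances do not, exactly the case the variance ratio is meant to catch):

```python
import numpy as np

def smd(x_t, x_c):
    """Standardized mean difference: mean gap over the pooled SD
    sqrt((s_t^2 + s_c^2) / 2)."""
    pooled = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return (x_t.mean() - x_c.mean()) / pooled

def variance_ratio(x_t, x_c):
    """Treated/control sample variance ratio; far from 1.0 flags
    imbalance in spread even when means agree."""
    return x_t.var(ddof=1) / x_c.var(ddof=1)

# Toy covariate: identical means (SMD = 0) but wider control tails
x_t = np.array([30.0, 32.0, 34.0, 36.0])
x_c = np.array([29.0, 31.0, 35.0, 37.0])
d  = smd(x_t, x_c)              # 0.0: means are balanced
vr = variance_ratio(x_t, x_c)   # 0.5: treated variance is half the control variance
```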
Distribution Checks
For critical covariates, compare the full distributions (not just means) using QQ plots or Kolmogorov-Smirnov tests.
Interpreting Your Results
- Be explicit about which estimand you report: the ATT (effect on the treated) or the ATE.
- Matching estimates are only as credible as the conditional independence assumption. If you cannot convincingly argue that you have captured all confounders, say so.
- Report both the matched and unmatched estimates. The difference tells you how much bias matching removes (at least for observed confounders).
- Consider sensitivity analysis (Rosenbaum bounds) to assess how much unobserved confounding would be needed to overturn your results (Rosenbaum, 2002).
What Can Go Wrong
| Problem | What It Does | How to Fix It |
|---|---|---|
| Unobserved confounders | Matching cannot fix what it cannot see — estimates are biased | Sensitivity analysis (Rosenbaum bounds); combine with IV or DiD |
| Matching on post-treatment variables | Introduces collider bias | Match only on pre-treatment covariates |
| Poor common support | Comparisons are based on extrapolation | Trim, restrict sample, report trimmed and full results |
| Over-reliance on PSM | PSM can worsen balance (King & Nielsen, 2019) | Consider CEM, entropy balancing, or doubly robust methods |
| Wrong standard errors | Bootstrap is invalid for NN matching | Use Abadie-Imbens analytical SEs; or use IPW with robust SEs |
| Too many matching variables | Curse of dimensionality; poor matches | Use propensity score or dimension-reducing methods |
| Caliper too wide | Matches are poor quality | Tighten the caliper; report balance |
Matching on a Post-Treatment Variable (Collider Bias)
Match job training participants to non-participants using only pre-treatment covariates: age, education, prior earnings (re74, re75), race, marital status
ATT estimate = $1,794 (close to the NSW experimental benchmark of $1,800). Balance is good on all pre-treatment covariates.
Poor Common Support (Extrapolation)
Treated and control groups have substantial overlap in propensity score distributions
After trimming the 5% of treated units outside common support, ATT = $1,650 (SE = $520). 95% of treated units have well-matched controls. Results are stable to trimming thresholds.
Using Bootstrap Standard Errors with Nearest-Neighbor Matching
Use Abadie-Imbens analytical standard errors for nearest-neighbor matching
ATT = $1,794, Abadie-Imbens SE = $680, 95% CI: [$461, $3,127]. Coverage is correct.
After propensity score matching, you find that the standardized difference for pre-treatment earnings (re75) is 0.03, but the standardized difference for age is 0.22. What should you do?
Practice
A researcher matches treated and control units on propensity scores and finds excellent balance: all standardized differences are below 0.05. She concludes that the matching estimate is now as credible as an experimental estimate. What is wrong with this claim?
You estimate the ATT of a job training program using propensity score matching. After matching, you find that 40% of treated units are dropped because they fall outside the region of common support. What should you be most concerned about?
A researcher matches charter school students to public school students using pre-treatment test scores, parental income, and race. She also includes 'current school satisfaction' as a matching variable. What is the problem?
After nearest-neighbor propensity score matching (1:1 without replacement), a researcher uses bootstrap standard errors with 500 replications. A reviewer objects. Why?
Propensity Score Matching: Job Training and Earnings
A policy analyst wants to estimate the effect of a voluntary job training program on participants' earnings two years later. Participants self-selected into training, so treated and control workers may differ in age, education, and prior work history. She plans to use propensity score matching to create a comparable control group.
Read the analysis below carefully and identify the errors.
A health researcher studies whether a new surgical technique reduces recovery time. They use propensity score matching on patient age, BMI, and insurance type to compare 200 patients who received the new technique to 200 matched controls who received the standard procedure. They report:
"After matching, standardized differences are: age (0.04), BMI (0.06), insurance type (0.02). Balance is excellent. The matched estimate shows a 3.2-day reduction in recovery time (p = 0.001). We use bootstrap standard errors (500 replications). Since treated and control patients are balanced on all observables, the estimate is causal."
They do not report any sensitivity analysis.
Select all errors you can find:
Read the analysis below carefully and identify the errors.
An education researcher matches charter school students to traditional public school students using propensity scores estimated from a logistic regression. The propensity score model includes: parental income, parental education, student's 4th-grade test scores, and student race. After 1:1 nearest-neighbor matching:
"Standardized differences are below 0.10 for all variables. The matched charter school students score 0.25 standard deviations higher on 8th-grade math tests (p < 0.05)."
The researcher also reports that 150 of the 500 charter school students were dropped because they fell outside common support (propensity scores above 0.95).
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
A management study examines whether firms that adopt corporate social responsibility (CSR) practices have better financial performance. The authors use propensity score matching on firm size, industry, profitability, and leverage. They match CSR-adopting firms to non-adopting firms and find a positive effect of CSR on Tobin's Q. They conclude that CSR causes better firm performance.
Key Table
| Variable | Pre-match SD | Post-match SD |
|---|---|---|
| Firm size (log) | 0.45 | 0.04 |
| Industry | 0.12 | 0.02 |
| Profitability | 0.38 | 0.07 |
| Leverage | 0.22 | 0.09 |
| N (treated) | 500 | 480 |
| N (control) | 3,200 | 480 |
Authors' Identification Claim
After propensity score matching, treated and control firms are balanced on all observed characteristics. Therefore, the difference in Tobin's Q reflects the causal effect of CSR adoption.
Swap-In: When to Use Something Else
- Difference-in-differences: When a policy change creates a natural experiment with temporal variation — DiD does not require selection on observables alone.
- IV / 2SLS: When an instrument is available to address selection on unobservables that matching cannot handle.
- Doubly robust estimation: When you want robustness to misspecification of either the outcome model or the propensity score model — combines matching/weighting logic with regression adjustment.
- DML: When the covariate space is high-dimensional and linear specifications may miss important nonlinearities — DML uses machine learning for nuisance estimation with valid post-selection inference. Recent work by Rathje et al. (2024) demonstrates how machine learning methods can improve matching quality in management research.
- Entropy balancing: When exact moment-balance on covariates is desired without discarding observations — Hainmueller (2012) provides a reweighting approach that guarantees balance by construction.
Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (12)
Abadie, A., & Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects.
Abadie and Imbens derive the large-sample properties of nearest-neighbor matching estimators, showing that such estimators are not root-N consistent in general and do not attain the semiparametric efficiency bound. Their main practical contribution is a consistent analytical variance estimator that does not require nonparametric estimation of unknown functions. Bootstrap invalidity for matching is established separately in Abadie and Imbens (2008), and the bias-corrected matching estimator is developed in Abadie and Imbens (2011).
Abadie, A., & Imbens, G. W. (2008). On the Failure of the Bootstrap for Matching Estimators.
Abadie and Imbens show that the standard bootstrap is inconsistent for nearest-neighbor matching estimators with a fixed number of matches, even though these estimators are asymptotically normal. Researchers should use the analytical variance estimator from Abadie and Imbens (2006) instead of bootstrapping.
Abadie, A., & Imbens, G. W. (2011). Bias-Corrected Matching Estimators for Average Treatment Effects.
Abadie and Imbens develop bias-corrected matching estimators that adjust for the finite-sample bias inherent in nearest-neighbor matching when matching is not exact. Their bias correction uses a regression adjustment within matched pairs and has become a standard recommendation for applied researchers using matching methods.
Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of Multivalued Treatment Effects Under Conditional Independence.
Cattaneo, Drukker, and Holland extend matching and inverse probability weighting methods to settings with multi-valued (rather than binary) treatments, developing estimators for dose-response functions under conditional independence. Their accompanying Stata implementation made these methods readily accessible to applied researchers.
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.
Hainmueller introduces entropy balancing, a reweighting scheme that directly targets covariate balance by finding weights that satisfy pre-specified balance constraints while remaining as close to uniform as possible. Entropy balancing has become a popular alternative to propensity score matching because it achieves exact balance on specified moments by construction.
Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.
Heckman, Ichimura, and Todd develop the econometric theory behind matching estimators, including conditions for identification and the importance of common support. They apply these methods to evaluate job training programs and show when matching works well and when it does not.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.
Ho, Imai, King, and Stuart argue that matching should be used as a preprocessing step before parametric modeling, reducing model dependence and improving robustness of causal estimates. This influential paper reframed matching not as a standalone estimator but as a way to make subsequent parametric analyses less sensitive to specification choices.
Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching.
Iacus, King, and Porro introduce Coarsened Exact Matching (CEM), which coarsens covariates into bins and then performs exact matching within those bins. CEM avoids many pitfalls of propensity score matching, such as the need to check balance iteratively, and gives the researcher direct control over the matching quality.
LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data.
LaLonde compares econometric estimates of a job training program's effect with experimental benchmarks from a randomized trial, finding that non-experimental methods often failed to replicate the experimental results. This paper establishes the standard test bed for evaluating matching and other observational causal methods.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference.
Pearl provides a comprehensive treatment of causal inference using directed acyclic graphs, the do-calculus, and structural causal models. The book formalizes the rules for reading conditional independence from graphs and establishes when causal effects are identifiable from observational data. It is the foundational reference for any researcher using DAGs to reason about confounding, mediation, and causal identification.
Rosenbaum, P. R., & Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects.
Rosenbaum and Rubin introduce the propensity score as a dimension-reduction tool for matching, showing that conditioning on the scalar probability of treatment is sufficient to remove selection bias when the unconfoundedness assumption holds. This paper establishes the theoretical foundation for all propensity-score-based methods, including matching, stratification, and inverse probability weighting. The key practical insight is that matching on a single score avoids the curse of dimensionality that makes direct covariate matching infeasible with many confounders.
Smith, J. A., & Todd, P. E. (2005). Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?
Smith and Todd reexamine the Dehejia and Wahba (1999) reanalysis of LaLonde (1986), showing that the matching results are sensitive to specific sample and specification choices. They demonstrate that matching methods cannot solve fundamental problems when treated and comparison groups come from very different populations.
Application (4)
Azoulay, P., Graff Zivin, J. S., & Wang, J. (2010). Superstar Extinction.
Azoulay and coauthors exploit the premature and unexpected deaths of 112 academic superstars as a natural experiment, using coarsened exact matching to construct a control group of comparable collaborators. They find that the death of a superstar leads to a lasting 5-8% decline in the quality-adjusted publication rates of their collaborators, with spillovers circumscribed in idea space but less so in physical or social space. This study is an elegant application of a natural experiment combined with matching in the economics of science and innovation.
Azoulay, P., Stuart, T., & Wang, Y. (2014). Matthew: Effect or Fable?
Azoulay, Stuart, and Wang investigate whether mid-career recognition (Howard Hughes Medical Institute appointment) creates a cumulative advantage or 'Matthew effect' in science. They use coarsened exact matching to construct a comparison group of equally productive scientists, addressing the selection problem inherent in studying prestigious awards. The study finds a small, short-lived citation boost to papers published before HHMI appointment, suggesting a status or halo effect on pre-existing work rather than a sustained productivity advantage.
Dehejia, R. H., & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.
Dehejia and Wahba show that propensity score matching can replicate experimental estimates of a job training program using observational data, revisiting LaLonde's influential critique. The paper demonstrates the practical value of matching by showing that propensity score methods yield estimates much closer to the experimental benchmark than the nonexperimental estimators LaLonde had examined.
Imbens, G. W. (2015). Matching Methods in Practice: Three Examples.
Imbens demonstrates how to implement matching methods in practice through three detailed empirical examples, covering propensity score estimation, covariate balance assessment, overlap and trimming, and robustness to alternative estimators. This paper is an invaluable practical guide that bridges the gap between matching theory and applied research.
Survey (9)
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.
Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.
Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.
Imbens provides a comprehensive review of nonparametric methods for estimating average treatment effects under the unconfoundedness assumption, covering matching, weighting, and subclassification estimators. This survey unifies the theoretical foundations of matching methods and clarifies the connections between different estimators used in program evaluation.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
Imbens and Rubin provide a comprehensive textbook grounding causal inference in the potential outcomes framework, with detailed treatment of matching, propensity scores, and subclassification. They provide rigorous foundations for selection-on-observables methods.
King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching.
King and Nielsen argue that propensity score matching can increase imbalance, model dependence, and bias relative to other matching methods. This provocative paper has influenced a shift toward alternatives like CEM and Mahalanobis distance matching in applied research.
Rathje, J., Katila, R., & Reineke, P. (2024). Making the Most of AI and Machine Learning in Organizations and Strategy Research: Supervised Machine Learning, Causal Inference, and Matching Models.
Rathje, Katila, and Reineke review how supervised machine learning can support causal-inference workflows in strategy research, with emphasis on two-stage matching models for sample-selection problems. Using technology invention data, they demonstrate ML-based approaches to covariate selection and matching while discussing the broader potential and limits of ML in organizational research.
Rosenbaum, P. R. (2002). Observational Studies.
Rosenbaum provides the standard textbook on observational study design, covering matching, sensitivity analysis, and design principles for drawing causal inferences from non-experimental data. His framework for sensitivity analysis (Rosenbaum bounds) is the standard tool for assessing how much unobserved confounding would be needed to overturn a matching-based finding.
Shipman, J. E., Swanquist, Q. T., & Whited, R. L. (2017). Propensity Score Matching in Accounting Research.
Shipman, Swanquist, and Whited review how propensity score matching is used (and sometimes misused) in accounting research. They provide practical guidelines on common pitfalls such as matching on post-treatment variables, inadequate balance checks, and ignoring the unconfoundedness assumption.
Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward.
Stuart provides a comprehensive review of matching methods including propensity score matching, Mahalanobis distance matching, and coarsened exact matching, with practical guidance on implementation. She offers an accessible overview of when and how to use different matching approaches.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Wooldridge's graduate textbook is the standard reference for cross-section and panel data econometrics. Chapters 10-11 provide a thorough treatment of fixed effects, random effects, and related panel data methods, while later chapters cover general estimation methodology (MLE, GMM, M-estimation) with panel data applications throughout. The book covers both linear and nonlinear models with careful attention to assumptions.