Matching (PSM, CEM, NN, Weighting)
Reduces selection bias by comparing treated units to similar control units based on observed characteristics.
Quick Reference
- When to Use
- When selection into treatment depends only on observed covariates (selection on observables / conditional independence), and you want a transparent, nonparametric comparison of treated and control units.
- Key Assumption
- Conditional independence (unconfoundedness): conditional on observed pre-treatment covariates, treatment assignment is independent of potential outcomes. Also requires overlap (common support) — for every covariate profile, both treated and control units must exist.
- Common Mistake
- Matching on post-treatment variables (which introduces collider bias), or failing to assess and report balance after matching. Checking standardized mean differences is a standard diagnostic step.
- Estimated Time
- 3 hours
One-Line Implementation
Stata: `teffects psmatch (outcome) (treatment x1 x2), atet`
R: `matchit(treatment ~ x1 + x2, data = df, method = "nearest", distance = "glm")`
Python: `CausalModel(Y, D, X).est_via_matching()  # causalinference package`
Motivating Example: Evaluating a Job Training Program
In the 1970s, the National Supported Work (NSW) Demonstration randomly assigned disadvantaged workers to a job training program. The experimental data showed the program raised earnings by about $1,800 per year.
But what if the experiment had never been run? Could you recover the treatment effect using observational data alone? Dehejia and Wahba (1999) took the treated group from the NSW experiment and matched them to comparison groups drawn from large survey datasets (the CPS and PSID). Using propensity score matching on pre-treatment covariates (age, education, earnings history, race, marital status), they recovered estimates remarkably close to the experimental benchmark.
This paper (Dehejia & Wahba, 1999) became both a prominent advertisement and a cautionary tale for matching. It showed matching can work, but only when you have the right covariates and the comparison group overlaps well with the treated group.
LaLonde (1986) had previously shown that naive observational methods (including OLS) failed badly at recovering the experimental benchmark. Matching was an improvement, but not a magic bullet.
A. Overview: The Idea Behind Matching
The Core Problem
In observational studies, treated and control units differ systematically. Comparing their average outcomes confounds the treatment effect with selection bias. Matching addresses this by finding, for each treated unit, one or more control units that are "similar" on observed characteristics.
The Selection-on-Observables Assumption
All matching methods rest on the conditional independence assumption:

$(Y_i(0), Y_i(1)) \perp D_i \mid X_i$

This condition says: conditional on observed covariates $X_i$, treatment assignment $D_i$ is independent of the potential outcomes $(Y_i(0), Y_i(1))$. In words, once you condition on the right set of observables, treated and control units are comparable, as if treatment were randomly assigned within strata of $X_i$. Conditional independence is a stronger assumption than what OLS needs (OLS requires only a weaker mean-independence condition), but matching gives you nonparametric identification of the treatment effect without relying on a specific functional form.
This assumption is also called conditional independence, unconfoundedness, or ignorability.
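The assumption can be made concrete with a tiny simulation. The sketch below (plain NumPy; the DGP and all parameter values are invented for illustration) builds in selection on a single observed binary confounder, then recovers the true effect by comparing treated and control units within strata of X, which is exact matching in its simplest form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy DGP: binary confounder X drives both treatment take-up and the outcome
X = rng.binomial(1, 0.5, n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))      # selection on X
Y = 2.0 * D + 3.0 * X + rng.normal(0, 1, n)          # true effect = 2

naive = Y[D == 1].mean() - Y[D == 0].mean()          # confounded, biased upward

# Within each stratum of X, treatment is as-good-as-random (CIA by construction)
strata_effects = [Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
                  for x in (0, 1)]
# Weight strata by their share among the TREATED to target the ATT
weights = [((D == 1) & (X == x)).sum() / (D == 1).sum() for x in (0, 1)]
att = sum(w * e for w, e in zip(weights, strata_effects))
print(f"naive = {naive:.2f}, stratified ATT = {att:.2f}")
```

The naive contrast absorbs the effect of X (it comes out near 3.8 here), while the stratified comparison recovers roughly 2.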
The Four Main Matching Approaches
| Method | How It Matches | Key Feature |
|---|---|---|
| Propensity Score Matching (PSM) | Match on estimated probability of treatment: $e(X_i) = P(D_i = 1 \mid X_i)$ | Reduces many covariates to one dimension; the most widely used approach |
| Coarsened Exact Matching (CEM) | Coarsen covariates into bins, then exact-match within bins | Avoids propensity score estimation; transparent |
| Nearest-Neighbor (NN) | Match each treated unit to the closest control unit(s) in covariate space | Simple and intuitive; can match on Mahalanobis distance |
| Inverse Probability Weighting (IPW) | Weight observations by inverse of propensity score | Uses all observations (no dropping); semi-parametric |
Common Confusions
A frequent misconception is that matching guarantees better balance. Propensity score matching in particular can increase imbalance and model dependence relative to other matching methods (King & Nielsen, 2019).
B. Identification
Propensity Score Theorem
Rosenbaum and Rubin (1983) proved a remarkable result: if conditional independence holds given $X_i$, then it also holds given only the propensity score $e(X_i)$:

$(Y_i(0), Y_i(1)) \perp D_i \mid e(X_i)$

This theorem (Rosenbaum & Rubin, 1983) reduces the matching problem from a high-dimensional covariate space to a single dimension. Instead of finding units identical on age, education, income, race, etc., you match on a single number: the estimated probability of being treated.
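As a sketch of the first step, the propensity score logit can be fit by Newton-Raphson in a few lines of NumPy. The `fit_logit` helper and the simulated data are illustrative, not a library API; in practice you would use a standard logit routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
true_beta = np.array([0.8, -0.5])                    # assumed selection coefficients
p = 1 / (1 + np.exp(-(X @ true_beta)))
D = rng.binomial(1, p)

def fit_logit(X, D, iters=25):
    """Maximum-likelihood logit via Newton-Raphson (hypothetical helper)."""
    Xc = np.column_stack([np.ones(len(X)), X])       # add intercept
    b = np.zeros(Xc.shape[1])
    for _ in range(iters):
        e = 1 / (1 + np.exp(-(Xc @ b)))              # fitted propensity scores
        W = e * (1 - e)
        H = Xc.T @ (Xc * W[:, None])                 # Hessian of the log-likelihood
        b += np.linalg.solve(H, Xc.T @ (D - e))      # Newton step
    return b, 1 / (1 + np.exp(-(Xc @ b)))

beta_hat, e_hat = fit_logit(X, D)
```

`e_hat` is the single number each unit is matched on; `beta_hat` should recover the assumed coefficients up to sampling error.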
Common Support
In addition to conditional independence, you need common support (or overlap):

$0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i$

For every value of the covariates, there must be both treated and control units. If treated units have no comparable controls (e.g., all high-income individuals are treated), the treatment effect is not identified in that region.
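A common operationalization of the overlap check is the min-max rule: keep only units whose estimated score falls inside the range observed in both groups. A sketch with simulated scores (the Beta parameters are arbitrary stand-ins for estimated propensity scores):

```python
import numpy as np

rng = np.random.default_rng(2)
e_treated = rng.beta(4, 2, 500)     # treated scores skew high (by assumption)
e_control = rng.beta(2, 4, 500)     # control scores skew low

# Min-max rule: the common-support region is the intersection of the two ranges
lo = max(e_treated.min(), e_control.min())
hi = min(e_treated.max(), e_control.max())
keep_treated = (e_treated >= lo) & (e_treated <= hi)
keep_control = (e_control >= lo) & (e_control <= hi)
print(f"overlap region [{lo:.2f}, {hi:.2f}], "
      f"treated kept: {keep_treated.mean():.0%}, controls kept: {keep_control.mean():.0%}")
```

If this rule drops many units, the estimand silently changes to the effect within the region of overlap, which is worth reporting explicitly.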
The Estimand: ATT vs. ATE
- ATT ($E[Y_i(1) - Y_i(0) \mid D_i = 1]$): For each treated unit, find a matched control and compute the difference. Most common with matching.
- ATE ($E[Y_i(1) - Y_i(0)]$): Match in both directions: find controls for treated units AND treated units for controls. Requires stronger overlap.
C. Visual Intuition
Imagine a two-dimensional scatterplot. The horizontal axis is age, the vertical axis is income. Red dots are treated individuals; blue dots are controls. Without matching, the red and blue dots occupy different regions — treated units are younger and lower-income.
Matching finds, for each red dot, the nearest blue dot (or a set of nearby blue dots). After matching, the remaining comparison group (the matched blues) has the same age/income distribution as the treated group. The distribution of covariates is balanced.
With propensity score matching, you collapse both dimensions into one score, align treated and control units on that score, and compare. The balance is not on any single covariate but on the overall probability of treatment.
Propensity Score Overlap
As selection into treatment grows stronger, the propensity score distributions for treated and control units separate, reducing common support. The matched estimate improves over the naive estimate only when overlap is sufficient.
Computed Results
- Common Support (% of treated)
- 82.0
- Naive (unadjusted) Estimate
- 3.20
- Matched Estimate
- 2.15
Why Matching?
DGP: D depends on X via logistic selection (strength = 1.0); Y = 2.0·D + 2·X + ε. N = 200 (95 treated, 105 control). 42 matched pairs formed.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| Naive diff. in means | 6.730 | 0.430 | [5.89, 7.57] | +4.730 |
| OLS controlling for X | 2.130 | 0.139 | [1.86, 2.40] | +0.130 |
| NN matching | 2.338 | 0.190 | [1.97, 2.71] | +0.338 |
| True β | 2.000 | — | — | — |
Why the difference?
The naive difference in means is biased (+4.73) because treatment assignment depends on X (selection strength = 1.0). Treated units have systematically different X values, and since X directly affects Y, the simple comparison confounds the treatment effect with the effect of X. OLS controlling for X removes most of the bias (β̂ = 2.130) by linearly adjusting for the confounding. Nearest-neighbor matching achieves a similar correction (β̂ = 2.338) by comparing each treated unit only to a control unit with a similar X value—ensuring apples-to-apples comparisons without imposing a linear functional form. Note: some treated units could not be matched, indicating limited overlap (common support). Matching drops these units, while OLS extrapolates.
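Under the stated DGP (logistic selection on X with strength 1.0 and Y = 2·D + 2·X + ε), a minimal NumPy sketch reproduces the qualitative pattern in the table: the naive difference is badly biased, while 1:1 nearest-neighbor matching on X lands near the true effect of 2. The sample size and the use of matching with replacement are my own choices for the sketch, so the exact numbers differ from the table.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-1.0 * X))                 # logistic selection, strength 1.0
D = rng.binomial(1, e)
Y = 2.0 * D + 2.0 * X + rng.normal(size=n)     # true treatment effect = 2

naive = Y[D == 1].mean() - Y[D == 0].mean()    # confounded comparison

# 1:1 nearest-neighbor matching on X, with replacement
Xt, Yt = X[D == 1], Y[D == 1]
Xc, Yc = X[D == 0], Y[D == 0]
nn = np.abs(Xt[:, None] - Xc[None, :]).argmin(axis=1)  # closest control per treated
att = (Yt - Yc[nn]).mean()
print(f"naive = {naive:.2f}, matched ATT = {att:.2f}")
```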
Common Support Detective
Matching and weighting estimators require common support: for each treated unit, there should be comparable control units (and vice versa). Explore how selection strength affects the overlap of propensity score distributions and what happens to the ATT estimate when you trim to the region of common support.
Common Support
| Overlap region | [0.24, 0.88] |
| Units in support | 142 T + 138 C |
| Overlap fraction | 93% |
ATT Estimates
| Method | ATT |
|---|---|
| Naive (full sample) | 5.059 |
| Naive (common support) | 4.637 |
| True ATT | 3.445 |
| Change from trimming | 0.422 |
Good overlap. 93% of units are in common support. Trimming has a small effect on the estimate, suggesting the comparison is well-supported across the propensity score distribution.
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: For each treated unit, find one or more controls with similar propensity scores and take the average difference. This removes bias from observed confounders.
Propensity score estimation (first step):
Estimate $e(X_i) = P(D_i = 1 \mid X_i)$ using a logit:

$\hat{e}(X_i) = \frac{\exp(X_i'\hat{\gamma})}{1 + \exp(X_i'\hat{\gamma})}$
Matching (second step):
For each treated unit $i$ (with $D_i = 1$), define the matched set $\mathcal{M}(i)$ as the control unit(s) closest in propensity score:

$\mathcal{M}(i) = \{ j : D_j = 0,\ |\hat{e}(X_i) - \hat{e}(X_j)| \text{ minimal} \}$
ATT estimation:

$\hat{\tau}_{ATT} = \frac{1}{N_1} \sum_{i: D_i = 1} \left( Y_i - \frac{1}{|\mathcal{M}(i)|} \sum_{j \in \mathcal{M}(i)} Y_j \right)$
IPW alternative (for the ATT), in its normalized form:

$\hat{\tau}_{ATT}^{IPW} = \frac{1}{N_1} \sum_{i: D_i = 1} Y_i \;-\; \frac{\sum_{i: D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)} Y_i}{\sum_{i: D_i = 0} \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}}$
The IPW estimator reweights the control group to look like the treated group, using the propensity score. The weights tilt the control distribution toward the treated covariate profile.
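A minimal sketch of the normalized IPW-ATT estimator. The toy DGP is invented, and for simplicity it plugs in the true propensity score; in practice you would use the estimated $\hat{e}(X_i)$ from the first step.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                       # true score, known here by assumption
D = rng.binomial(1, e)
Y = 2.0 * D + 2.0 * X + rng.normal(size=n)     # true effect = 2

# ATT weights: treated units get weight 1; controls get e/(1-e),
# which tilts the control distribution toward the treated covariate profile
w_control = e[D == 0] / (1 - e[D == 0])
att_ipw = Y[D == 1].mean() - np.average(Y[D == 0], weights=w_control)
print(f"IPW ATT = {att_ipw:.2f}")
```

Note that every control observation is used (none are dropped), in contrast to nearest-neighbor matching.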
Standard errors: Because the propensity score is estimated, standard errors must account for this estimation uncertainty. Abadie and Imbens (2008) showed that the bootstrap is invalid for nearest-neighbor matching; use the analytical formula from Abadie and Imbens (2006) instead. For IPW, bootstrap or robust SEs work.
E. Implementation
library(MatchIt)
library(cobalt)
# Propensity score matching (nearest neighbor, 1:1)
m_out <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "nearest", distance = "logit",
replace = FALSE, ratio = 1)
# Check balance
summary(m_out)
love.plot(m_out, thresholds = c(m = .1)) # Standardized mean difference threshold
# Extract matched data and estimate treatment effect
m_data <- match.data(m_out)
library(lmtest)
library(sandwich)
fit <- lm(earnings ~ training, data = m_data, weights = weights)
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
# CEM
m_cem <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "cem")
summary(m_cem)
# IPW
library(WeightIt)
w_out <- weightit(training ~ age + education + re74 + re75 + black + hispanic,
data = df, method = "ps", estimand = "ATT")
bal.tab(w_out, thresholds = c(m = .1))

F. Diagnostics: Balance Is Everything
Standardized Differences
The most widely recommended metric for assessing balance. For each covariate:

$d = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{(s_1^2 + s_0^2)/2}}$

where $\bar{X}_1, \bar{X}_0$ are the treated and control means and $s_1^2, s_0^2$ the corresponding sample variances. A common rule of thumb flags $|d| > 0.1$ as meaningful imbalance.
Common Support Diagnostics
- Plot the distribution of propensity scores for treated and control groups. They should overlap substantially.
- Trim observations outside the region of common support (propensity scores below the minimum or above the maximum of the other group).
- If trimming drops many observations, your comparison is fragile and results may not generalize.
Variance Ratios
Beyond means, check that the variance of each covariate is similar across groups after matching. A variance ratio (treated/control) far from 1.0 indicates remaining imbalance in the tails of the distribution.
Distribution Checks
For critical covariates, compare the full distributions (not just means) using QQ plots or Kolmogorov-Smirnov tests.
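The three diagnostics above are straightforward to compute directly. A small Python sketch, where the function names and the example "age" data are hypothetical:

```python
import numpy as np

def smd(x_t, x_c):
    """Standardized mean difference with pooled SD."""
    s = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / s

def variance_ratio(x_t, x_c):
    """Treated/control variance ratio; values far from 1 flag tail imbalance."""
    return x_t.var(ddof=1) / x_c.var(ddof=1)

def ks_stat(x_t, x_c):
    """Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    grid = np.sort(np.concatenate([x_t, x_c]))
    F_t = np.searchsorted(np.sort(x_t), grid, side="right") / len(x_t)
    F_c = np.searchsorted(np.sort(x_c), grid, side="right") / len(x_c)
    return np.abs(F_t - F_c).max()

rng = np.random.default_rng(5)
age_t = rng.normal(30, 5, 400)       # hypothetical matched treated ages
age_c = rng.normal(31, 5, 400)       # hypothetical matched control ages
print(f"SMD = {smd(age_t, age_c):.2f}, "
      f"VR = {variance_ratio(age_t, age_c):.2f}, "
      f"KS = {ks_stat(age_t, age_c):.2f}")
```

In R, `cobalt::bal.tab()` reports the same quantities for a fitted `matchit` or `weightit` object.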
Interpreting Results
- Report either the ATT (effect on the treated) or the ATE, and be explicit about which one you estimate.
- Matching estimates are only as credible as the conditional independence assumption. If you cannot convincingly argue that you have captured all confounders, say so.
- Report both the matched and unmatched estimates. The difference tells you how much bias matching removes (at least for observed confounders).
- Consider sensitivity analysis (Rosenbaum bounds) to assess how much unobserved confounding would be needed to overturn your results.
G. What Can Go Wrong
| Problem | What It Does | How to Fix It |
|---|---|---|
| Unobserved confounders | Matching cannot fix what it cannot see — estimates are biased | Sensitivity analysis (Rosenbaum bounds); combine with IV or DiD |
| Matching on post-treatment variables | Introduces collider bias | Match only on pre-treatment covariates |
| Poor common support | Comparisons are based on extrapolation | Trim, restrict sample, report trimmed and full results |
| Over-reliance on PSM | PSM can worsen balance (King & Nielsen, 2019) | Consider CEM, entropy balancing, or doubly robust methods |
| Wrong standard errors | Bootstrap is invalid for NN matching | Use Abadie-Imbens analytical SEs; or use IPW with robust SEs |
| Too many matching variables | Curse of dimensionality; poor matches | Use propensity score or dimension-reducing methods |
| Caliper too wide | Matches are poor quality | Tighten the caliper; report balance |
Matching on a Post-Treatment Variable (Collider Bias)
- Fix: Match job training participants to non-participants using only pre-treatment covariates: age, education, prior earnings (re74, re75), race, marital status.
- Result: ATT estimate = $1,794 (close to the NSW experimental benchmark of $1,800). Balance is good on all pre-treatment covariates.
Poor Common Support (Extrapolation)
- Fix: Verify that treated and control groups have substantial overlap in propensity score distributions, and trim units outside common support.
- Result: After trimming the 5% of treated units outside common support, ATT = $1,650 (SE = $520). 95% of treated units have well-matched controls. Results are stable to trimming thresholds.
Using Bootstrap Standard Errors with Nearest-Neighbor Matching
- Fix: Use Abadie-Imbens analytical standard errors for nearest-neighbor matching.
- Result: ATT = $1,794, Abadie-Imbens SE = $680, 95% CI: [$461, $3,127]. Coverage is correct.
H. Practice
After propensity score matching, you find that the standardized difference for pre-treatment earnings (re75) is 0.03, but the standardized difference for age is 0.22. What should you do?
A researcher matches treated and control units on propensity scores and finds excellent balance: all standardized differences are below 0.05. She concludes that the matching estimate is now as credible as an experimental estimate. What is wrong with this claim?
You estimate the ATT of a job training program using propensity score matching. After matching, you find that 40% of treated units are dropped because they fall outside the region of common support. What should you be most concerned about?
A researcher matches charter school students to public school students using pre-treatment test scores, parental income, and race. She also includes 'current school satisfaction' as a matching variable. What is the problem?
After nearest-neighbor propensity score matching (1:1 without replacement), a researcher uses bootstrap standard errors with 500 replications. A reviewer objects. Why?
Propensity Score Matching: Job Training and Earnings
A policy analyst wants to estimate the effect of a voluntary job training program on participants' earnings two years later. Participants self-selected into training, so treated and control workers may differ in age, education, and prior work history. She plans to use propensity score matching to create a comparable control group.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
A management study examines whether firms that adopt corporate social responsibility (CSR) practices have better financial performance. The authors use propensity score matching on firm size, industry, profitability, and leverage. They match CSR-adopting firms to non-adopting firms and find a positive effect of CSR on Tobin's Q. They conclude that CSR causes better firm performance.
Key Table
| Variable | Pre-match SD | Post-match SD |
|---|---|---|
| Firm size (log) | 0.45 | 0.04 |
| Industry | 0.12 | 0.02 |
| Profitability | 0.38 | 0.07 |
| Leverage | 0.22 | 0.09 |
| N (treated) | 500 | 480 |
| N (control) | 3,200 | 480 |
Authors' Identification Claim
After propensity score matching, treated and control firms are balanced on all observed characteristics. Therefore, the difference in Tobin's Q reflects the causal effect of CSR adoption.
I. Swap-In: When to Use Something Else
- Difference-in-differences: When a policy change creates a natural experiment with temporal variation — DiD does not require selection on observables alone.
- IV / 2SLS: When an instrument is available to address selection on unobservables that matching cannot handle.
- Doubly robust estimation: When you want robustness to misspecification of either the outcome model or the propensity score model — combines matching/weighting logic with regression adjustment.
- Double/debiased machine learning (DML): When the covariate space is high-dimensional and linear specifications may miss important nonlinearities. DML uses machine learning for nuisance estimation with valid post-selection inference.
- Entropy balancing: When exact moment-balance on covariates is desired without discarding observations — Hainmueller (2012) provides a reweighting approach that guarantees balance by construction.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (9)
Rosenbaum, P. R., & Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects.
This paper introduced propensity score matching. Rosenbaum and Rubin showed that instead of matching on many covariates simultaneously, you can match on a single number—the propensity score (predicted probability of treatment)—and this is sufficient to remove selection bias under the assumption of no unobserved confounders.
Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.
Heckman, Ichimura, and Todd developed the econometric theory behind matching estimators, including conditions for identification and the importance of common support. They applied these methods to evaluate job training programs and showed when matching works well and when it does not.
Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching.
This paper introduced Coarsened Exact Matching (CEM), which coarsens covariates into bins and then performs exact matching within those bins. CEM avoids many pitfalls of propensity score matching, such as the need to check balance iteratively, and gives the researcher direct control over the matching quality.
Abadie, A., & Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects.
Abadie and Imbens derived the large-sample properties of nearest-neighbor matching estimators and showed that the standard bootstrap is not valid for inference with matching. They proposed a bias-corrected estimator and proper variance formula that have become standard in practice.
Abadie, A., & Imbens, G. W. (2011). Bias-Corrected Matching Estimators for Average Treatment Effects.
Abadie and Imbens developed bias-corrected matching estimators that adjust for the finite-sample bias inherent in nearest-neighbor matching when matching is not exact. Their bias correction uses a regression adjustment within matched pairs and has become a standard recommendation for applied researchers using matching methods.
Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of Multivalued Treatment Effects Under Conditional Independence.
Cattaneo, Drukker, and Holland extended matching and inverse probability weighting methods to settings with multi-valued (rather than binary) treatments, developing estimators for dose-response functions under conditional independence. Their accompanying Stata implementation made these methods readily accessible to applied researchers.
LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data.
LaLonde compared econometric estimates of a job training program's effect with experimental benchmarks from a randomized trial, finding that non-experimental methods often failed to replicate the experimental results. This paper established the standard test bed for evaluating matching and other observational causal methods.
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.
Hainmueller introduced entropy balancing, a reweighting scheme that directly targets covariate balance by finding weights that satisfy pre-specified balance constraints while remaining as close to uniform as possible. Entropy balancing has become a popular alternative to propensity score matching because it achieves exact balance on specified moments by construction.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.
Argues that matching should be used as a preprocessing step before parametric modeling, reducing model dependence and improving robustness of causal estimates. This influential paper reframed matching not as a standalone estimator but as a way to make subsequent parametric analyses less sensitive to specification choices.
Application (6)
Dehejia, R. H., & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.
Dehejia and Wahba showed that propensity score matching could replicate experimental estimates of a job training program using observational data. This influential paper demonstrated the practical value of matching and made propensity score methods mainstream in applied social science.
Villalonga, B., & Amit, R. (2006). How Do Family Ownership, Control and Management Affect Firm Value?.
This paper studied how different forms of family involvement in firms affect value, using matching and regression methods to compare family and non-family firms. It illustrates how matching can help address selection issues in corporate governance research.
Azoulay, P., Graff Zivin, J. S., & Wang, J. (2010). Superstar Extinction.
Azoulay and coauthors used propensity score matching to construct a control group of scientists who did not experience the unexpected death of a 'superstar' collaborator. They found that the death of a superstar leads to a lasting decline in the productivity of their collaborators. This study is an elegant application of matching in the economics of science and innovation.
Kaul, A., Klossner, S., Pfeifer, G., & Schieler, M. (2022). Standard Synthetic Control Methods: The Case of Using a False Predictor.
While focused on synthetic control (a form of matching for aggregate units), this paper highlights pitfalls when matching on pre-treatment outcomes and is relevant for understanding matching assumptions more broadly. [UNVERIFIED - publication year/details may differ from working paper version]
Arrfelt, M., Wiseman, R. M., & Hult, G. T. M. (2013). Looking Backward Instead of Forward: Aspiration-Driven Influences on the Efficiency of the Capital Allocation Process.
This paper used propensity score matching alongside other methods to study how performance relative to aspirations affects capital allocation in diversified firms. Published in AMJ, it is an example of how matching methods have been adopted in top management journals to address selection concerns.
Imbens, G. W. (2015). Matching Methods in Practice: Three Examples.
Imbens demonstrated how to implement matching methods in practice through three detailed empirical examples, covering propensity score estimation, covariate balance assessment, and sensitivity analysis. This paper is an invaluable practical guide that bridges the gap between matching theory and applied research.
Survey (6)
King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching.
King and Nielsen argue that propensity score matching can increase imbalance, model dependence, and bias relative to other matching methods. This provocative paper has influenced a shift toward alternatives like CEM and Mahalanobis distance matching in applied research.
Shipman, J. E., Swanquist, Q. T., & Whited, R. L. (2017). Propensity Score Matching in Accounting Research.
This paper reviews how propensity score matching has been used (and sometimes misused) in accounting research. It provides practical guidelines on common pitfalls such as matching on post-treatment variables, inadequate balance checks, and ignoring the unconfoundedness assumption.
Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.
Imbens provided a comprehensive review of nonparametric methods for estimating average treatment effects under the unconfoundedness assumption, covering matching, weighting, and subclassification estimators. This survey unified the theoretical foundations of matching methods and clarified the connections between different estimators used in program evaluation.
Rosenbaum, P. R. (2002). Observational Studies.
The definitive textbook on observational study design, covering matching, sensitivity analysis, and design principles for drawing causal inferences from non-experimental data. Rosenbaum's framework for sensitivity analysis (Rosenbaum bounds) is the standard tool for assessing how much unobserved confounding would be needed to overturn a matching-based finding.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
A comprehensive textbook grounding causal inference in the potential outcomes framework, with detailed treatment of matching, propensity scores, and subclassification. Provides rigorous foundations for selection-on-observables methods.
Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward.
A comprehensive review of matching methods including propensity score matching, Mahalanobis distance matching, and coarsened exact matching, with practical guidance on implementation. Provides an accessible overview of when and how to use different matching approaches.