MethodAtlas
Model-Based · Established

Matching (PSM, CEM, NN, Weighting)

Reduces selection bias by comparing treated units to similar control units based on observed characteristics.

Quick Reference

When to Use
When selection into treatment depends only on observed covariates (selection on observables / conditional independence), and you want a transparent, nonparametric comparison of treated and control units.
Key Assumption
Conditional independence (unconfoundedness): conditional on observed pre-treatment covariates, treatment assignment is independent of potential outcomes. Also requires overlap (common support) — for every covariate profile, both treated and control units must exist.
Common Mistake
Matching on post-treatment variables (which introduces collider bias), or failing to assess and report balance after matching. Checking standardized mean differences is a standard diagnostic step.
Estimated Time
3 hours

One-Line Implementation

Stata: teffects psmatch (outcome) (treatment x1 x2), atet
R: matchit(treatment ~ x1 + x2, data = df, method = 'nearest', distance = 'glm')
Python: CausalModel(Y, D, X).est_via_matching() # causalinference


Motivating Example: Evaluating a Job Training Program

In the 1970s, the National Supported Work (NSW) Demonstration randomly assigned disadvantaged workers to a job training program. The experimental data showed the program raised earnings by about $1,800 per year.

But what if the experiment had never been run? Could you recover the treatment effect using observational data alone? Dehejia and Wahba (1999) took the treated group from the NSW experiment and matched them to comparison groups drawn from large survey datasets (the CPS and PSID). Using propensity score matching on pre-treatment covariates (age, education, earnings history, race, marital status), they recovered estimates remarkably close to the experimental benchmark.

(Dehejia & Wahba, 1999)

This paper became both a prominent advertisement and a cautionary tale for matching. It showed matching can work — but only when you have the right covariates and the comparison group overlaps well with the treated group.

LaLonde (1986) had previously shown that naive observational methods (including OLS) failed badly at recovering the experimental benchmark. Matching was an improvement, but not a magic bullet.


A. Overview: The Idea Behind Matching

The Core Problem

In observational studies, treated and control units differ systematically. Comparing their average outcomes confounds the treatment effect with selection bias. Matching addresses this by finding, for each treated unit, one or more control units that are "similar" on observed characteristics.

The Selection-on-Observables Assumption

All matching methods rest on:

Y_i(0), Y_i(1) \perp\!\!\!\perp D_i \mid X_i

This condition says: conditional on observed covariates $X_i$, treatment assignment is independent of the potential outcomes. In words, once you condition on the right set of observables, treated and control units are comparable, as if treatment were randomly assigned within strata of $X$. Conditional independence is a stronger assumption than the mean-independence condition OLS requires, but matching gives you nonparametric identification of the treatment effect without relying on a specific functional form.

This assumption is also called conditional independence, unconfoundedness, or ignorability.
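To see what the assumption buys you, here is a minimal pure-Python sketch under an invented data-generating process (not from the text): selection into treatment depends only on a single observed binary covariate, so the naive comparison is biased while comparing within strata of the covariate recovers the true effect.

```python
import random
import statistics

random.seed(0)

# Toy DGP: X raises both treatment take-up and the outcome. True effect = 2.
TRUE_EFFECT = 2.0
units = []
for _ in range(20000):
    x = random.random() < 0.5            # observed binary covariate
    p_treat = 0.8 if x else 0.2          # selection on observables
    d = random.random() < p_treat
    y = TRUE_EFFECT * d + 3.0 * x + random.gauss(0, 1)
    units.append((x, d, y))

def diff_in_means(rows):
    treated = [y for _, d, y in rows if d]
    control = [y for _, d, y in rows if not d]
    return statistics.mean(treated) - statistics.mean(control)

naive = diff_in_means(units)             # confounded by X
within_x = statistics.mean(              # compare only within strata of X
    diff_in_means([r for r in units if r[0] == x]) for x in (True, False)
)
print(f"naive: {naive:.2f}  within-X: {within_x:.2f}  truth: {TRUE_EFFECT}")
```

Stratifying on a single discrete covariate like this is the simplest form of matching; the methods below generalize the idea to many covariates.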

The Four Main Matching Approaches

| Method | How It Matches | Key Feature |
|---|---|---|
| Propensity Score Matching (PSM) | Match on estimated probability of treatment: $e(X_i) = P(D_i = 1 \mid X_i)$ | Collapses many covariates into a single score |
| Coarsened Exact Matching (CEM) | Coarsen covariates into bins, then exact-match within bins | Avoids propensity score estimation; transparent |
| Nearest-Neighbor (NN) | Match each treated unit to the closest control unit(s) in covariate space | Simple and intuitive; can match on Mahalanobis distance |
| Inverse Probability Weighting (IPW) | Weight observations by inverse of propensity score | Uses all observations (no dropping); semi-parametric |

Common Confusions

A frequent confusion is that propensity score matching always improves balance. King and Nielsen (2019) show it can instead increase imbalance and model dependence relative to alternatives such as CEM or Mahalanobis distance matching.

B. Identification

Propensity Score Theorem

Rosenbaum and Rubin (1983) proved a remarkable result: if conditional independence holds given $X_i$, then it also holds given only the propensity score $e(X_i) = P(D_i = 1 \mid X_i)$:

Y_i(0), Y_i(1) \perp\!\!\!\perp D_i \mid e(X_i) \qquad \text{(Rosenbaum & Rubin, 1983)}

This theorem reduces the matching problem from high-dimensional covariate space to a single dimension. Instead of finding units identical on age, education, income, race, etc., you match on a single number — the estimated probability of being treated.
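A hedged illustration of this dimension reduction: in the sketch below the true propensity score is assumed known (in practice it must be estimated, e.g. by logit), and matching each treated unit to the control with the closest score recovers the treatment effect despite a confounding covariate. The data-generating process is invented for illustration only.

```python
import math
import random
import statistics

random.seed(1)

# Toy DGP: one confounder X; true propensity e(X) = logistic(X);
# outcome Y = 2*D + X + noise, so the true effect is 2.
def propensity(x):
    return 1.0 / (1.0 + math.exp(-x))

rows = []
for _ in range(4000):
    x = random.gauss(0, 1)
    d = random.random() < propensity(x)
    y = 2.0 * d + x + random.gauss(0, 0.5)
    rows.append((propensity(x), d, y))

treated = [(e, y) for e, d, y in rows if d]
controls = [(e, y) for e, d, y in rows if not d]

def nearest_control(e):
    # Match on the single score, not on X itself (O(n) scan; fine for a sketch).
    return min(controls, key=lambda c: abs(c[0] - e))

att = statistics.mean(y - nearest_control(e)[1] for e, y in treated)
print(f"matching on the score alone: ATT estimate {att:.2f} (truth 2.0)")
```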

Common Support

In addition to conditional independence, you need common support (or overlap):

0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i

For every value of the covariates, there must be both treated and control units. If treated units have no comparable controls (e.g., all high-income individuals are treated), the treatment effect is not identified in that region.

The Estimand: ATT vs. ATE

  • ATT (average treatment effect on the treated): For each treated unit, find a matched control and compute the difference. Most common with matching.
  • ATE (average treatment effect): Match in both directions: find controls for treated units AND treated units for controls. Requires stronger overlap.

C. Visual Intuition

Imagine a two-dimensional scatterplot. The horizontal axis is age, the vertical axis is income. Red dots are treated individuals; blue dots are controls. Without matching, the red and blue dots occupy different regions — treated units are younger and lower-income.

Matching finds, for each red dot, the nearest blue dot (or a set of nearby blue dots). After matching, the remaining comparison group (the matched blues) has the same age/income distribution as the treated group. The distribution of covariates is balanced.

With propensity score matching, you collapse both dimensions into one score, align treated and control units on that score, and compare. The balance is not on any single covariate but on the overall probability of treatment.

Interactive Simulation

Propensity Score Overlap

As selection into treatment grows stronger, the propensity score distributions for treated and control units separate, reducing common support. The matched estimate improves over the naive estimate only when overlap is sufficient.

[Chart: estimated effect ($) vs. selection strength (0.0 to 1.0), with reference lines for the true effect and the naive estimate; bias shown at $1,600]

Computed Results

  • Common support: 82.0% of treated units
  • Naive (unadjusted) estimate: 3.20
  • Matched estimate: 2.15
Interactive Simulation

Why Matching?

DGP: D depends on X via logistic selection (strength = 1.0); Y = 2.0·D + 2·X + ε. N = 200 (95 treated, 105 control). 42 matched pairs formed.

[Scatterplot: covariate X (horizontal) vs. outcome Y (vertical), marking treated units, control units, and matched pairs]

Estimation Results

| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| Naive diff. in means | 6.730 | 0.430 | [5.89, 7.57] | +4.730 |
| OLS controlling for X | 2.130 | 0.139 | [1.86, 2.40] | +0.130 |
| NN matching | 2.338 | 0.190 | [1.97, 2.71] | +0.338 |
| True β | 2.000 | | | |

Why the difference?

The naive difference in means is biased (+4.73) because treatment assignment depends on X (selection strength = 1.0). Treated units have systematically different X values, and since X directly affects Y, the simple comparison confounds the treatment effect with the effect of X. OLS controlling for X removes most of the bias (β̂ = 2.130) by linearly adjusting for the confounding. Nearest-neighbor matching achieves a similar correction (β̂ = 2.338) by comparing each treated unit only to a control unit with a similar X value—ensuring apples-to-apples comparisons without imposing a linear functional form. Note: some treated units could not be matched, indicating limited overlap (common support). Matching drops these units, while OLS extrapolates.

Interactive Simulation

Common Support Detective

Matching and weighting estimators require common support: for each treated unit, there should be comparable control units (and vice versa). Explore how selection strength affects the overlap of propensity score distributions and what happens to the ATT estimate when you trim to the region of common support.

[Density plot: propensity score distributions for treated and control units, with the overlap region shaded]
Parameters: selection strength = 0.8 (0 = random, 2 = strong selection); 150 treated units; 150 control units.

Common Support

  • Overlap region: [0.24, 0.88]
  • Units in support: 142 treated + 138 control
  • Overlap fraction: 93%

ATT Estimates

| Method | ATT |
|---|---|
| Naive (full sample) | 5.059 |
| Naive (common support) | 4.637 |
| True ATT | 3.445 |
| Change from trimming | 0.422 |

Good overlap. 93% of units are in common support. Trimming has a small effect on the estimate, suggesting the comparison is well-supported across the propensity score distribution.


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: For each treated unit, find one or more controls with similar propensity scores and take the average difference. This removes bias from observed confounders.

Propensity score estimation (first step):

Estimate $\hat{e}(X_i) = P(D_i = 1 \mid X_i)$ using logit:

\hat{e}(X_i) = \Lambda(X_i'\hat{\gamma})

Matching (second step):

For each treated unit $i$ with $D_i = 1$, define the matched set $\mathcal{M}(i)$ as the control unit(s) closest in propensity score:

\mathcal{M}(i) = \arg\min_{j: D_j = 0} |\hat{e}(X_i) - \hat{e}(X_j)|

ATT estimation:

\hat{\tau}_{ATT} = \frac{1}{N_1} \sum_{i: D_i = 1} \left[ Y_i - \frac{1}{|\mathcal{M}(i)|} \sum_{j \in \mathcal{M}(i)} Y_j \right]
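Traced on a tiny made-up dataset (step 1 is assumed already done, so the propensity scores below are invented fitted values), the estimator reads as follows; `matched_set` plays the role of the matched set M(i) and `att` averages the matched-pair gaps:

```python
# (e_hat, Y) pairs; scores and outcomes are made up for illustration.
treated  = [(0.61, 10.0), (0.42, 8.5), (0.75, 12.0)]             # D = 1
controls = [(0.60, 7.5), (0.40, 6.0), (0.70, 9.0), (0.20, 5.0)]  # D = 0

def matched_set(e, pool, k=1):
    """M(i): the k controls closest to e in estimated propensity score."""
    return sorted(pool, key=lambda c: abs(c[0] - e))[:k]

def att(treated, controls, k=1):
    """Average of Y_i minus the mean outcome of unit i's matched set."""
    gaps = []
    for e, y in treated:
        m = matched_set(e, controls, k)
        gaps.append(y - sum(yc for _, yc in m) / len(m))
    return sum(gaps) / len(gaps)

print(att(treated, controls, k=1))  # mean of the pair gaps 2.5, 2.5, 3.0
```

Setting `k > 1` averages each treated unit's outcome against several nearby controls, trading a little bias (worse matches) for lower variance.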

IPW alternative (for ATT):

\hat{\tau}_{ATT,IPW} = \frac{1}{N_1} \sum_{i: D_i=1} Y_i \;-\; \frac{\sum_{i: D_i=0} \frac{\hat{e}(X_i)}{1-\hat{e}(X_i)} Y_i}{\sum_{i: D_i=0} \frac{\hat{e}(X_i)}{1-\hat{e}(X_i)}}

The IPW estimator reweights the control group to look like the treated group, using the propensity score. The weights $\hat{e}(X)/(1-\hat{e}(X))$ tilt the control distribution toward the treated covariate profile.
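The IPW-for-ATT formula can be computed by hand on a handful of made-up observations (the fitted scores below are invented, not estimated):

```python
# (e_hat, D, Y) triples; values are made up for illustration.
data = [
    (0.8, 1, 12.0), (0.6, 1, 10.0),               # treated
    (0.7, 0, 8.0), (0.3, 0, 6.0), (0.2, 0, 5.5),  # controls
]

treated_outcomes = [y for _, d, y in data if d]
treated_mean = sum(treated_outcomes) / len(treated_outcomes)

# Each control gets weight e/(1-e): the odds of treatment at its covariates,
# so controls that "look treated" count more.
weights = [(e / (1 - e), y) for e, d, y in data if not d]
control_reweighted = sum(w * y for w, y in weights) / sum(w for w, _ in weights)

att_ipw = treated_mean - control_reweighted
print(f"IPW ATT estimate: {att_ipw:.3f}")
```

Note that the control with score 0.7 dominates the reweighted mean (weight 7/3), which is exactly the "tilt toward the treated profile" described above.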

Standard errors: Because the propensity score is estimated, standard errors must account for this estimation uncertainty. Abadie and Imbens (2008) showed that the bootstrap is invalid for nearest-neighbor matching; use the analytical formula from Abadie and Imbens (2006) instead. For IPW, bootstrap or robust SEs work.


E. Implementation

library(MatchIt)
library(cobalt)

# Propensity score matching (nearest neighbor, 1:1)
m_out <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
                 data = df, method = "nearest", distance = "glm",  # logistic PS
                 replace = FALSE, ratio = 1)

# Check balance
summary(m_out)
love.plot(m_out, thresholds = c(m = .1))  # Standardized mean difference threshold

# Extract matched data and estimate treatment effect
m_data <- match.data(m_out)
library(lmtest)
library(sandwich)
fit <- lm(earnings ~ training, data = m_data, weights = weights)
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))

# CEM
m_cem <- matchit(training ~ age + education + re74 + re75 + black + hispanic,
               data = df, method = "cem")
summary(m_cem)

# IPW
library(WeightIt)
w_out <- weightit(training ~ age + education + re74 + re75 + black + hispanic,
                  data = df, method = "glm", estimand = "ATT")
bal.tab(w_out, thresholds = c(m = .1))

F. Diagnostics: Balance Is Everything

Standardized Differences

The most widely recommended metric for assessing balance; a common rule of thumb flags absolute standardized differences above 0.1. For each covariate:

\text{SD}_j = \frac{\bar{X}_{j,\text{treated}} - \bar{X}_{j,\text{control}}}{\sqrt{(s^2_{j,\text{treated}} + s^2_{j,\text{control}})/2}}
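The formula is a one-liner in code. A sketch with hypothetical (made-up) ages before and after matching:

```python
import statistics

def standardized_diff(x_treated, x_control):
    """SD_j: mean gap divided by the pooled standard deviation."""
    pooled_var = (statistics.variance(x_treated)
                  + statistics.variance(x_control)) / 2
    return (statistics.mean(x_treated)
            - statistics.mean(x_control)) / pooled_var ** 0.5

age_treated       = [25, 30, 35, 40, 45]
age_ctrl_unmatched = [35, 42, 50, 55, 60]   # much older than treated
age_ctrl_matched   = [26, 31, 34, 41, 44]   # comparable to treated

print(standardized_diff(age_treated, age_ctrl_unmatched))  # large imbalance
print(standardized_diff(age_treated, age_ctrl_matched))    # near zero
```

Because the denominator is a pooled SD rather than a standard error, the metric does not shrink mechanically with sample size, which is why it is preferred over t-tests for balance checking.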

Common Support Diagnostics

  • Plot the distribution of propensity scores for treated and control groups. They should overlap substantially.
  • Trim observations outside the region of common support (propensity scores below the minimum or above the maximum of the other group).
  • If trimming drops many observations, your comparison is fragile and results may not generalize.
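One common trimming rule (a sketch under the assumption that propensity scores have already been estimated; the scores below are made up) keeps only units inside the interval from the larger of the two group minima to the smaller of the two group maxima:

```python
# Made-up estimated propensity scores for each group.
e_treated = [0.35, 0.55, 0.70, 0.92, 0.97]
e_control = [0.05, 0.15, 0.40, 0.60, 0.75]

lo = max(min(e_treated), min(e_control))   # lowest score with support in both groups
hi = min(max(e_treated), max(e_control))   # highest score with support in both groups

in_support = [e for e in e_treated if lo <= e <= hi]
dropped = len(e_treated) - len(in_support)
print(f"common support [{lo}, {hi}]; dropped {dropped}/{len(e_treated)} treated units")
```

Here the two highest-score treated units (0.92 and 0.97) have no comparable controls and are trimmed; if a large share of treated units drops out like this, the ATT is only identified for the remaining subpopulation.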

Variance Ratios

Beyond means, check that the variance of each covariate is similar across groups after matching. A variance ratio (treated/control) far from 1.0 indicates remaining imbalance in the tails of the distribution.

Distribution Checks

For critical covariates, compare the full distributions (not just means) using QQ plots or Kolmogorov-Smirnov tests.
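The two-sample Kolmogorov-Smirnov statistic is just the largest gap between the two empirical CDFs, which is easy to sketch directly (toy samples below are made up; a real analysis would also need the reference distribution for the p-value):

```python
def ks_stat(a, b):
    """Two-sample KS statistic: the largest gap between empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    # The maximum gap occurs at an observed data point, so scan those.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

identical = ks_stat([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
disjoint = ks_stat([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(identical, disjoint)  # 0.0 for identical samples, 1.0 for disjoint ones
```

Unlike a mean comparison, this statistic picks up imbalance anywhere in the distribution, including the tails.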


Interpreting Results

  • Report whether you estimate the ATT (effect on the treated) or the ATE, and be explicit about which.
  • Matching estimates are only as credible as the conditional independence assumption. If you cannot convincingly argue that you have captured all confounders, say so.
  • Report both the matched and unmatched estimates. The difference tells you how much bias matching removes (at least for observed confounders).
  • Consider sensitivity analysis (Rosenbaum bounds) to assess how much unobserved confounding would be needed to overturn your results.
(Rosenbaum, 2002)

G. What Can Go Wrong

| Problem | What It Does | How to Fix It |
|---|---|---|
| Unobserved confounders | Matching cannot fix what it cannot see; estimates are biased | Sensitivity analysis (Rosenbaum bounds); combine with IV or DiD |
| Matching on post-treatment variables | Introduces collider bias | Match only on pre-treatment covariates |
| Poor common support | Comparisons are based on extrapolation | Trim, restrict sample, report trimmed and full results |
| Over-reliance on PSM | PSM can worsen balance (King & Nielsen, 2019) | Consider CEM, entropy balancing, or doubly robust methods |
| Wrong standard errors | Bootstrap is invalid for NN matching | Use Abadie-Imbens analytical SEs; or use IPW with robust SEs |
| Too many matching variables | Curse of dimensionality; poor matches | Use propensity score or dimension-reducing methods |
| Caliper too wide | Matches are poor quality | Tighten the caliper; report balance |
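The caliper logic can be sketched in a few lines; a treated unit whose nearest control is farther than the caliper in propensity score is dropped rather than matched badly (the scores and the 0.1 caliper below are made-up numbers):

```python
# Made-up estimated propensity scores.
control_scores = [0.20, 0.35, 0.50]
treated_scores = [0.33, 0.52, 0.90]

def match_with_caliper(treated, controls, caliper):
    pairs, unmatched = [], []
    for e in treated:
        best = min(controls, key=lambda c: abs(c - e))
        if abs(best - e) <= caliper:
            pairs.append((e, best))
        else:
            unmatched.append(e)  # no control close enough: drop the unit
    return pairs, unmatched

pairs, unmatched = match_with_caliper(treated_scores, control_scores, caliper=0.1)
print(len(pairs), unmatched)  # 2 [0.9]
```

A tighter caliper improves match quality but drops more treated units, shrinking common support: the same bias-vs-coverage trade-off flagged in the table above.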
Assumption Failure Demo

Matching on a Post-Treatment Variable (Collider Bias)

Match job training participants to non-participants using only pre-treatment covariates: age, education, prior earnings (re74, re75), race, marital status

ATT estimate = $1,794 (close to the NSW experimental benchmark of $1,800). Balance is good on all pre-treatment covariates.

Assumption Failure Demo

Poor Common Support (Extrapolation)

Treated and control groups have substantial overlap in propensity score distributions

After trimming the 5% of treated units outside common support, ATT = $1,650 (SE = $520). 95% of treated units have well-matched controls. Results are stable to trimming thresholds.

Assumption Failure Demo

Using Bootstrap Standard Errors with Nearest-Neighbor Matching

Use Abadie-Imbens analytical standard errors for nearest-neighbor matching

ATT = $1,794, Abadie-Imbens SE = $680, 95% CI: [$461, $3,127]. Coverage is correct.

Concept Check

After propensity score matching, you find that the standardized difference for pre-treatment earnings (re75) is 0.03, but the standardized difference for age is 0.22. What should you do?


H. Practice

Concept Check

A researcher matches treated and control units on propensity scores and finds excellent balance: all standardized differences are below 0.05. She concludes that the matching estimate is now as credible as an experimental estimate. What is wrong with this claim?

Concept Check

You estimate the ATT of a job training program using propensity score matching. After matching, you find that 40% of treated units are dropped because they fall outside the region of common support. What should you be most concerned about?

Concept Check

A researcher matches charter school students to public school students using pre-treatment test scores, parental income, and race. She also includes 'current school satisfaction' as a matching variable. What is the problem?

Concept Check

After nearest-neighbor propensity score matching (1:1 without replacement), a researcher uses bootstrap standard errors with 500 replications. A reviewer objects. Why?

Guided Exercise

Propensity Score Matching: Job Training and Earnings

A policy analyst wants to estimate the effect of a voluntary job training program on participants' earnings two years later. Participants self-selected into training, so treated and control workers may differ in age, education, and prior work history. She plans to use propensity score matching to create a comparable control group.

What does the propensity score summarize?

What is the common support condition, and why does it matter?

After matching, what should the analyst check to verify the matching worked?

Even with perfect balance on observed covariates, why might the matching estimate still be biased?

Error Detective

Read the analysis below carefully and identify the errors.

A health researcher studies whether a new surgical technique reduces recovery time. They use propensity score matching on patient age, BMI, and insurance type to compare 200 patients who received the new technique to 200 matched controls who received the standard procedure. They report: "After matching, standardized differences are: age (0.04), BMI (0.06), insurance type (0.02). Balance is excellent. The matched estimate shows a 3.2-day reduction in recovery time (p = 0.001). We use bootstrap standard errors (500 replications). Since treated and control patients are balanced on all observables, the estimate is causal." They do not report any sensitivity analysis.

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

An education researcher matches charter school students to traditional public school students using propensity scores estimated from a logistic regression. The propensity score model includes: parental income, parental education, student's 4th-grade test scores, and student race. After 1:1 nearest-neighbor matching: "Standardized differences are below 0.10 for all variables. The matched charter school students score 0.25 standard deviations higher on 8th-grade math tests (p < 0.05)." The researcher also reports that 150 of the 500 charter school students were dropped because they fell outside common support (propensity scores above 0.95).

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

A management study examines whether firms that adopt corporate social responsibility (CSR) practices have better financial performance. The authors use propensity score matching on firm size, industry, profitability, and leverage. They match CSR-adopting firms to non-adopting firms and find a positive effect of CSR on Tobin's Q. They conclude that CSR causes better firm performance.

Key Table

| Variable | Pre-match SD | Post-match SD |
|---|---|---|
| Firm size (log) | 0.45 | 0.04 |
| Industry | 0.12 | 0.02 |
| Profitability | 0.38 | 0.07 |
| Leverage | 0.22 | 0.09 |
| N (treated) | 500 | 480 |
| N (control) | 3,200 | 480 |

Authors' Identification Claim

After propensity score matching, treated and control firms are balanced on all observed characteristics. Therefore, the difference in Tobin's Q reflects the causal effect of CSR adoption.


I. Swap-In: When to Use Something Else

  • Difference-in-differences: When a policy change creates a natural experiment with temporal variation — DiD does not require selection on observables alone.
  • IV / 2SLS: When an instrument is available to address selection on unobservables that matching cannot handle.
  • Doubly robust estimation: When you want robustness to misspecification of either the outcome model or the propensity score model — combines matching/weighting logic with regression adjustment.
  • DML: When the covariate space is high-dimensional and linear specifications may miss important nonlinearities — DML uses machine learning for nuisance estimation with valid post-selection inference.
  • Entropy balancing: When exact moment-balance on covariates is desired without discarding observations — Hainmueller (2012) provides a reweighting approach that guarantees balance by construction.

J. Reviewer Checklist

Critical Reading Checklist



Paper Library

Foundational (9)

Rosenbaum, P. R., & Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects.

This paper introduced propensity score matching. Rosenbaum and Rubin showed that instead of matching on many covariates simultaneously, you can match on a single number—the propensity score (predicted probability of treatment)—and this is sufficient to remove selection bias under the assumption of no unobserved confounders.

Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.

Review of Economic Studies. DOI: 10.2307/2971733

Heckman, Ichimura, and Todd developed the econometric theory behind matching estimators, including conditions for identification and the importance of common support. They applied these methods to evaluate job training programs and showed when matching works well and when it does not.

Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching.

Political Analysis. DOI: 10.1093/pan/mpr013

This paper introduced Coarsened Exact Matching (CEM), which coarsens covariates into bins and then performs exact matching within those bins. CEM avoids many pitfalls of propensity score matching, such as the need to check balance iteratively, and gives the researcher direct control over the matching quality.

Abadie, A., & Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects.

Abadie and Imbens derived the large-sample properties of nearest-neighbor matching estimators and showed that the standard bootstrap is not valid for inference with matching. They proposed a bias-corrected estimator and proper variance formula that have become standard in practice.

Abadie, A., & Imbens, G. W. (2011). Bias-Corrected Matching Estimators for Average Treatment Effects.

Journal of Business & Economic Statistics. DOI: 10.1198/jbes.2009.07333

Abadie and Imbens developed bias-corrected matching estimators that adjust for the finite-sample bias inherent in nearest-neighbor matching when matching is not exact. Their bias correction uses a regression adjustment within matched pairs and has become a standard recommendation for applied researchers using matching methods.

Cattaneo, M. D., Drukker, D. M., & Holland, A. D. (2013). Estimation of Multivalued Treatment Effects Under Conditional Independence.

Cattaneo, Drukker, and Holland extended matching and inverse probability weighting methods to settings with multi-valued (rather than binary) treatments, developing estimators for dose-response functions under conditional independence. Their accompanying Stata implementation made these methods readily accessible to applied researchers.

LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data.

American Economic Review

LaLonde compared econometric estimates of a job training program's effect with experimental benchmarks from a randomized trial, finding that non-experimental methods often failed to replicate the experimental results. This paper established the standard test bed for evaluating matching and other observational causal methods.

Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.

Political Analysis. DOI: 10.1093/pan/mpr025

Hainmueller introduced entropy balancing, a reweighting scheme that directly targets covariate balance by finding weights that satisfy pre-specified balance constraints while remaining as close to uniform as possible. Entropy balancing has become a popular alternative to propensity score matching because it achieves exact balance on specified moments by construction.

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.

Political Analysis. DOI: 10.1093/pan/mpl013

Argues that matching should be used as a preprocessing step before parametric modeling, reducing model dependence and improving robustness of causal estimates. This influential paper reframed matching not as a standalone estimator but as a way to make subsequent parametric analyses less sensitive to specification choices.

Application (6)

Dehejia, R. H., & Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.

Journal of the American Statistical Association. DOI: 10.1080/01621459.1999.10473858

Dehejia and Wahba showed that propensity score matching could replicate experimental estimates of a job training program using observational data. This influential paper demonstrated the practical value of matching and made propensity score methods mainstream in applied social science.

Villalonga, B., & Amit, R. (2006). How Do Family Ownership, Control and Management Affect Firm Value?

Journal of Financial Economics. DOI: 10.1016/j.jfineco.2004.12.005

This paper studied how different forms of family involvement in firms affect value, using matching and regression methods to compare family and non-family firms. It illustrates how matching can help address selection issues in corporate governance research.

Azoulay, P., Graff Zivin, J. S., & Wang, J. (2010). Superstar Extinction.

Quarterly Journal of Economics. DOI: 10.1162/qjec.2010.125.2.549

Azoulay and coauthors used propensity score matching to construct a control group of scientists who did not experience the unexpected death of a 'superstar' collaborator. They found that the death of a superstar leads to a lasting decline in the productivity of their collaborators. This study is an elegant application of matching in the economics of science and innovation.

Kaul, A., Klossner, S., Pfeifer, G., & Schieler, M. (2022). Standard Synthetic Control Methods: The Case of Using a False Predictor.

Journal of Business & Economic Statistics. DOI: 10.1080/07350015.2021.1930012

While focused on synthetic control (a form of matching for aggregate units), this paper highlights pitfalls when matching on pre-treatment outcomes and is relevant for understanding matching assumptions more broadly. [UNVERIFIED - publication year/details may differ from working paper version]

Arrfelt, M., Wiseman, R. M., & Hult, G. T. M. (2013). Looking Backward Instead of Forward: Aspiration-Driven Influences on the Efficiency of the Capital Allocation Process.

Academy of Management Journal. DOI: 10.5465/amj.2010.0879

This paper used propensity score matching alongside other methods to study how performance relative to aspirations affects capital allocation in diversified firms. Published in AMJ, it is an example of how matching methods have been adopted in top management journals to address selection concerns.

Imbens, G. W. (2015). Matching Methods in Practice: Three Examples.

Journal of Human Resources. DOI: 10.3368/jhr.50.2.373

Imbens demonstrated how to implement matching methods in practice through three detailed empirical examples, covering propensity score estimation, covariate balance assessment, and sensitivity analysis. This paper is an invaluable practical guide that bridges the gap between matching theory and applied research.

Survey (6)

King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching.

Political Analysis. DOI: 10.1017/pan.2019.11

King and Nielsen argue that propensity score matching can increase imbalance, model dependence, and bias relative to other matching methods. This provocative paper has influenced a shift toward alternatives like CEM and Mahalanobis distance matching in applied research.

Shipman, J. E., Swanquist, Q. T., & Whited, R. L. (2017). Propensity Score Matching in Accounting Research.

The Accounting Review. DOI: 10.2308/accr-51449

This paper reviews how propensity score matching has been used (and sometimes misused) in accounting research. It provides practical guidelines on common pitfalls such as matching on post-treatment variables, inadequate balance checks, and ignoring the unconfoundedness assumption.

Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.

Review of Economics and Statistics. DOI: 10.1162/003465304323023651

Imbens provided a comprehensive review of nonparametric methods for estimating average treatment effects under the unconfoundedness assumption, covering matching, weighting, and subclassification estimators. This survey unified the theoretical foundations of matching methods and clarified the connections between different estimators used in program evaluation.

Rosenbaum, P. R. (2002). Observational Studies.

The definitive textbook on observational study design, covering matching, sensitivity analysis, and design principles for drawing causal inferences from non-experimental data. Rosenbaum's framework for sensitivity analysis (Rosenbaum bounds) is the standard tool for assessing how much unobserved confounding would be needed to overturn a matching-based finding.

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.

Cambridge University Press

A comprehensive textbook grounding causal inference in the potential outcomes framework, with detailed treatment of matching, propensity scores, and subclassification. Provides rigorous foundations for selection-on-observables methods.

Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward.

Statistical Science. DOI: 10.1214/09-STS313

A comprehensive review of matching methods including propensity score matching, Mahalanobis distance matching, and coarsened exact matching, with practical guidance on implementation. Provides an accessible overview of when and how to use different matching approaches.

Tags

design-based · selection-on-observables · cross-sectional