MethodAtlas · Replication Lab · 120 minutes

Lab: Replicating Dehejia-Wahba (1999) Propensity Score Matching

Replicate the classic Dehejia-Wahba (1999) propensity score matching analysis of the National Supported Work (NSW) program. Compare the experimental benchmark with observational estimates from nearest-neighbor and caliper matching on the propensity score and from coarsened exact matching.

Overview

In this lab you will replicate one of the most influential studies in the causal inference literature. Lalonde (1986) showed that non-experimental estimators often fail to recover the experimental benchmark estimate of a job training program. Dehejia and Wahba (1999) demonstrated that propensity score matching can recover this benchmark from observational data. You will reproduce their key findings and extend the analysis with modern matching methods.

What you will learn:

  • How to load and work with the classic Lalonde/NSW dataset
  • How the experimental benchmark provides a ground truth for evaluating observational methods
  • How to implement propensity score matching, nearest-neighbor matching, and CEM
  • How to assess covariate balance before and after matching
  • Why some observational comparison groups are harder to match than others

Prerequisites: Familiarity with the potential outcomes framework and propensity scores. Completion of the matching tutorial lab is recommended.


Step 1: Load the NSW Experimental Data

The NSW dataset contains earnings data for randomly assigned treatment and control groups from the National Supported Work program.

library(MatchIt)
library(cobalt)
library(haven)  # for read_dta()

# NOTE: MatchIt's built-in 'lalonde' pairs the NSW treated units with a
# PSID comparison group; it is NOT the randomized experimental sample.
# Load the Dehejia-Wahba experimental file from Dehejia's replication
# site instead (requires internet access).
nsw <- read_dta("https://users.nber.org/~rdehejia/data/nsw_dw.dta")

cat("=== NSW Experimental Benchmark ===\n")
cat("N treated:", sum(nsw$treat == 1), "\n")
cat("N control:", sum(nsw$treat == 0), "\n")
cat("Mean RE78 (treated): $", round(mean(nsw$re78[nsw$treat == 1])), "\n")
cat("Mean RE78 (control): $", round(mean(nsw$re78[nsw$treat == 0])), "\n")

# Experimental benchmark: difference in mean 1978 earnings
ate_exp <- mean(nsw$re78[nsw$treat == 1]) - mean(nsw$re78[nsw$treat == 0])
cat("Experimental ATE: $", round(ate_exp), "\n")

Expected output:

=== NSW Experimental Benchmark ===
Treated: n = 185, mean RE78 = $6,349
Control: n = 260, mean RE78 = $4,555
Experimental ATE: $1,794
(Dehejia-Wahba report approximately $1,794)
Group                         N    Mean RE78
NSW Treated                 185      ~$6,349
NSW Control (experimental)  260      ~$4,555
Experimental ATE                     ~$1,794

This experimental estimate serves as the benchmark against which all observational methods will be evaluated.


Step 2: Construct the Observational Dataset

Replace the experimental control group with a non-experimental comparison group drawn from the CPS or PSID.

# MatchIt's 'lalonde' uses a PSID comparison group; the numbers below
# follow Dehejia-Wahba's CPS comparison. Download the CPS comparison
# group from Dehejia's site (https://users.nber.org/~rdehejia/data/)
# and combine it with the NSW treated units from Step 1.
cps <- haven::read_dta("https://users.nber.org/~rdehejia/data/cps_controls.dta")

# Construct: NSW treated + non-experimental CPS controls
# (the two Dehejia files share the same column layout)
nsw_treat <- nsw[nsw$treat == 1, ]
obs <- rbind(nsw_treat[, names(cps)], cps)

naive_diff <- mean(obs$re78[obs$treat == 1]) - mean(obs$re78[obs$treat == 0])
cat("Treated mean RE78: $", round(mean(obs$re78[obs$treat == 1])), "\n")
cat("CPS mean RE78: $", round(mean(obs$re78[obs$treat == 0])), "\n")
cat("Naive difference: $", round(naive_diff), "\n")
cat("Experimental ATE: $", round(ate_exp), "\n")
cat("Bias: $", round(naive_diff - ate_exp), "\n")

Expected output:

=== Naive Observational Estimate ===
Treated mean RE78: $6,349
CPS mean RE78:     $14,847
Naive difference:  -$8,498
Experimental ATE:  $1,794

Bias: -$10,292

The CPS group earns much more — naive comparison is severely biased.
Statistic                      Value
NSW treated mean RE78        ~$6,349
CPS control mean RE78       ~$14,847
Naive difference            ~-$8,498
Experimental ATE             ~$1,794
Bias                       ~-$10,292

The naive observational difference is not just wrong; it has the wrong sign. Instead of the true positive effect of roughly $1,794, the naive estimate suggests the program reduced earnings by roughly $8,500.

Concept Check

The naive difference between NSW treated individuals and CPS comparison individuals is very different from the experimental estimate. What is the primary source of this bias?


Step 3: Estimate the Propensity Score

# Propensity score (Dehejia-Wahba specification)
ps_model <- glm(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, family = binomial)
obs$pscore <- predict(ps_model, type = "response")

cat("PS range (treated):", round(range(obs$pscore[obs$treat == 1]), 4), "\n")
cat("PS range (control):", round(range(obs$pscore[obs$treat == 0]), 6), "\n")

# Overlap check: overlaid histograms of the propensity score
hist(obs$pscore[obs$treat == 1], col = rgb(0, 0, 1, 0.5), xlim = c(0, 1),
     main = "Propensity Score Distribution", xlab = "Propensity Score")
hist(obs$pscore[obs$treat == 0], col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("topright", c("Treated", "Control"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))

Expected output:

PS range (treated): [0.0215, 0.9842]
PS range (control): [0.000002, 0.8215]
Note: most CPS controls have very low propensity scores
Group         PS Min    PS Median    PS Max
NSW Treated    0.021        0.625     0.984
CPS Control    0.000        0.005     0.822
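The lack of overlap noted above can be quantified directly. A minimal sketch, assuming `obs` and its `pscore` column from the code above: count the share of comparison units whose propensity score even reaches the treated group's range.

```r
# Sketch: quantify overlap (assumes 'obs' with 'pscore' from Step 3).
# Controls below the lowest treated score cannot serve as close matches.
min_treat_ps <- min(obs$pscore[obs$treat == 1])
share_overlap <- mean(obs$pscore[obs$treat == 0] >= min_treat_ps)
cat("Share of controls at/above the lowest treated PS:",
    round(share_overlap, 3), "\n")
```

With the CPS comparison group, this share is small: most controls sit in a region of the covariate space where no treated units exist.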

Step 4: Nearest-Neighbor Matching on the Propensity Score

# 1:1 nearest-neighbor matching on the propensity score
m_nn <- matchit(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, method = "nearest", distance = "glm")
summary(m_nn)

# Extract matched data and estimate the ATT
m_data <- match.data(m_nn)
ate_nn <- lm(re78 ~ treat, data = m_data, weights = weights)
cat("NN Matching ATT: $", round(coef(ate_nn)["treat"]), "\n")
cat("Experimental ATE: $", round(ate_exp), "\n")

# Caliper variant: drop treated units with no control within 0.05
# standard deviations of the propensity score (MatchIt's default scale)
m_cal <- matchit(treat ~ age + education + black + hispanic + married +
                   nodegree + re74 + re75,
                 data = obs, method = "nearest", distance = "glm",
                 caliper = 0.05)
m_cal_data <- match.data(m_cal)
cat("Caliper ATT: $", round(coef(lm(re78 ~ treat, data = m_cal_data,
                                    weights = weights))["treat"]), "\n")

Expected output:

=== Nearest-Neighbor PS Matching (1:1) ===
Matched ATT: $1,652
Experimental ATE: $1,794
Bias: -$142

With caliper = 0.05: 142/185 treated units matched
Caliper-matched ATT: $1,815

Method                    ATT Estimate    Bias vs. Experimental    N Matched
NN PS matching (1:1)           ~$1,652                    -$142          185
Caliper matching (0.05)        ~$1,815                     +$21         ~142

Nearest-neighbor matching on the propensity score recovers an estimate close to the experimental benchmark. Caliper matching drops treated units without close matches, potentially improving the estimate at the cost of a narrower estimand.
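The narrowing of the estimand can be checked directly by comparing matched treated units with all treated units on a pre-treatment covariate. A sketch, assuming `m_nn` and `obs` from this step (1975 earnings are used for illustration):

```r
# Sketch: do the retained treated units resemble all treated units?
# matchit() assigns weight 0 to units dropped from the matched sample.
matched_treat <- obs$treat == 1 & m_nn$weights > 0
cat("Mean RE75, all treated:     $", round(mean(obs$re75[obs$treat == 1])), "\n")
cat("Mean RE75, matched treated: $", round(mean(obs$re75[matched_treat])), "\n")
```

If the two means diverge, the caliper-matched estimate applies to a subpopulation of the treated, not to all NSW participants.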


Step 5: Coarsened Exact Matching (CEM)

CEM creates strata based on coarsened covariate values, then matches exactly within strata.
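The coarsening idea can be illustrated with base R alone. The sketch below bins `age` at illustrative breakpoints, as CEM does internally, and cross-tabulates the bins by treatment; CEM then matches exactly within cells that contain both treated and control units (assumes `obs` from Step 2):

```r
# Illustration only: coarsen a continuous covariate into bins, then
# inspect which bins contain both treated and control units
age_bins <- cut(obs$age, breaks = c(-Inf, 25, 30, 40, Inf))
table(age_bins, obs$treat)
```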

# CEM using MatchIt (coarsening chosen for illustration)
m_cem <- matchit(treat ~ age + education + black + hispanic + married +
                   nodegree + re74 + re75,
                 data = obs, method = "cem",
                 cutpoints = list(age = c(25, 30, 40),
                                  education = c(9, 11, 12),
                                  re74 = c(0, 5000, 15000),
                                  re75 = c(0, 5000, 15000)))
summary(m_cem)

m_cem_data <- match.data(m_cem)
ate_cem <- lm(re78 ~ treat, data = m_cem_data, weights = weights)
cat("CEM ATT: $", round(coef(ate_cem)["treat"]), "\n")

Expected output:

=== Coarsened Exact Matching ===
Matched: 148 treated, 1,245 controls
CEM ATT: $1,925
Experimental ATE: $1,794

Method                    ATT Estimate    N Treated Matched    N Control Matched
CEM                            ~$1,925                 ~148               ~1,245
Experimental benchmark          $1,794

CEM drops ~37 treated units that have no counterpart in any CPS stratum. The estimate is close to the experimental benchmark, but pertains only to the matchable subpopulation.
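One way to see how the estimand shifts is to profile the discarded treated units. A sketch, assuming `m_cem` and `obs` from this step; `matchit()` gives unmatched units weight zero:

```r
# Sketch: compare dropped vs. retained treated units on 1975 earnings
dropped  <- obs$treat == 1 & m_cem$weights == 0
retained <- obs$treat == 1 & m_cem$weights > 0
cat("Mean RE75, dropped treated:  $", round(mean(obs$re75[dropped])), "\n")
cat("Mean RE75, retained treated: $", round(mean(obs$re75[retained])), "\n")
```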

Concept Check

CEM guarantees exact balance on coarsened covariates within matched strata, but it typically discards many observations. In this Lalonde replication, why is the tradeoff particularly relevant?


Step 6: Compare All Estimates with the Experimental Benchmark

# Summary comparison
ols_fit <- lm(re78 ~ treat + age + education + black + hispanic +
                married + nodegree + re74 + re75, data = obs)

cat("=== Summary ===\n")
cat("Experimental: $", round(ate_exp), "\n")
cat("Naive: $", round(naive_diff), "\n")
cat("OLS: $", round(coef(ols_fit)["treat"]), "\n")
cat("NN Matching: $", round(coef(ate_nn)["treat"]), "\n")
cat("CEM: $", round(coef(ate_cem)["treat"]), "\n")

# Balance comparison (cobalt)
love.plot(m_nn, stats = "mean.diffs", abs = TRUE,
          title = "Covariate Balance: NN Matching")

Expected output:

=== Summary of Estimates ===
       Method                     ATE       Bias
 Experimental benchmark        $1,794         $0
 Naive (unadjusted)           -$8,498   -$10,292
 OLS with controls             $1,245      -$549
 NN PS matching (1:1)          $1,652      -$142
 PS matching (caliper)         $1,815        $21
 CEM                           $1,925       $131

Extension Exercises

  1. Drop pre-treatment earnings. Re-run all matching methods without re74 and re75. How much worse are the estimates? This exercise replicates Lalonde's (1986) original finding.

  2. Use the PSID comparison group. The PSID comparison group is smaller and somewhat more similar to the NSW treated group. Repeat the analysis and compare.

  3. Try doubly robust estimation. Implement an AIPW estimator (combining PS weighting with outcome regression) and compare with matching alone.

  4. Vary the caliper width. Run PS matching with calipers of 0.01, 0.05, 0.10, and 0.25. Plot the tradeoff between the number of matches and the estimated ATE.

  5. Sensitivity analysis. Use the Rosenbaum bounds or the Cinelli-Hazlett framework to assess how sensitive the matching estimate is to unobserved confounders.
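
For exercise 3, one possible starting point is the standard doubly robust (AIPW-style) ATT estimator, which combines an outcome regression fit on controls with an odds-of-treatment weighting correction. A hedged sketch, assuming `obs` and its `pscore` column from Step 3; in practice, controls with propensity scores near 1 should be trimmed, since the odds weight e/(1-e) explodes:

```r
# Doubly robust ATT sketch: outcome model on controls plus an
# odds-weighted correction using the estimated propensity score
mu0 <- lm(re78 ~ age + education + black + hispanic + married +
            nodegree + re74 + re75, data = obs[obs$treat == 0, ])
mu0_hat <- predict(mu0, newdata = obs)
e <- obs$pscore
n1 <- sum(obs$treat)
att_dr <- sum(obs$treat * (obs$re78 - mu0_hat)) / n1 -
  sum((1 - obs$treat) * (e / (1 - e)) * (obs$re78 - mu0_hat)) / n1
cat("Doubly robust ATT: $", round(att_dr), "\n")
```

The estimator is consistent if either the outcome model or the propensity score model is correctly specified, which is the sense in which it is "doubly robust".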


Summary

In this lab you learned:

  • The NSW experimental benchmark provides a ground truth for evaluating observational estimators
  • Naive comparisons with non-experimental control groups can be severely biased due to covariate imbalance
  • Propensity score matching recovers the experimental benchmark when the right covariates (especially lagged outcomes) are available and overlap is sufficient
  • CEM guarantees exact balance on coarsened covariates but may discard observations, changing the estimand
  • Covariate balance diagnostics (standardized mean differences, overlap plots) are essential for assessing match quality
  • The Lalonde dataset remains a canonical benchmark in causal inference: a method that cannot recover the experimental estimate from observational data here deserves skepticism in settings where no experiment exists