Lab: Replicating Dehejia-Wahba (1999) Propensity Score Matching
Replicate the classic Dehejia-Wahba (1999) propensity score analysis of the National Supported Work (NSW) program. Compare the experimental benchmark with observational estimates from nearest-neighbor and caliper propensity score matching and from coarsened exact matching (CEM).
Overview
In this lab you will replicate one of the most influential studies in the causal inference literature. Lalonde (1986) showed that non-experimental estimators often fail to recover the experimental benchmark estimate of a job training program. Dehejia and Wahba (1999) demonstrated that propensity score matching can recover this benchmark from observational data. You will reproduce their key findings and extend the analysis with modern matching methods.
What you will learn:
- How to load and work with the classic Lalonde/NSW dataset
- How the experimental benchmark provides a ground truth for evaluating observational methods
- How to implement propensity score matching, nearest-neighbor matching, and CEM
- How to assess covariate balance before and after matching
- Why some observational comparison groups are harder to match than others
Prerequisites: Familiarity with the potential outcomes framework and propensity scores. Completion of the matching tutorial lab is recommended.
Step 1: Load the NSW Experimental Data
The NSW dataset contains earnings data for randomly assigned treatment and control groups from the National Supported Work program.
```r
library(MatchIt)
library(cobalt)
# The experimental Dehejia-Wahba sample (185 treated, 260 randomized
# controls) ships with the Matching package (install.packages("Matching")
# if needed). Note that MatchIt's own 'lalonde' dataset is the
# *observational* version used in Step 2, not the experimental one.
data("lalonde", package = "Matching")
nsw <- lalonde

cat("=== NSW Experimental Benchmark ===\n")
cat("Treated: n =", sum(nsw$treat == 1),
    ", mean RE78 = $", round(mean(nsw$re78[nsw$treat == 1])), "\n")
cat("Control: n =", sum(nsw$treat == 0),
    ", mean RE78 = $", round(mean(nsw$re78[nsw$treat == 0])), "\n")

# Experimental benchmark: difference in mean 1978 earnings
ate_exp <- mean(nsw$re78[nsw$treat == 1]) - mean(nsw$re78[nsw$treat == 0])
cat("Experimental ATE: $", round(ate_exp), "\n")
```

Expected output:

```
=== NSW Experimental Benchmark ===
Treated: n = 185, mean RE78 = $6,349
Control: n = 260, mean RE78 = $4,555
Experimental ATE: $1,794
```

(Dehejia and Wahba report approximately $1,794.)
| Group | N | Mean RE78 |
|---|---|---|
| NSW Treated | 185 | ~$6,349 |
| NSW Control (experimental) | 260 | ~$4,555 |
| Experimental ATE | — | ~$1,794 |
This experimental estimate serves as the benchmark against which all observational methods will be evaluated.
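It is worth making explicit why this simple difference can serve as ground truth. Under random assignment, treatment status is independent of the potential outcomes, so the difference in group means is unbiased for the average treatment effect:

$$
\hat{\tau}_{\text{exp}} = \bar{Y}_{1} - \bar{Y}_{0}, \qquad
\mathbb{E}[\hat{\tau}_{\text{exp}}] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \text{ATE},
$$

where $Y(1)$ and $Y(0)$ are potential outcomes and $\bar{Y}_1$, $\bar{Y}_0$ are the treated and control means of `re78`. No such guarantee exists once the control group is non-experimental, which is the point of the rest of the lab.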
Step 2: Construct the Observational Dataset
Replace the experimental control group with a non-experimental comparison group drawn from the CPS or PSID.
```r
# Build the observational dataset: NSW treated units plus the
# non-experimental CPS comparison group. The files below are the ones
# distributed at https://users.nber.org/~rdehejia/data/; the column
# order follows the documentation on that page.
cols <- c("treat", "age", "education", "black", "hispanic",
          "married", "nodegree", "re74", "re75", "re78")
base_url <- "https://users.nber.org/~rdehejia/data/"
nsw_treat <- read.table(paste0(base_url, "nswre74_treated.txt"), col.names = cols)
cps_ctrl  <- read.table(paste0(base_url, "cps_controls.txt"),    col.names = cols)
obs <- rbind(nsw_treat, cps_ctrl)

naive_diff <- mean(obs$re78[obs$treat == 1]) - mean(obs$re78[obs$treat == 0])
cat("Naive difference: $", round(naive_diff), "\n")
cat("Experimental ATE: $", round(ate_exp), "\n")
cat("Bias: $", round(naive_diff - ate_exp), "\n")
```

Expected output:

```
=== Naive Observational Estimate ===
Treated mean RE78: $6,349
CPS mean RE78: $14,847
Naive difference: -$8,498
Experimental ATE: $1,794
Bias: -$10,292
```
The CPS group earns much more — naive comparison is severely biased.
| Statistic | Value |
|---|---|
| NSW treated mean RE78 | ~$6,349 |
| CPS control mean RE78 | ~$14,847 |
| Naive difference | ~-$8,498 |
| Experimental ATE | ~$1,794 |
| Bias | ~-$10,292 |
The naive observational difference is not just wrong, it has the wrong sign: instead of the true gain of about $1,794, it implies the program reduced earnings by roughly $8,500.
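The source of this failure can be written down directly. For any comparison of group means,

$$
\underbrace{\mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0]}_{\text{naive difference}}
= \underbrace{\mathbb{E}[Y(1) - Y(0) \mid D=1]}_{\text{ATT}}
+ \underbrace{\mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]}_{\text{selection bias}}.
$$

Here the selection term is large and negative: NSW participants would have earned far less than typical CPS respondents even without treatment, and that gap swamps the positive ATT.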
The naive difference between NSW treated individuals and CPS comparison individuals is very different from the experimental estimate. What is the primary source of this bias?
Step 3: Estimate the Propensity Score
```r
# Propensity score model on the Dehejia-Wahba covariates (their full
# specification also adds higher-order terms)
ps_model <- glm(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, family = binomial)
obs$pscore <- predict(ps_model, type = "response")

# Overlap check: overlay the two propensity score distributions
hist(obs$pscore[obs$treat == 1], col = rgb(0, 0, 1, 0.5), xlim = c(0, 1),
     main = "Propensity Score Distribution", xlab = "Propensity Score")
hist(obs$pscore[obs$treat == 0], col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("topright", c("Treated", "Control"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))
```

Expected output:

```
PS range (treated): [0.0215, 0.9842]
PS range (control): [0.000002, 0.8215]
```

Note that most CPS controls have very low propensity scores.
| Group | PS Min | PS Median | PS Max |
|---|---|---|---|
| NSW Treated | 0.021 | 0.625 | 0.984 |
| CPS Control | 0.000 | 0.005 | 0.822 |
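Matching on this single scalar is justified by the Rosenbaum-Rubin (1983) balancing property: conditional on the propensity score $e(X) = \Pr(D = 1 \mid X)$, treatment status is independent of the covariates,

$$
D \perp X \mid e(X),
$$

and if treatment is unconfounded given $X$, it is also unconfounded given $e(X)$ alone. This reduces an eight-dimensional matching problem to a one-dimensional one, at the price of having to get the propensity model approximately right.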
Step 4: Nearest-Neighbor Matching on the Propensity Score
```r
# 1:1 nearest-neighbor matching on the propensity score
m_nn <- matchit(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, method = "nearest", distance = "glm")
summary(m_nn)

# Extract matched data and estimate the ATT
m_data <- match.data(m_nn)
ate_nn <- lm(re78 ~ treat, data = m_data, weights = weights)
cat("NN Matching ATT:", coef(ate_nn)["treat"], "\n")
cat("Experimental:", ate_exp, "\n")

# Re-run with a caliper (in SDs of the distance measure) to discard
# treated units without a close match:
# matchit(..., method = "nearest", distance = "glm", caliper = 0.05)
```

Expected output:

```
=== Nearest-Neighbor PS Matching (1:1) ===
Matched ATT: $1,652
Experimental ATE: $1,794
Bias: -$142

With caliper = 0.05: 142/185 treated units matched
Caliper-matched ATT: $1,815
```

| Method | ATT Estimate | Bias vs. Experimental | N Treated Matched |
|---|---|---|---|
| NN PS matching (1:1) | ~$1,652 | -$142 | 185 |
| Caliper matching (0.05) | ~$1,815 | +$21 | ~142 |
Nearest-neighbor matching on the propensity score recovers an estimate close to the experimental benchmark. Caliper matching drops treated units without close matches, potentially improving the estimate at the cost of a narrower estimand.
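The mechanics that `matchit()` automates can be sketched in a few lines. The toy example below (written in Python as a language-agnostic illustration; the scores are made up, and this greedy without-replacement scheme is not MatchIt's exact algorithm) shows how a caliper drops treated units with no nearby control:

```python
# Greedy 1:1 nearest-neighbor matching on a scalar score with a caliper.

def nn_match(treated, controls, caliper):
    """Pair each treated score with its nearest available control score.

    Controls are used without replacement; a treated unit whose nearest
    available control lies farther than `caliper` is dropped.
    Returns (matches, dropped).
    """
    available = list(controls)
    matches, dropped = [], []
    for t in treated:
        if not available:
            dropped.append(t)
            continue
        c = min(available, key=lambda x: abs(x - t))
        if abs(c - t) <= caliper:
            matches.append((t, c))
            available.remove(c)
        else:
            dropped.append(t)  # no control within the caliper
    return matches, dropped

# Treated scores cluster high; most controls sit near zero,
# mimicking the NSW-vs-CPS overlap problem.
treated = [0.60, 0.62, 0.95]
controls = [0.05, 0.07, 0.58, 0.63]
matches, dropped = nn_match(treated, controls, caliper=0.10)
print(matches)  # [(0.6, 0.58), (0.62, 0.63)]
print(dropped)  # [0.95] -- no control close enough
```

The dropped unit plays the same role as the ~43 treated units a 0.05 caliper discards in the lab: the estimand quietly shrinks to the ATT among matchable treated units.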
Step 5: Coarsened Exact Matching (CEM)
CEM creates strata based on coarsened covariate values, then matches exactly within strata.
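Under the hood, CEM amounts to binning each covariate and keeping only the bins that contain both groups. A minimal sketch (in Python for illustration; the units and cutpoints are hypothetical, chosen to mirror the R call below):

```python
# Coarsened exact matching in miniature: coarsen covariates into bins,
# then keep only strata containing both treated and control units.
from bisect import bisect_right
from collections import defaultdict

def stratum_key(unit, cutspec):
    """Coarsen each covariate to a bin index; the tuple of bin indices
    identifies the unit's stratum."""
    return tuple(bisect_right(cuts, unit[var]) for var, cuts in cutspec.items())

def cem(units, cutspec):
    """Return only the strata that contain units from both groups."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_key(u, cutspec)].append(u)
    return {k: g for k, g in strata.items()
            if any(u["treat"] == 1 for u in g)
            and any(u["treat"] == 0 for u in g)}

cutspec = {"age": [25, 30, 40], "re75": [0, 5000, 15000]}
units = [
    {"treat": 1, "age": 22, "re75": 0},      # young, no 1975 earnings
    {"treat": 0, "age": 24, "re75": 3000},   # same stratum -> kept
    {"treat": 1, "age": 45, "re75": 20000},  # no control in its stratum -> dropped
    {"treat": 0, "age": 50, "re75": 100},    # no treated in its stratum -> dropped
]
matched = cem(units, cutspec)
print(len(matched))  # 1 -- a single matched stratum survives
```

Estimation then proceeds within the surviving strata, weighting controls to the treated distribution, which is what `match.data()` returns below.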
```r
# Coarsened exact matching via MatchIt
m_cem <- matchit(treat ~ age + education + black + hispanic + married +
                   nodegree + re74 + re75,
                 data = obs, method = "cem",
                 cutpoints = list(age = c(25, 30, 40),
                                  education = c(9, 11, 12),
                                  re74 = c(0, 5000, 15000),
                                  re75 = c(0, 5000, 15000)))
summary(m_cem)

m_cem_data <- match.data(m_cem)
ate_cem <- lm(re78 ~ treat, data = m_cem_data, weights = weights)
cat("CEM ATT:", coef(ate_cem)["treat"], "\n")
```

Expected output:

```
=== Coarsened Exact Matching ===
Matched: 148 treated, 1,245 controls
CEM ATT: $1,925
Experimental ATE: $1,794
```

| Method | ATT Estimate | N Treated Matched | N Control Matched |
|---|---|---|---|
| CEM | ~$1,925 | ~148 | ~1,245 |
| Experimental benchmark | $1,794 | — | — |
CEM drops ~37 treated units that have no counterpart in any CPS stratum. The estimate is close to the experimental benchmark, but pertains only to the matchable subpopulation.
CEM guarantees exact balance on coarsened covariates within matched strata, but it typically discards many observations. In this Lalonde replication, why is the tradeoff particularly relevant?
Step 6: Compare All Estimates with the Experimental Benchmark
```r
# Summary comparison of all estimates
cat("=== Summary ===\n")
cat("Experimental:", round(ate_exp), "\n")
cat("Naive:", round(naive_diff), "\n")
cat("OLS:", round(coef(lm(re78 ~ treat + age + education + black + hispanic +
                            married + nodegree + re74 + re75,
                          data = obs))["treat"]), "\n")
cat("NN Matching:", round(coef(ate_nn)["treat"]), "\n")
cat("CEM:", round(coef(ate_cem)["treat"]), "\n")

# Balance comparison (absolute standardized mean differences)
love.plot(m_nn, stats = "mean.diffs", abs = TRUE,
          title = "Covariate Balance: NN Matching")
```

Expected output:

```
=== Summary of Estimates ===
Method                    Estimate    Bias
Experimental benchmark    $1,794      $0
Naive (unadjusted)        -$8,498     -$10,292
OLS with controls         $1,245      -$549
NN PS matching (1:1)      $1,652      -$142
PS matching (caliper)     $1,815      +$21
CEM                       $1,925      +$131
```
| Method | Estimate | Bias |
|---|---|---|
| Experimental benchmark | $1,794 | $0 |
| Naive (unadjusted) | -$8,498 | -$10,292 |
| OLS with controls | ~$1,245 | -$549 |
| NN PS matching (1:1) | ~$1,652 | -$142 |
| PS matching (caliper) | ~$1,815 | +$21 |
| CEM | ~$1,925 | +$131 |
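The balance statistic behind `love.plot()` is the standardized mean difference. For covariate $j$,

$$
\mathrm{SMD}_j = \frac{\bar{x}_{j,\text{treated}} - \bar{x}_{j,\text{control}}}{s_j},
$$

where a common convention for ATT estimation takes $s_j$ to be the treated-group standard deviation. A widely used rule of thumb is that $|\mathrm{SMD}_j| < 0.1$ for every covariate indicates acceptable balance; before matching, the NSW-vs-CPS differences on `re74` and `re75` far exceed that threshold.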
Extension Exercises
- Drop pre-treatment earnings. Re-run all matching methods without re74 and re75. How much worse are the estimates? This exercise replicates Lalonde's (1986) original finding.
- Use the PSID comparison group. It is smaller and somewhat more similar to the NSW treated group than the CPS sample. Repeat the analysis and compare.
- Try doubly robust estimation. Implement an AIPW estimator (combining PS weighting with outcome regression) and compare with matching alone.
- Vary the caliper width. Run PS matching with calipers of 0.01, 0.05, 0.10, and 0.25. Plot the tradeoff between the number of matched units and the estimated effect.
- Sensitivity analysis. Use Rosenbaum bounds or the Cinelli-Hazlett framework to assess how sensitive the matching estimate is to unobserved confounding.
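For the doubly robust exercise, the AIPW (augmented inverse probability weighting) estimator of the ATE augments outcome regression predictions with inverse-probability-weighted residuals:

$$
\hat{\tau}_{\text{AIPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[
\hat{m}_1(X_i) - \hat{m}_0(X_i)
+ \frac{D_i \,\bigl(Y_i - \hat{m}_1(X_i)\bigr)}{\hat{e}(X_i)}
- \frac{(1 - D_i)\,\bigl(Y_i - \hat{m}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}
\right],
$$

where $\hat{m}_d(X)$ are fitted outcome regressions for each treatment arm and $\hat{e}(X)$ is the fitted propensity score. The estimator is consistent if either the outcome model or the propensity model is correctly specified, hence "doubly robust". With CPS controls, trimming observations with $\hat{e}(X)$ very near 0 or 1 matters in practice.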
Summary
In this lab you learned:
- The NSW experimental benchmark provides a ground truth for evaluating observational estimators
- Naive comparisons with non-experimental control groups can be severely biased due to covariate imbalance
- Propensity score matching recovers the experimental benchmark when the right covariates (especially lagged outcomes) are available and overlap is sufficient
- CEM guarantees exact balance on coarsened covariates but may discard observations, changing the estimand
- Covariate balance diagnostics (standardized mean differences, overlap plots) are essential for assessing match quality
- The Lalonde dataset remains a canonical benchmark in causal inference: a method that cannot recover the experimental estimate here deserves skepticism in settings where no experimental check exists