Lab: Replicating Dehejia-Wahba (1999) Propensity Score Matching
Replicate the classic Dehejia-Wahba (1999) propensity score analysis of the National Supported Work (NSW) program. Compare the experimental benchmark with observational estimates from nearest-neighbor and caliper propensity score matching and from coarsened exact matching (CEM).
Overview
In this lab you will replicate one of the most influential studies in the causal inference literature. Lalonde (1986) showed that non-experimental estimators often fail to recover the experimental benchmark estimate of a job training program. Dehejia and Wahba (1999) demonstrated that propensity score matching can recover this benchmark from observational data. You will reproduce their key findings and extend the analysis with modern matching methods.
What you will learn:
- How to load and work with the classic Lalonde/NSW dataset
- How the experimental benchmark provides a ground truth for evaluating observational methods
- How to implement propensity score matching, nearest-neighbor matching, and CEM
- How to assess covariate balance before and after matching
- Why some observational comparison groups are harder to match than others
Prerequisites: Familiarity with the potential outcomes framework and propensity scores. Completion of the matching tutorial lab is recommended.
Step 1: Load the NSW Experimental Data
The NSW dataset contains earnings data for randomly assigned treatment and control groups from the National Supported Work program.
```r
library(MatchIt)
library(cobalt)
# The experimental Dehejia-Wahba sample (185 treated, 260 randomized
# controls) ships with the Matching package (install.packages("Matching")
# if needed). Note that MatchIt's own 'lalonde' dataset is the
# *observational* version used in Step 2, not the experimental one.
data("lalonde", package = "Matching")
nsw <- lalonde

cat("=== NSW Experimental Benchmark ===\n")
cat("Treated: n =", sum(nsw$treat == 1),
    ", mean RE78 = $", round(mean(nsw$re78[nsw$treat == 1])), "\n")
cat("Control: n =", sum(nsw$treat == 0),
    ", mean RE78 = $", round(mean(nsw$re78[nsw$treat == 0])), "\n")

# Experimental benchmark: difference in mean 1978 earnings
ate_exp <- mean(nsw$re78[nsw$treat == 1]) - mean(nsw$re78[nsw$treat == 0])
cat("Experimental ATE: $", round(ate_exp), "\n")
```

Expected output:

```
=== NSW Experimental Benchmark ===
Treated: n = 185, mean RE78 = $6,349
Control: n = 260, mean RE78 = $4,555
Experimental ATE: $1,794
```

(Dehejia and Wahba report approximately $1,794.)
| Group | N | Mean RE78 |
|---|---|---|
| NSW Treated | 185 | ~$6,349 |
| NSW Control (experimental) | 260 | ~$4,555 |
| Experimental ATE | — | ~$1,794 |
This experimental estimate serves as the benchmark against which all observational methods will be evaluated.
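It is worth making explicit why this simple difference can serve as ground truth. Under random assignment, treatment status is independent of the potential outcomes, so the difference in group means is unbiased for the average treatment effect:

$$
\hat{\tau}_{\text{exp}} = \bar{Y}_{1} - \bar{Y}_{0}, \qquad
\mathbb{E}[\hat{\tau}_{\text{exp}}] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \text{ATE},
$$

where $Y(1)$ and $Y(0)$ are potential outcomes and $\bar{Y}_1$, $\bar{Y}_0$ are the treated and control means of `re78`. No such guarantee exists once the control group is non-experimental, which is the point of the rest of the lab.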
Step 2: Construct the Observational Dataset
Replace the experimental control group with a non-experimental comparison group drawn from the CPS or PSID.
```r
# Build the observational dataset: NSW treated units plus the
# non-experimental CPS comparison group. The files below are the ones
# distributed at https://users.nber.org/~rdehejia/data/; the column
# order follows the documentation on that page.
cols <- c("treat", "age", "education", "black", "hispanic",
          "married", "nodegree", "re74", "re75", "re78")
base_url <- "https://users.nber.org/~rdehejia/data/"
nsw_treat <- read.table(paste0(base_url, "nswre74_treated.txt"), col.names = cols)
cps_ctrl  <- read.table(paste0(base_url, "cps_controls.txt"),    col.names = cols)
obs <- rbind(nsw_treat, cps_ctrl)

naive_diff <- mean(obs$re78[obs$treat == 1]) - mean(obs$re78[obs$treat == 0])
cat("Naive difference: $", round(naive_diff), "\n")
cat("Experimental ATE: $", round(ate_exp), "\n")
cat("Bias: $", round(naive_diff - ate_exp), "\n")
```

Expected output:

```
=== Naive Observational Estimate ===
Treated mean RE78: $6,349
CPS mean RE78: $14,847
Naive difference: -$8,498
Experimental ATE: $1,794
Bias: -$10,292
```
The CPS group earns much more — naive comparison is severely biased.
| Statistic | Value |
|---|---|
| NSW treated mean RE78 | ~$6,349 |
| CPS control mean RE78 | ~$14,847 |
| Naive difference | ~-$8,498 |
| Experimental ATE | ~$1,794 |
| Bias | ~-$10,292 |
The naive observational difference is not just wrong, it has the wrong sign: instead of the true gain of about $1,794, it implies the program reduced earnings by roughly $8,500.
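The source of this failure can be written down directly. For any comparison of group means,

$$
\underbrace{\mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0]}_{\text{naive difference}}
= \underbrace{\mathbb{E}[Y(1) - Y(0) \mid D=1]}_{\text{ATT}}
+ \underbrace{\mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]}_{\text{selection bias}}.
$$

Here the selection term is large and negative: NSW participants would have earned far less than typical CPS respondents even without treatment, and that gap swamps the positive ATT.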
The naive difference between NSW treated individuals and CPS comparison individuals is very different from the experimental estimate. What is the primary source of this bias?
Step 3: Estimate the Propensity Score
```r
# Propensity score model on the Dehejia-Wahba covariates (their full
# specification also adds higher-order terms)
ps_model <- glm(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, family = binomial)
obs$pscore <- predict(ps_model, type = "response")

# Overlap check: overlay the two propensity score distributions
hist(obs$pscore[obs$treat == 1], col = rgb(0, 0, 1, 0.5), xlim = c(0, 1),
     main = "Propensity Score Distribution", xlab = "Propensity Score")
hist(obs$pscore[obs$treat == 0], col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("topright", c("Treated", "Control"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))
```

Expected output:

```
PS range (treated): [0.0215, 0.9842]
PS range (control): [0.000002, 0.8215]
```

Note that most CPS controls have very low propensity scores.
| Group | PS Min | PS Median | PS Max |
|---|---|---|---|
| NSW Treated | 0.021 | 0.625 | 0.984 |
| CPS Control | 0.000 | 0.005 | 0.822 |
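Matching on this single scalar is justified by the Rosenbaum-Rubin (1983) balancing property: conditional on the propensity score $e(X) = \Pr(D = 1 \mid X)$, treatment status is independent of the covariates,

$$
D \perp X \mid e(X),
$$

and if treatment is unconfounded given $X$, it is also unconfounded given $e(X)$ alone. This reduces an eight-dimensional matching problem to a one-dimensional one, at the price of having to get the propensity model approximately right.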
Step 4: Nearest-Neighbor Matching on the Propensity Score
```r
# 1:1 nearest-neighbor matching on the propensity score
m_nn <- matchit(treat ~ age + education + black + hispanic + married +
                  nodegree + re74 + re75,
                data = obs, method = "nearest", distance = "glm")
summary(m_nn)

# Extract matched data and estimate the ATT
m_data <- match.data(m_nn)
ate_nn <- lm(re78 ~ treat, data = m_data, weights = weights)
cat("NN Matching ATT:", coef(ate_nn)["treat"], "\n")
cat("Experimental:", ate_exp, "\n")

# Re-run with a caliper (in SDs of the distance measure) to discard
# treated units without a close match:
# matchit(..., method = "nearest", distance = "glm", caliper = 0.05)
```

Expected output:

```
=== Nearest-Neighbor PS Matching (1:1) ===
Matched ATT: $1,652
Experimental ATE: $1,794
Bias: -$142

With caliper = 0.05: 142/185 treated units matched
Caliper-matched ATT: $1,815
```

| Method | ATT Estimate | Bias vs. Experimental | N Treated Matched |
|---|---|---|---|
| NN PS matching (1:1) | ~$1,652 | -$142 | 185 |
| Caliper matching (0.05) | ~$1,815 | +$21 | ~142 |
Nearest-neighbor matching on the propensity score recovers an estimate close to the experimental benchmark. Caliper matching drops treated units without close matches, potentially improving the estimate at the cost of a narrower estimand.
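The mechanics that `matchit()` automates can be sketched in a few lines. The toy example below (written in Python as a language-agnostic illustration; the scores are made up, and this greedy without-replacement scheme is not MatchIt's exact algorithm) shows how a caliper drops treated units with no nearby control:

```python
# Greedy 1:1 nearest-neighbor matching on a scalar score with a caliper.

def nn_match(treated, controls, caliper):
    """Pair each treated score with its nearest available control score.

    Controls are used without replacement; a treated unit whose nearest
    available control lies farther than `caliper` is dropped.
    Returns (matches, dropped).
    """
    available = list(controls)
    matches, dropped = [], []
    for t in treated:
        if not available:
            dropped.append(t)
            continue
        c = min(available, key=lambda x: abs(x - t))
        if abs(c - t) <= caliper:
            matches.append((t, c))
            available.remove(c)
        else:
            dropped.append(t)  # no control within the caliper
    return matches, dropped

# Treated scores cluster high; most controls sit near zero,
# mimicking the NSW-vs-CPS overlap problem.
treated = [0.60, 0.62, 0.95]
controls = [0.05, 0.07, 0.58, 0.63]
matches, dropped = nn_match(treated, controls, caliper=0.10)
print(matches)  # [(0.6, 0.58), (0.62, 0.63)]
print(dropped)  # [0.95] -- no control close enough
```

The dropped unit plays the same role as the ~43 treated units a 0.05 caliper discards in the lab: the estimand quietly shrinks to the ATT among matchable treated units.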
Step 5: Coarsened Exact Matching (CEM)
CEM creates strata based on coarsened covariate values, then matches exactly within strata.
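Under the hood, CEM amounts to binning each covariate and keeping only the bins that contain both groups. A minimal sketch (in Python for illustration; the units and cutpoints are hypothetical, chosen to mirror the R call below):

```python
# Coarsened exact matching in miniature: coarsen covariates into bins,
# then keep only strata containing both treated and control units.
from bisect import bisect_right
from collections import defaultdict

def stratum_key(unit, cutspec):
    """Coarsen each covariate to a bin index; the tuple of bin indices
    identifies the unit's stratum."""
    return tuple(bisect_right(cuts, unit[var]) for var, cuts in cutspec.items())

def cem(units, cutspec):
    """Return only the strata that contain units from both groups."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_key(u, cutspec)].append(u)
    return {k: g for k, g in strata.items()
            if any(u["treat"] == 1 for u in g)
            and any(u["treat"] == 0 for u in g)}

cutspec = {"age": [25, 30, 40], "re75": [0, 5000, 15000]}
units = [
    {"treat": 1, "age": 22, "re75": 0},      # young, no 1975 earnings
    {"treat": 0, "age": 24, "re75": 3000},   # same stratum -> kept
    {"treat": 1, "age": 45, "re75": 20000},  # no control in its stratum -> dropped
    {"treat": 0, "age": 50, "re75": 100},    # no treated in its stratum -> dropped
]
matched = cem(units, cutspec)
print(len(matched))  # 1 -- a single matched stratum survives
```

Estimation then proceeds within the surviving strata, weighting controls to the treated distribution, which is what `match.data()` returns below.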
```r
# Coarsened exact matching via MatchIt
m_cem <- matchit(treat ~ age + education + black + hispanic + married +
                   nodegree + re74 + re75,
                 data = obs, method = "cem",
                 cutpoints = list(age = c(25, 30, 40),
                                  education = c(9, 11, 12),
                                  re74 = c(0, 5000, 15000),
                                  re75 = c(0, 5000, 15000)))
summary(m_cem)

m_cem_data <- match.data(m_cem)
ate_cem <- lm(re78 ~ treat, data = m_cem_data, weights = weights)
cat("CEM ATT:", coef(ate_cem)["treat"], "\n")
```

Expected output:

```
=== Coarsened Exact Matching ===
Matched: 148 treated, 1,245 controls
CEM ATT: $1,925
Experimental ATE: $1,794
```

| Method | ATT Estimate | N Treated Matched | N Control Matched |
|---|---|---|---|
| CEM | ~$1,925 | ~148 | ~1,245 |
| Experimental benchmark | $1,794 | — | — |
CEM drops ~37 treated units that have no counterpart in any CPS stratum. The estimate is close to the experimental benchmark, but pertains only to the matchable subpopulation.
CEM guarantees exact balance on coarsened covariates within matched strata, but it typically discards many observations. In this Lalonde replication, why is the tradeoff particularly relevant?
Step 6: Compare All Estimates with the Experimental Benchmark
```r
# Summary comparison of all estimates
cat("=== Summary ===\n")
cat("Experimental:", round(ate_exp), "\n")
cat("Naive:", round(naive_diff), "\n")
cat("OLS:", round(coef(lm(re78 ~ treat + age + education + black + hispanic +
                            married + nodegree + re74 + re75,
                          data = obs))["treat"]), "\n")
cat("NN Matching:", round(coef(ate_nn)["treat"]), "\n")
cat("CEM:", round(coef(ate_cem)["treat"]), "\n")

# Balance comparison (absolute standardized mean differences)
love.plot(m_nn, stats = "mean.diffs", abs = TRUE,
          title = "Covariate Balance: NN Matching")
```

Expected output:

```
=== Summary of Estimates ===
Method                    Estimate    Bias
Experimental benchmark    $1,794      $0
Naive (unadjusted)        -$8,498     -$10,292
OLS with controls         $1,245      -$549
NN PS matching (1:1)      $1,652      -$142
PS matching (caliper)     $1,815      +$21
CEM                       $1,925      +$131
```
| Method | Estimate | Bias |
|---|---|---|
| Experimental benchmark | $1,794 | $0 |
| Naive (unadjusted) | -$8,498 | -$10,292 |
| OLS with controls | ~$1,245 | -$549 |
| NN PS matching (1:1) | ~$1,652 | -$142 |
| PS matching (caliper) | ~$1,815 | +$21 |
| CEM | ~$1,925 | +$131 |
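The balance statistic behind `love.plot()` is the standardized mean difference. For covariate $j$,

$$
\mathrm{SMD}_j = \frac{\bar{x}_{j,\text{treated}} - \bar{x}_{j,\text{control}}}{s_j},
$$

where a common convention for ATT estimation takes $s_j$ to be the treated-group standard deviation. A widely used rule of thumb is that $|\mathrm{SMD}_j| < 0.1$ for every covariate indicates acceptable balance; before matching, the NSW-vs-CPS differences on `re74` and `re75` far exceed that threshold.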
Extension Exercises
- Drop pre-treatment earnings. Re-run all matching methods without re74 and re75. How much worse are the estimates? This exercise replicates Lalonde's (1986) original finding.
- Use the PSID comparison group. It is smaller and somewhat more similar to the NSW treated group than the CPS sample. Repeat the analysis and compare.
- Try doubly robust estimation. Implement an AIPW estimator (combining PS weighting with outcome regression) and compare with matching alone.
- Vary the caliper width. Run PS matching with calipers of 0.01, 0.05, 0.10, and 0.25. Plot the tradeoff between the number of matched units and the estimated effect.
- Sensitivity analysis. Use Rosenbaum bounds or the Cinelli-Hazlett framework to assess how sensitive the matching estimate is to unobserved confounding.
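For the doubly robust exercise, the AIPW (augmented inverse probability weighting) estimator of the ATE augments outcome regression predictions with inverse-probability-weighted residuals:

$$
\hat{\tau}_{\text{AIPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[
\hat{m}_1(X_i) - \hat{m}_0(X_i)
+ \frac{D_i \,\bigl(Y_i - \hat{m}_1(X_i)\bigr)}{\hat{e}(X_i)}
- \frac{(1 - D_i)\,\bigl(Y_i - \hat{m}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}
\right],
$$

where $\hat{m}_d(X)$ are fitted outcome regressions for each treatment arm and $\hat{e}(X)$ is the fitted propensity score. The estimator is consistent if either the outcome model or the propensity model is correctly specified, hence "doubly robust". With CPS controls, trimming observations with $\hat{e}(X)$ very near 0 or 1 matters in practice.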
Summary
In this lab you learned:
- The NSW experimental benchmark provides a ground truth for evaluating observational estimators
- Naive comparisons with non-experimental control groups can be severely biased due to covariate imbalance
- Propensity score matching recovers the experimental benchmark when the right covariates (especially lagged outcomes) are available and overlap is sufficient
- CEM guarantees exact balance on coarsened covariates but may discard observations, changing the estimand
- Covariate balance diagnostics (standardized mean differences, overlap plots) are essential for assessing match quality
- The Lalonde dataset remains a canonical benchmark in causal inference: a method that cannot recover the experimental estimate here deserves skepticism in settings where no experimental check exists