Lab: Matching Methods from Scratch
Implement propensity score matching, nearest-neighbor matching, and coarsened exact matching step by step. Learn to assess covariate balance, choose between matching methods, and understand when matching fails.
Overview
Matching methods estimate causal effects by pairing treated units with similar control units based on observed characteristics. The idea is simple: if you can find a control unit that looks exactly like a treated unit on every relevant dimension, their difference in outcomes estimates the treatment effect.
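This pairing logic can be sketched in a few lines of R. The example below is a toy with made-up numbers (not the lab's simulation), matching on a single covariate:

```r
# Toy illustration: pair each treated unit with the control closest in
# age, then average the treated-minus-control outcome differences
treated  <- data.frame(age = c(25, 40), y = c(12, 18))
controls <- data.frame(age = c(24, 26, 39, 55), y = c(10, 11, 16, 20))

att <- mean(sapply(seq_len(nrow(treated)), function(i) {
  j <- which.min(abs(controls$age - treated$age[i]))  # nearest on age
  treated$y[i] - controls$y[j]
}))
att  # 2: both matched pairs differ by 2
```

Real applications must match on many covariates at once, which is where the propensity score comes in.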
What you will learn:
- How to estimate propensity scores and assess their overlap
- How to implement nearest-neighbor matching (with and without replacement)
- How to implement propensity score matching (PSM) and inverse probability weighting (IPW)
- How to assess covariate balance before and after matching
- When matching works and when it fails (unobserved confounders)
Prerequisites: OLS regression (see the OLS lab). Understanding of selection bias.
Step 1: The Setting
We simulate a job training program inspired by classic evaluations of such programs. Workers decide whether to enroll in a job training program, and we observe their earnings afterward. The challenge: workers who choose to enroll differ systematically from workers who do not (selection bias).
library(MatchIt)
library(cobalt)
library(estimatr)
set.seed(2024)
n <- 3000
# Pre-treatment covariates
age <- round(pmin(pmax(rnorm(n, 33, 10), 18), 60))
education <- round(pmin(pmax(rnorm(n, 11, 3), 5), 18))
married <- rbinom(n, 1, 0.4)
black <- rbinom(n, 1, 0.3)
hispanic <- rbinom(n, 1, 0.1)
prev_earnings <- rlnorm(n, 9.5, 1.0)
# Selection into treatment (depends on covariates — NOT random!)
propensity_true <- plogis(-2 - 0.03 * age + 0.15 * education -
0.5 * married + 0.8 * black +
0.3 * hispanic - 0.0001 * prev_earnings)
treatment <- rbinom(n, 1, propensity_true)
# Outcome: post-training earnings (true treatment effect = $1,500)
true_ate <- 1500
earnings <- 10000 + 200 * age + 500 * education + 2000 * married -
1000 * black + 0.3 * prev_earnings +
true_ate * treatment + rnorm(n, 0, 3000)
df <- data.frame(earnings, treatment, age, education, married,
black, hispanic, prev_earnings)
# Naive difference is biased due to selection on observables
cat("Naive difference:", mean(earnings[treatment == 1]) -
mean(earnings[treatment == 0]), "\n")
cat("True effect:", true_ate, "\n")
Expected output:
Sample: 3000 workers, 852 treated (28.4%)
Naive difference in means:
Treated mean earnings: $26,215
Control mean earnings: $25,348
Naive difference: $867
True treatment effect: $1,500
| Statistic | Value |
|---|---|
| N (total) | 3,000 |
| N (treated) | ~852 |
| Treatment rate | ~28% |
| Treated mean earnings | ~$26,215 |
| Control mean earnings | ~$25,348 |
| Naive difference | ~$867 |
| True ATE | $1,500 |
The naive difference (~$867) understates the true effect ($1,500) because treated individuals differ systematically from controls on characteristics that also affect earnings.
Step 2: Assess Covariate Imbalance (Pre-Matching)
Before matching, document how different the treated and control groups are.
# Balance table using cobalt
bal.tab(treatment ~ age + education + married + black + hispanic +
prev_earnings, data = df, un = TRUE,
stats = c("m", "v"), thresholds = c(m = 0.1))
# Standardized mean differences > 0.1 indicate imbalance
Expected output:
Pre-Matching Covariate Balance
----------------------------------------------------------------------
Variable Treated Control Diff Std Diff
----------------------------------------------------------------------
age 31.25 33.82 -2.57 -0.254
education 11.78 10.72 1.06 0.345
married 0.31 0.44 -0.13 -0.262
black 0.42 0.25 0.17 0.377
hispanic 0.12 0.09 0.03 0.102
prev_earnings 12845.20 15621.40 -2776.20 -0.183
----------------------------------------------------------------------
Standardized differences > 0.1 (10%) indicate meaningful imbalance.
| Variable | Treated | Control | Std Diff | Balanced? |
|---|---|---|---|---|
| age | 31.25 | 33.82 | -0.254 | No |
| education | 11.78 | 10.72 | 0.345 | No |
| married | 0.31 | 0.44 | -0.262 | No |
| black | 0.42 | 0.25 | 0.377 | No |
| hispanic | 0.12 | 0.09 | 0.102 | Borderline |
| prev_earnings | 12,845 | 15,621 | -0.183 | No |
Nearly all covariates show standardized differences well above the 0.10 threshold, confirming substantial pre-matching imbalance.
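The standardized mean difference in the table can be computed by hand. The sketch below uses the pooled-standard-deviation convention on its own simulated data; note that cobalt's defaults differ slightly (for example, it reports raw differences for binary covariates):

```r
# Standardized mean difference by hand (pooled-SD convention; cobalt's
# defaults differ slightly, e.g. raw differences for binary covariates)
smd <- function(x, t) {
  m1 <- mean(x[t == 1]); m0 <- mean(x[t == 0])
  s  <- sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)  # pooled SD
  (m1 - m0) / s
}

set.seed(1)
t <- rbinom(500, 1, 0.3)
x <- rnorm(500, mean = 10 + 2 * t)  # imbalanced by construction
smd(x, t)                           # roughly 2 -- far above the 0.1 rule
```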
Step 3: Estimate Propensity Scores
The propensity score is the probability of being treated given observed covariates: e(X) = Pr(T = 1 | X). Matching on the propensity score is equivalent to matching on all covariates simultaneously (Rosenbaum & Rubin, 1983).
# Estimate propensity scores
ps_model <- glm(treatment ~ age + education + married + black +
hispanic + prev_earnings,
data = df, family = binomial)
df$pscore <- predict(ps_model, type = "response")
# Overlap check
library(ggplot2)
ggplot(df, aes(x = pscore, fill = factor(treatment))) +
geom_histogram(alpha = 0.5, position = "identity", bins = 50) +
scale_fill_manual(values = c("#0072B2", "#D55E00"),
labels = c("Control", "Treated")) +
labs(title = "Propensity Score Overlap",
x = "Propensity Score", fill = "Group") +
theme_minimal()
Expected output:
Propensity Score Summary:
| Statistic | Treated | Control |
|---|---|---|
| count | 852 | 2148 |
| mean | 0.345 | 0.254 |
| std | 0.138 | 0.121 |
| min | 0.042 | 0.015 |
| 25% | 0.238 | 0.165 |
| 50% | 0.332 | 0.240 |
| 75% | 0.438 | 0.328 |
| max | 0.785 | 0.742 |
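Beyond eyeballing the histogram, you can check common support directly: drop units whose propensity score falls outside the range observed in both groups. A minimal sketch on simulated scores (the bounds here are illustrative; analyses often also trim extreme scores by convention):

```r
# Common-support check by hand: keep only units whose propensity score
# lies inside the range observed in BOTH groups (simulated scores here)
set.seed(11)
ps <- c(runif(100, 0.05, 0.80), runif(100, 0.01, 0.75))
t  <- rep(c(1, 0), each = 100)
lo <- max(min(ps[t == 1]), min(ps[t == 0]))  # highest of the two minima
hi <- min(max(ps[t == 1]), max(ps[t == 0]))  # lowest of the two maxima
in_support <- ps >= lo & ps <= hi
table(group = t, in_support)
```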
Step 4: Propensity Score Matching
Match each treated unit to the control unit with the closest propensity score.
# Propensity score matching using MatchIt
m_ps <- matchit(treatment ~ age + education + married + black +
hispanic + prev_earnings,
data = df, method = "nearest",
distance = "glm", replace = TRUE)
summary(m_ps)
# Estimate ATT on matched sample
matched_data <- match.data(m_ps)
att_model <- lm_robust(earnings ~ treatment, data = matched_data,
se_type = "HC2")
cat("\nATT estimate:", coef(att_model)["treatment"], "\n")
cat("True effect:", true_ate, "\n")
Expected output:
Propensity Score Matching (1:1, nearest neighbor)
ATT estimate: $1,582
True effect: $1,500
N treated: 852
N matched controls: 748 unique
| Estimator | ATT Estimate | True ATE |
|---|---|---|
| PSM (1:1 NN) | ~$1,582 | $1,500 |
The PSM estimate is close to the true effect. Some control units are matched to more than one treated unit (748 unique controls for 852 treated units), as expected when matching with replacement.
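To see what matchit() is doing under the hood, here is a hand-rolled 1:1 nearest-neighbor match with replacement on the propensity score, run on its own small simulation. This is a sketch, not a substitute for MatchIt, which also handles ties, match order, and standard errors:

```r
# Hand-rolled 1:1 nearest-neighbor matching with replacement on the
# estimated propensity score (toy simulation; true ATT = 1.5)
set.seed(42)
n <- 500
x <- rnorm(n)
t <- rbinom(n, 1, plogis(0.5 * x))          # selection depends on x
y <- 2 + x + 1.5 * t + rnorm(n)             # x confounds the comparison

ps <- fitted(glm(t ~ x, family = binomial)) # estimated propensity score

ti <- which(t == 1)                         # treated indices
ci <- which(t == 0)                         # control indices
# for each treated unit, the control with the closest pscore (reusable)
match_idx <- sapply(ti, function(i) ci[which.min(abs(ps[ci] - ps[i]))])
att_hat <- mean(y[ti] - y[match_idx])
att_hat                                     # should land near 1.5
```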
Step 5: Assess Post-Matching Balance
The primary objective of matching is to make treated and control groups comparable on observed characteristics. After matching, it is essential to verify whether the procedure achieved adequate covariate balance.
# Post-matching balance using cobalt
bal.tab(m_ps, un = TRUE, stats = c("m", "v"), thresholds = c(m = 0.1))
# Love plot (visual balance assessment)
love.plot(m_ps, thresholds = c(m = 0.1),
var.order = "unadjusted",
title = "Covariate Balance Before and After Matching")
Expected output:
Post-Matching Covariate Balance
-------------------------------------------------------
Variable Treated Control Std Diff
-------------------------------------------------------
age 31.25 31.48 -0.023
education 11.78 11.72 0.020
married 0.31 0.30 0.020
black 0.42 0.41 0.022
prev_earnings 12845.20 12918.50 -0.005
-------------------------------------------------------
All standardized differences should be below 0.1 after matching.
| Variable | Pre-Match Std Diff | Post-Match Std Diff | Improved? |
|---|---|---|---|
| age | -0.254 | -0.023 | Yes |
| education | 0.345 | 0.020 | Yes |
| married | -0.262 | 0.020 | Yes |
| black | 0.377 | 0.022 | Yes |
| prev_earnings | -0.183 | -0.005 | Yes |
All standardized differences are now well below the 0.10 threshold, indicating excellent post-matching balance.
After matching, you find that standardized differences are below 0.05 for all covariates. Does this mean your treatment effect estimate is unbiased?
Step 6: Alternative Matching Methods
Coarsened Exact Matching (CEM)
Instead of matching on the propensity score, CEM bins covariates into categories and matches exactly within bins.
# CEM using MatchIt
m_cem <- matchit(treatment ~ age + education + married + black +
hispanic + prev_earnings,
data = df, method = "cem",
cutpoints = list(age = c(25, 35, 45),
education = c(9, 12, 15)))
summary(m_cem)
# ATT estimate
cem_data <- match.data(m_cem)
att_cem <- lm_robust(earnings ~ treatment, data = cem_data,
weights = cem_data$weights, se_type = "HC2")
cat("CEM ATT:", coef(att_cem)["treatment"], "\n")
cat("True effect:", true_ate, "\n")
Expected output:
CEM: 2485 matched observations out of 3000 total
Matched cells: 142
CEM ATT estimate: $1,425
True effect: $1,500
| Estimator | Matched Obs | ATT Estimate | True ATE |
|---|---|---|---|
| CEM | ~2,485 | ~$1,425 | $1,500 |
CEM discards ~515 observations that have no exact match within their coarsened stratum. The estimate is close to the true effect.
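The CEM logic itself is simple enough to hand-roll: coarsen the covariates into bins, keep strata that contain both groups, and take a treated-weighted average of within-stratum differences. A sketch on its own simulated data (the bin cutpoints are arbitrary choices here):

```r
# Hand-rolled CEM sketch: coarsen age, keep bins containing both groups,
# average within-bin differences weighted by treated counts (true ATT = 10)
set.seed(7)
n <- 400
age <- runif(n, 18, 60)
t <- rbinom(n, 1, plogis((age - 40) / 10))   # older workers enroll more
y <- 100 + age + 10 * t + rnorm(n, 0, 5)

bin <- cut(age, breaks = c(18, 25, 35, 45, 60), include.lowest = TRUE)
both <- tapply(t, bin, function(z) any(z == 1) && any(z == 0))
keep <- bin %in% names(both)[both]           # strata with both groups

bins <- unique(as.character(bin[keep]))
att <- sum(sapply(bins, function(b) {
  sel <- keep & bin == b
  n_t <- sum(t[sel] == 1)                    # treated count = stratum weight
  n_t * (mean(y[sel & t == 1]) - mean(y[sel & t == 0]))
})) / sum(t[keep] == 1)
att
```

Coarser bins keep more observations but allow more within-bin imbalance; finer bins do the reverse. That is the central CEM tradeoff.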
Inverse Probability Weighting (IPW)
Instead of dropping unmatched units, IPW reweights the sample so that the weighted distribution of covariates is balanced.
# IPW for ATT
df$ipw_weight <- ifelse(df$treatment == 1, 1,
df$pscore / (1 - df$pscore))
# Trim extreme weights
df$ipw_weight_trimmed <- pmin(df$ipw_weight,
quantile(df$ipw_weight, 0.99))
# Weighted regression
m_ipw <- lm_robust(earnings ~ treatment, data = df,
weights = ipw_weight_trimmed, se_type = "HC2")
cat("IPW ATT:", coef(m_ipw)["treatment"], "\n")
cat("True effect:", true_ate, "\n")
Expected output:
IPW ATT estimate: $1,548
True effect: $1,500
Max weight (untrimmed): 12.5
Max weight (trimmed): 4.2
| Estimator | ATT Estimate | True ATE | Max Weight |
|---|---|---|---|
| IPW (trimmed) | ~$1,548 | $1,500 | ~4.2 |
Trimming extreme weights at the 99th percentile stabilizes the estimate while keeping it close to the true effect.
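The IPW ATT estimator reduces to two weighted means: the treated mean minus the control mean reweighted by pscore / (1 - pscore). A self-contained sketch on simulated data (without trimming, for clarity):

```r
# IPW ATT by hand: compare the treated mean with a reweighted control
# mean, where control weights are pscore / (1 - pscore) (true ATT = 2)
set.seed(3)
n <- 2000
x <- rnorm(n)
t <- rbinom(n, 1, plogis(x))                 # selection depends on x
y <- x + 2 * t + rnorm(n)

ps <- fitted(glm(t ~ x, family = binomial))
w  <- ps / (1 - ps)                          # odds of treatment

att <- mean(y[t == 1]) - weighted.mean(y[t == 0], w[t == 0])
att
```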
Step 7: Compare All Methods
# Naive difference (biased due to selection)
naive <- mean(df$earnings[df$treatment == 1]) -
mean(df$earnings[df$treatment == 0])
# OLS with controls (linear adjustment for confounders)
ols_ctrl <- lm_robust(earnings ~ treatment + age + education + married +
black + hispanic + prev_earnings,
data = df, se_type = "HC2")
# Compare all methods against the true effect
cat("=== Method Comparison ===\n")
cat(sprintf("True Effect: %7.0f\n", true_ate))
cat(sprintf("Naive: %7.0f\n", naive))
cat(sprintf("OLS + Controls: %7.0f\n", coef(ols_ctrl)["treatment"]))
cat(sprintf("PSM: %7.0f\n", coef(att_model)["treatment"]))
cat(sprintf("CEM: %7.0f\n", coef(att_cem)["treatment"]))
cat(sprintf("IPW: %7.0f\n", coef(m_ipw)["treatment"]))
Expected output:
=== Method Comparison ===
Method Estimate Bias
--------------------------------------------------
True Effect $1,500 $0
Naive Difference $867 -$633
OLS with Controls $1,478 -$22
PSM (1:1 NN) $1,582 +$82
CEM $1,425 -$75
IPW $1,548 +$48
| Method | Estimate | Bias | Bias (%) |
|---|---|---|---|
| True Effect | $1,500 | $0 | 0% |
| Naive Difference | ~$867 | -$633 | -42% |
| OLS with Controls | ~$1,478 | -$22 | -1% |
| PSM (1:1 NN) | ~$1,582 | +$82 | +5% |
| CEM | ~$1,425 | -$75 | -5% |
| IPW | ~$1,548 | +$48 | +3% |
All matching and weighting methods substantially reduce bias relative to the naive difference, which is off by $633. OLS, PSM, CEM, and IPW all land within about $100 of the true effect.
Step 8: Exercises
- Matching without replacement. Require each control unit to be matched at most once (replace = FALSE). How does this affect the estimate and balance?
- Caliper matching. Set a maximum distance for matches (caliper = 0.1 standard deviations of the propensity score). How many units are dropped? How does the estimate change?
- Sensitivity analysis. Use the Rosenbaum bounds approach to assess how sensitive your estimate is to an unobserved confounder. At what level of unobserved confounding does the estimate become insignificant?
- Bad control. Add a post-treatment variable (e.g., hours worked after training) to the matching covariates. Show that this can introduce bias by conditioning on a collider.
Summary
In this lab you learned:
- Matching reduces selection bias by comparing treated units to similar control units
- Propensity score matching, CEM, and IPW are three common approaches with different tradeoffs
- It is important to check and report covariate balance before and after matching
- Matching requires the "selection on observables" assumption — if unobserved confounders drive selection, matching will be biased
- Trim extreme propensity score weights to avoid instability
- Generally avoid matching on post-treatment variables (bad controls)
- Multiple matching methods should give similar results if the design is sound; large discrepancies signal problems