MethodAtlas · Tutorial · 2 hours

Lab: Matching Methods from Scratch

Implement propensity score matching, nearest-neighbor matching, and coarsened exact matching step by step. Learn to assess covariate balance, choose between matching methods, and understand when matching fails.

Overview

Matching methods estimate causal effects by pairing treated units with similar control units based on observed characteristics. The idea is simple: if you can find a control unit that looks exactly like a treated unit on every relevant dimension, their difference in outcomes estimates the treatment effect.

What you will learn:

  • How to estimate propensity scores and assess their overlap
  • How to implement nearest-neighbor matching (with and without replacement)
  • How to implement propensity score matching (PSM) and inverse probability weighting (IPW)
  • How to assess covariate balance before and after matching
  • When matching works and when it fails (unobserved confounders)

Prerequisites: OLS regression (see the OLS lab). Understanding of selection bias.


Step 1: The Setting

We simulate a job training program in the spirit of classic job-training evaluations. Workers decide whether to enroll in a training program, and we observe their earnings afterward. The challenge: workers who choose to enroll differ systematically from workers who do not (selection bias).

library(MatchIt)
library(cobalt)
library(estimatr)

set.seed(2024)
n <- 3000

# Pre-treatment covariates
age <- round(pmin(pmax(rnorm(n, 33, 10), 18), 60))
education <- round(pmin(pmax(rnorm(n, 11, 3), 5), 18))
married <- rbinom(n, 1, 0.4)
black <- rbinom(n, 1, 0.3)
hispanic <- rbinom(n, 1, 0.1)
prev_earnings <- rlnorm(n, 9.5, 1.0)

# Selection into treatment (depends on covariates — NOT random!)
propensity_true <- plogis(-2 - 0.03 * age + 0.15 * education -
                         0.5 * married + 0.8 * black +
                         0.3 * hispanic - 0.0001 * prev_earnings)
treatment <- rbinom(n, 1, propensity_true)

# Outcome: post-training earnings (true treatment effect = $1,500)
true_ate <- 1500
earnings <- 10000 + 200 * age + 500 * education + 2000 * married -
          1000 * black + 0.3 * prev_earnings +
          true_ate * treatment + rnorm(n, 0, 3000)

df <- data.frame(earnings, treatment, age, education, married,
               black, hispanic, prev_earnings)

# Naive difference is biased due to selection on observables
cat("Sample:", n, "workers,", sum(treatment), "treated\n")
cat("Naive difference:", mean(earnings[treatment == 1]) -
    mean(earnings[treatment == 0]), "\n")
cat("True effect:", true_ate, "\n")

Expected output:

Sample: 3000 workers, 852 treated (28.4%)

Naive difference in means:
  Treated mean earnings:   $26,215
  Control mean earnings:   $25,348
  Naive difference:        $867
  True treatment effect:   $1,500

The naive difference (~$867) underestimates the true effect ($1,500) because treated individuals differ systematically from controls on characteristics that also affect earnings.


Step 2: Assess Covariate Imbalance (Pre-Matching)

Before matching, document how different the treated and control groups are.

# Balance table using cobalt
bal.tab(treatment ~ age + education + married + black + hispanic +
      prev_earnings, data = df, un = TRUE,
      stats = c("m", "v"), thresholds = c(m = 0.1))

# Standardized mean differences > 0.1 indicate imbalance

Expected output:

Pre-Matching Covariate Balance
----------------------------------------------------------------------
Variable            Treated    Control       Diff   Std Diff
----------------------------------------------------------------------
age                   31.25      33.82      -2.57     -0.254
education             11.78      10.72       1.06      0.345
married                0.31       0.44      -0.13     -0.262
black                  0.42       0.25       0.17      0.377
hispanic               0.12       0.09       0.03      0.102
prev_earnings      12845.20   15621.40   -2776.20     -0.183
----------------------------------------------------------------------
Standardized differences > 0.1 (10%) indicate meaningful imbalance.

Nearly all covariates show standardized differences well above the 0.10 threshold, confirming substantial pre-matching imbalance.
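
The standardized mean difference that cobalt reports can also be computed by hand. A minimal sketch with toy vectors, using one common convention (pooled SD); cobalt's denominator may differ depending on the estimand:

```r
# Standardized mean difference (SMD) by hand: difference in group
# means divided by a pooled standard deviation (toy data)
smd <- function(x, d) {
  m1 <- mean(x[d == 1])
  m0 <- mean(x[d == 0])
  s  <- sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)  # pooled SD
  (m1 - m0) / s
}

x <- c(1, 2, 3, 4, 5, 6)   # toy covariate
d <- c(1, 1, 1, 0, 0, 0)   # toy treatment indicator
smd(x, d)                  # -3: treated mean is 3 pooled SDs below control
```

Because the difference is scaled by the SD, SMDs are comparable across covariates measured in different units (years vs. dollars), which is why they are the standard balance diagnostic.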


Step 3: Estimate Propensity Scores

The propensity score is the probability of being treated given observed covariates: e(X) = P(D = 1 | X). Matching on the propensity score is equivalent to matching on all covariates simultaneously (Rosenbaum & Rubin, 1983).

# Estimate propensity scores
ps_model <- glm(treatment ~ age + education + married + black +
               hispanic + prev_earnings,
               data = df, family = binomial)
df$pscore <- predict(ps_model, type = "response")

# Overlap check
library(ggplot2)
ggplot(df, aes(x = pscore, fill = factor(treatment))) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 50) +
  scale_fill_manual(values = c("#0072B2", "#D55E00"),
                    labels = c("Control", "Treated")) +
  labs(title = "Propensity Score Overlap",
       x = "Propensity Score", fill = "Group") +
  theme_minimal()

Expected output:

Propensity Score Summary
----------------------------------------
Statistic         Treated      Control
----------------------------------------
count                 852         2148
mean                0.345        0.254
std                 0.138        0.121
min                 0.042        0.015
25%                 0.238        0.165
50%                 0.332        0.240
75%                 0.438        0.328
max                 0.785        0.742
----------------------------------------
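
Beyond the histogram, overlap can be checked numerically by trimming to the region of common support. A toy sketch with hypothetical scores (not the simulated data above):

```r
# Common-support trimming: keep units whose propensity score falls in
# [max of the two group minimums, min of the two group maximums]
ps <- c(0.05, 0.30, 0.55, 0.10, 0.40, 0.80)  # toy propensity scores
d  <- c(1,    1,    1,    0,    0,    0)     # toy treatment indicator

lo <- max(min(ps[d == 1]), min(ps[d == 0]))  # lower bound of overlap
hi <- min(max(ps[d == 1]), max(ps[d == 0]))  # upper bound of overlap
on_support <- ps >= lo & ps <= hi

sum(on_support)  # 4 of 6 units lie in the overlap region
```

Units outside the overlap region have no comparable counterparts in the other group; many analysts drop them before matching.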

Step 4: Propensity Score Matching

Match each treated unit to the control unit with the closest propensity score.

# Propensity score matching using MatchIt
m_ps <- matchit(treatment ~ age + education + married + black +
               hispanic + prev_earnings,
               data = df, method = "nearest",
               distance = "glm", replace = TRUE)

summary(m_ps)

# Estimate ATT on matched sample (weights account for matching with
# replacement: reused controls count more than once)
matched_data <- match.data(m_ps)
att_model <- lm_robust(earnings ~ treatment, data = matched_data,
                       weights = matched_data$weights, se_type = "HC2")
cat("\nATT estimate:", coef(att_model)["treatment"], "\n")
cat("True effect:", true_ate, "\n")

Expected output:

Propensity Score Matching (1:1, nearest neighbor)
ATT estimate:       $1,582
True effect:        $1,500
N treated:          852
N matched controls: 748 unique

The PSM estimate is close to the true effect. Some control units are matched to more than one treated unit (748 unique controls for 852 treated units), which is expected when matching with replacement.
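
Since this lab is about matching from scratch, it is worth seeing that 1:1 nearest-neighbor matching with replacement is only a few lines of base R. A toy sketch with hypothetical scores and outcomes (not the simulated data above):

```r
# 1:1 nearest-neighbor matching on the propensity score, with
# replacement: each treated unit is paired with the control whose
# score is closest (controls may be reused)
ps <- c(0.30, 0.60, 0.22, 0.28, 0.55, 0.70)  # toy propensity scores
y  <- c(10,   20,   8,    9,    18,   21)    # toy outcomes
d  <- c(1,    1,    0,    0,    0,    0)     # toy treatment indicator

treated  <- which(d == 1)
controls <- which(d == 0)
matches  <- sapply(treated, function(i) {
  controls[which.min(abs(ps[controls] - ps[i]))]
})

# ATT = average outcome difference across matched pairs
att <- mean(y[treated] - y[matches])
att  # 1.5
```

Removing each matched control from the pool after it is used gives matching without replacement: better for variance (no reused units), but later treated units get worse matches.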


Step 5: Assess Post-Matching Balance

The primary objective of matching is to make treated and control groups comparable on observed characteristics. After matching, it is essential to verify whether the procedure achieved adequate covariate balance.

# Post-matching balance using cobalt
bal.tab(m_ps, un = TRUE, stats = c("m", "v"), thresholds = c(m = 0.1))

# Love plot (visual balance assessment)
love.plot(m_ps, thresholds = c(m = 0.1),
          var.order = "unadjusted",
          title = "Covariate Balance Before and After Matching")

Expected output:

Post-Matching Covariate Balance
-------------------------------------------------------
Variable            Treated    Control   Std Diff
-------------------------------------------------------
age                   31.25      31.48     -0.023
education             11.78      11.72      0.020
married                0.31       0.30      0.020
black                  0.42       0.41      0.022
prev_earnings      12845.20   12918.50     -0.005
-------------------------------------------------------
All standardized differences should be below 0.1 after matching.
Pre- vs Post-Matching Standardized Differences
-------------------------------------------------------
Variable           Pre-Match   Post-Match   Improved?
-------------------------------------------------------
age                   -0.254       -0.023   Yes
education              0.345        0.020   Yes
married               -0.262        0.020   Yes
black                  0.377        0.022   Yes
prev_earnings         -0.183       -0.005   Yes
-------------------------------------------------------

All standardized differences are now well below the 0.10 threshold, indicating excellent post-matching balance.

Concept Check

After matching, you find that standardized differences are below 0.05 for all covariates. Does this mean your treatment effect estimate is unbiased?


Step 6: Alternative Matching Methods

Coarsened Exact Matching (CEM)

Instead of matching on the propensity score, CEM bins covariates into categories and matches exactly within bins.

# CEM using MatchIt
m_cem <- matchit(treatment ~ age + education + married + black +
                hispanic + prev_earnings,
                data = df, method = "cem",
                cutpoints = list(age = c(25, 35, 45),
                                 education = c(9, 12, 15)))

summary(m_cem)

# ATT estimate
cem_data <- match.data(m_cem)
att_cem <- lm_robust(earnings ~ treatment, data = cem_data,
                    weights = cem_data$weights, se_type = "HC2")
cat("CEM ATT:", coef(att_cem)["treatment"], "\n")
cat("True effect:", true_ate, "\n")

Expected output:

CEM: 2485 matched observations out of 3000 total
Matched cells: 142

CEM ATT estimate: $1,425
True effect:      $1,500

CEM discards ~515 observations that have no exact match within their coarsened stratum. The estimate is close to the true effect.
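
The coarsen-then-match-exactly logic is easy to sketch by hand with cut() and interaction(). A toy example with hypothetical values (variable names suffixed _toy to avoid clashing with the simulated data above):

```r
# From-scratch CEM sketch: coarsen covariates into bins, then keep
# only strata containing at least one treated AND one control unit
age_toy <- c(23, 37, 41, 29, 52, 33)
edu_toy <- c(10, 14, 14, 11, 16, 13)
d_toy   <- c(1,  1,  0,  0,  0,  0)

age_bin <- cut(age_toy, breaks = c(-Inf, 25, 35, 45, Inf))
edu_bin <- cut(edu_toy, breaks = c(-Inf, 9, 12, 15, Inf))
stratum <- interaction(age_bin, edu_bin, drop = TRUE)

# A stratum is "matched" if both groups appear in it
both <- tapply(d_toy, stratum, function(x) any(x == 1) && any(x == 0))
keep <- stratum %in% names(which(both))

sum(keep)  # 2: one treated-control pair shares a stratum; the rest drop
```

Real CEM additionally weights units within matched strata so treated and control counts are comparable; MatchIt's weights column handles this, which is why the ATT regression above uses cem_data$weights.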

Inverse Probability Weighting (IPW)

Instead of dropping unmatched units, IPW reweights the sample so that the weighted distribution of covariates is balanced.

# IPW for ATT
df$ipw_weight <- ifelse(df$treatment == 1, 1,
                       df$pscore / (1 - df$pscore))

# Trim extreme weights
df$ipw_weight_trimmed <- pmin(df$ipw_weight,
                             quantile(df$ipw_weight, 0.99))

# Weighted regression
m_ipw <- lm_robust(earnings ~ treatment, data = df,
                  weights = ipw_weight_trimmed, se_type = "HC2")
cat("IPW ATT:", coef(m_ipw)["treatment"], "\n")
cat("True effect:", true_ate, "\n")

Expected output:

IPW ATT estimate: $1,548
True effect:      $1,500

Max weight (untrimmed): 12.5
Max weight (trimmed):   4.2

Trimming extreme weights at the 99th percentile stabilizes the estimate while keeping it close to the true effect.
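
Stripped of the regression machinery, the IPW ATT estimator is just a pair of (weighted) means. A toy sketch with hypothetical values:

```r
# IPW for the ATT by hand: treated units get weight 1; control units
# get weight e(X) / (1 - e(X)); then compare group means
ps <- c(0.6, 0.4, 0.5, 0.2)   # toy propensity scores
y  <- c(12,  10,  9,   7)     # toy outcomes
d  <- c(1,   1,   0,   0)     # toy treatment indicator

w <- ifelse(d == 1, 1, ps / (1 - ps))
att_ipw <- mean(y[d == 1]) - weighted.mean(y[d == 0], w[d == 0])
att_ipw  # 2.4
```

The e/(1 - e) weights reweight controls to resemble the treated group's covariate distribution, which is why controls with scores near 1 can receive extreme weights and trimming is needed.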


Step 7: Compare All Methods

# Naive difference (biased due to selection)
naive <- mean(df$earnings[df$treatment == 1]) -
       mean(df$earnings[df$treatment == 0])

# OLS with controls (linear adjustment for confounders)
ols_ctrl <- lm_robust(earnings ~ treatment + age + education + married +
                     black + hispanic + prev_earnings,
                     data = df, se_type = "HC2")

# Compare all methods against the true effect
cat("=== Method Comparison ===\n")
cat(sprintf("True Effect:      %7.0f\n", true_ate))
cat(sprintf("Naive:            %7.0f\n", naive))
cat(sprintf("OLS + Controls:   %7.0f\n", coef(ols_ctrl)["treatment"]))
cat(sprintf("PSM:              %7.0f\n", coef(att_model)["treatment"]))
cat(sprintf("CEM:              %7.0f\n", coef(att_cem)["treatment"]))
cat(sprintf("IPW:              %7.0f\n", coef(m_ipw)["treatment"]))

Expected output:

=== Method Comparison ===
Method                          Estimate      Bias   Bias (%)
------------------------------------------------------------
True Effect                       $1,500        $0        0%
Naive Difference                    $867     -$633      -42%
OLS with Controls                 $1,478      -$22       -1%
PSM (1:1 NN)                      $1,582      +$82       +5%
CEM                               $1,425      -$75       -5%
IPW                               $1,548      +$48       +3%

All matching/weighting methods substantially reduce the bias compared to the naive difference. OLS, PSM, CEM, and IPW are all within ~$100 of the true effect, while the naive difference misses by ~$633.
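
This comparison is flattering because, in our simulation, selection depends only on observables. A quick sketch of the failure mode promised in the overview (illustrative values of my choosing, true effect again 1,500): when an unobserved confounder u drives both treatment and earnings, adjusting for the observables no longer recovers the truth.

```r
# When selection depends on an unobserved confounder, matching or
# regression on observables alone is biased
set.seed(1)
n2 <- 5000
x <- rnorm(n2)                          # observed covariate
u <- rnorm(n2)                          # unobserved confounder
d <- rbinom(n2, 1, plogis(0.5 * x + u)) # selection depends on u
y <- 1500 * d + 1000 * x + 2000 * u + rnorm(n2, 0, 500)

# We can only control for x; the estimate absorbs u's effect on y
coef(lm(y ~ d + x))["d"]  # far above the true 1500
```

No balance check on x can detect this, since u is never measured; this is what the sensitivity analysis in Exercise 3 probes.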


Step 8: Exercises

  1. Matching with replacement. Allow control units to be matched to multiple treated units. How does this affect the estimate and balance?

  2. Caliper matching. Set a maximum distance for matches (caliper = 0.1 standard deviations of the propensity score). How many units are dropped? How does the estimate change?

  3. Sensitivity analysis. Use the Rosenbaum bounds approach to assess how sensitive your estimate is to an unobserved confounder. At what level of unobserved confounding does the estimate become insignificant?

  4. Bad control. Add a post-treatment variable (e.g., hours worked after training) to the matching covariates. Show that this introduces bias by conditioning on a collider.


Summary

In this lab you learned:

  • Matching reduces selection bias by comparing treated units to similar control units
  • Propensity score matching, CEM, and IPW are three common approaches with different tradeoffs
  • It is important to check and report covariate balance before and after matching
  • Matching requires the "selection on observables" assumption — if unobserved confounders drive selection, matching will be biased
  • Trim extreme propensity score weights to avoid instability
  • Generally avoid matching on post-treatment variables (bad controls)
  • Multiple matching methods should give similar results if the design is sound; large discrepancies signal problems