MethodAtlas
Lab·replication·7 min read
Replication · 120 minutes

Replication Lab: Distributional Effects of Job Training

Replicate key findings from Bitler, Gelbach, and Hoynes (2006) on the distributional effects of welfare reform. Simulate experimental data matching the Jobs First program, estimate quantile treatment effects across the earnings distribution, and compare with the OLS average treatment effect.

Languages: Python, R, Stata
Dataset: Simulated to match Jobs First welfare reform experimental data

Overview

In this replication lab, you will reproduce the main results from an influential paper that demonstrated why average treatment effects can be misleading:

Bitler, Marianne P., Jonah B. Gelbach, and Hilary W. Hoynes. 2006. "What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments." American Economic Review 96(4): 988-1012.

Bitler, Gelbach, and Hoynes (BGH) examined Connecticut's Jobs First welfare reform experiment, which combined a more generous earnings disregard with a time limit on benefits, in contrast to the standard Aid to Families with Dependent Children (AFDC) program. The headline finding: while the average (mean) effect on earnings was modest and marginally significant, quantile treatment effects revealed substantial heterogeneity. The program increased earnings at low quantiles (drawing non-workers into employment) but decreased earnings at high quantiles (where the time limit and benefit structure reduced work incentives for those already earning well).

Why this paper matters: It provided a methodological template for examining treatment effect heterogeneity using quantile treatment effects (QTEs) and demonstrated that reporting only average effects can obscure policy-relevant heterogeneity.

What you will do:

  • Learn why simulation is used when administrative experimental data are unavailable
  • Simulate experimental data matching the Jobs First treatment-control structure
  • Estimate the OLS average treatment effect
  • Estimate quantile treatment effects at multiple quantiles
  • Test for heterogeneity across the distribution
  • Compare QTE patterns with the RIF regression approach

Step 1: Simulate the Jobs First Experimental Data

The original Jobs First experiment randomly assigned 6,606 welfare recipients in Connecticut to either the Jobs First program (treatment) or the standard AFDC program (control). We simulate earnings data matching the key distributional patterns.

library(quantreg)

set.seed(2006)
n <- 6606  # Match original sample size

# Random assignment to treatment (Jobs First) vs control (AFDC)
treat <- rbinom(n, 1, 0.5)
n_treat <- sum(treat)
n_control <- sum(1 - treat)

# Baseline characteristics (balanced by randomization)
age <- round(rnorm(n, 30, 6))
educ_years <- round(pmin(pmax(rnorm(n, 11, 2), 6), 16))
n_children <- rpois(n, 1.8) + 1
prior_earnings <- pmax(rnorm(n, 3000, 4000), 0)

# Earnings DGP with heterogeneous treatment effects
# Control group: mixture of zeros and log-normal
u <- runif(n)
latent_type <- cut(u, breaks = c(0, 0.35, 0.70, 1.0),
                 labels = c("non-worker", "low-earner", "high-earner"))

# Control earnings
earnings_control <- ifelse(
  latent_type == "non-worker", 0,
  ifelse(latent_type == "low-earner",
         exp(rnorm(n, 7.5, 0.8)),      # median ~$1,800
         exp(rnorm(n, 9.2, 0.6)))      # median ~$9,900
)

# Treatment effects vary by type:
# Non-workers: positive (drawn into work) ~+$1,500
# Low-earners: positive (higher disregard) ~+$800
# High-earners: negative (time limit effect) ~-$1,200
te <- ifelse(
  latent_type == "non-worker", pmax(rnorm(n, 1500, 1000), 0),
  ifelse(latent_type == "low-earner",
         rnorm(n, 800, 600),
         rnorm(n, -1200, 800))
)

# Observed earnings
earnings <- ifelse(treat == 1,
                 pmax(earnings_control + te, 0),
                 pmax(earnings_control, 0))

df <- data.frame(earnings, treat, age, educ_years,
               n_children, prior_earnings)

cat("=== Sample Summary ===\n")
cat("Treatment:", n_treat, "  Control:", n_control, "\n")
cat("\n=== Earnings by Group ===\n")
cat("Control mean:", round(mean(df$earnings[treat == 0]), 0), "\n")
cat("Treatment mean:", round(mean(df$earnings[treat == 1]), 0), "\n")
cat("Difference:", round(mean(df$earnings[treat == 1]) -
  mean(df$earnings[treat == 0]), 0), "\n")
cat("\n=== Earnings Distribution ===\n")
cat("% with zero earnings (control):",
  round(mean(df$earnings[treat == 0] == 0) * 100, 1), "\n")
cat("% with zero earnings (treatment):",
  round(mean(df$earnings[treat == 1] == 0) * 100, 1), "\n")
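
The simulation relies on random assignment to balance baseline characteristics across arms. A minimal sketch of how that balance could be checked is given here, using standardized mean differences and two-sample t-tests. The data frame `bal_df`, the helper `std_diff`, and the smaller sample size are illustrative assumptions, not part of the lab's code.

```r
# Minimal sketch: covariate balance check after random assignment.
# bal_df, std_diff, and balance_table are hypothetical names for illustration.
set.seed(1)
n <- 2000
bal_df <- data.frame(
  treat = rbinom(n, 1, 0.5),
  age = round(rnorm(n, 30, 6)),
  prior_earnings = pmax(rnorm(n, 3000, 4000), 0)
)

# Standardized mean difference: (mean_T - mean_C) / pooled SD
std_diff <- function(x, d) {
  s_pooled <- sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
  (mean(x[d == 1]) - mean(x[d == 0])) / s_pooled
}

covs <- c("age", "prior_earnings")
balance_table <- data.frame(
  covariate = covs,
  std_diff = sapply(covs, function(v) std_diff(bal_df[[v]], bal_df$treat)),
  p_value = sapply(covs, function(v) {
    x <- bal_df[[v]]
    t.test(x[bal_df$treat == 1], x[bal_df$treat == 0])$p.value
  })
)
print(balance_table, row.names = FALSE)
```

Standardized differences close to zero are what randomization should deliver on average; a large value in real data would prompt a closer look at the assignment process.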

Step 2: Estimate the OLS Average Treatment Effect

# Model 1: Simple difference in means (ATE)
m_ate <- lm(earnings ~ treat, data = df)

# Model 2: With covariates
m_ate_cov <- lm(earnings ~ treat + age + educ_years +
                n_children + prior_earnings, data = df)

cat("=== OLS Average Treatment Effect ===\n")
cat("\nNo controls:\n")
cat("  ATE:", round(coef(m_ate)["treat"], 0),
  " SE:", round(summary(m_ate)$coefficients["treat", 2], 0),
  " p:", round(summary(m_ate)$coefficients["treat", 4], 3), "\n")

cat("\nWith controls:\n")
cat("  ATE:", round(coef(m_ate_cov)["treat"], 0),
  " SE:", round(summary(m_ate_cov)$coefficients["treat", 2], 0),
  " p:", round(summary(m_ate_cov)$coefficients["treat", 4], 3), "\n")

cat("\nPublished ATE (8-quarter earnings): ~$350-550\n")
cat("Published significance: marginally significant or insignificant\n")

cat("\n=== The Problem with Averages ===\n")
cat("The ATE hides potentially important heterogeneity.\n")
cat("The treatment may help some and hurt others,\n")
cat("with effects canceling out in the mean.\n")
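
With a single binary regressor, the OLS coefficient is numerically the difference in group means. The sketch below illustrates this on hypothetical data (the data-generating process and variable names are assumptions for illustration only). It also computes an unequal-variance (Welch-type) standard error by hand, which can differ from the classical OLS standard error when the two arms have different earnings variances.

```r
# Minimal sketch: OLS on a treatment dummy reproduces the difference in means.
# The DGP below is hypothetical and unrelated to the Jobs First simulation.
set.seed(2)
n <- 1000
d <- rbinom(n, 1, 0.5)
y <- pmax(rnorm(n, mean = 5000 + 400 * d, sd = 3000 + 1500 * d), 0)

diff_means <- mean(y[d == 1]) - mean(y[d == 0])

fit <- lm(y ~ d)
ols_coef <- unname(coef(fit)["d"])
ols_se <- summary(fit)$coefficients["d", "Std. Error"]

# Unequal-variance SE for the difference in means
welch_se <- sqrt(var(y[d == 1]) / sum(d == 1) +
                 var(y[d == 0]) / sum(d == 0))

cat("Difference in means:", round(diff_means, 1), "\n")
cat("OLS coefficient:    ", round(ols_coef, 1), "\n")
cat("Classical OLS SE:   ", round(ols_se, 1), "\n")
cat("Unequal-variance SE:", round(welch_se, 1), "\n")
```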

Step 3: Estimate Quantile Treatment Effects

The key innovation of BGH is to estimate treatment effects at multiple quantiles of the earnings distribution, revealing the full pattern of heterogeneity.

# Quantile treatment effects on a grid of 13 quantiles from tau = 0.10 to 0.90
taus <- c(0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50,
        0.60, 0.70, 0.75, 0.80, 0.85, 0.90)

qte_results <- data.frame(
  tau = taus,
  qte = NA_real_, se = NA_real_,
  ci_lower = NA_real_, ci_upper = NA_real_
)

for (i in seq_along(taus)) {
  qr_fit <- rq(earnings ~ treat, tau = taus[i], data = df)
  qr_sum <- summary(qr_fit, se = "boot", R = 200)  # bootstrap SEs
  qte_results$qte[i] <- coef(qr_fit)["treat"]
  qte_results$se[i] <- qr_sum$coefficients["treat", 2]
  # Normal-approximation 95% confidence interval
  qte_results$ci_lower[i] <- qte_results$qte[i] -
    1.96 * qte_results$se[i]
  qte_results$ci_upper[i] <- qte_results$qte[i] +
    1.96 * qte_results$se[i]
}

cat("=== Quantile Treatment Effects ===\n")
cat(sprintf("%-6s %10s %8s %20s\n",
  "Tau", "QTE", "SE", "95% CI"))
cat(strrep("-", 48), "\n")
for (i in seq_len(nrow(qte_results))) {
cat(sprintf("%-6.2f %10.0f %8.0f   [%7.0f, %7.0f]\n",
    qte_results$tau[i], qte_results$qte[i],
    qte_results$se[i],
    qte_results$ci_lower[i], qte_results$ci_upper[i]))
}

cat("\nOLS ATE:", round(coef(m_ate)["treat"], 0),
  " (shown for comparison)\n")
cat("\nPattern: Positive QTEs at low quantiles,\n")
cat("         negative QTEs at high quantiles.\n")
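
Under random assignment with no covariates, the QTE at a given tau is the gap between the tau-th quantile of treatment-group earnings and the tau-th quantile of control-group earnings. The sketch below computes that gap directly with base R's `quantile()`. The data-generating process is a hypothetical one chosen so that treatment compresses the distribution; it is not the lab's simulation. Quantile regression on a treatment dummy should give very similar numbers, differing only through the quantile interpolation convention.

```r
# Minimal sketch: QTE as a difference of sample quantiles (base R only).
# Hypothetical DGP: treated earnings are 2000 + 0.6 * untreated earnings,
# so low quantiles rise and high quantiles fall.
set.seed(3)
n <- 4000
d <- rbinom(n, 1, 0.5)
y0 <- rlnorm(n, meanlog = 8.5, sdlog = 0.8)
y <- ifelse(d == 1, 2000 + 0.6 * y0, y0)

taus_demo <- c(0.10, 0.25, 0.50, 0.75, 0.90)
qte_manual <- unname(quantile(y[d == 1], taus_demo) -
                     quantile(y[d == 0], taus_demo))

print(data.frame(tau = taus_demo, qte = round(qte_manual)))
cat("Mean difference:", round(mean(y[d == 1]) - mean(y[d == 0])), "\n")
```

In this constructed example the sign of the QTE flips as tau increases, while the mean difference is a single number that cannot show the flip.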

Concept Check

BGH find that the OLS average treatment effect is near zero, but the QTE is positive at the 10th percentile and negative at the 90th percentile. Does this mean the program helped the poor and hurt the rich?


Step 4: Test for Heterogeneity

We formally test whether the treatment effects differ across quantiles. The null hypothesis is that the QTE is constant across the distribution; rejecting it is evidence of distributional heterogeneity.

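# Joint test: is the treatment coefficient equal across five quantiles?
# Fit all five quantiles in one call
sqr <- rq(earnings ~ treat, tau = c(0.10, 0.25, 0.50, 0.75, 0.90),
          data = df)
sqr_sum <- summary(sqr, se = "boot", R = 500)

# Wald test for equality of QTEs across quantiles.
# The quantreg documentation states that se = "boot" cannot be used for
# tests of slope homogeneity, so the default "nid" standard errors are used.
# The bootstrap comparison below serves as a cross-check.
anova_qr <- anova(sqr, joint = FALSE)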
cat("=== Test: Equal QTEs Across Quantiles ===\n")
print(anova_qr)

# Manual test: QTE(0.10) vs QTE(0.90)
qte_10 <- rq(earnings ~ treat, tau = 0.10, data = df)
qte_90 <- rq(earnings ~ treat, tau = 0.90, data = df)
diff_qte <- coef(qte_10)["treat"] - coef(qte_90)["treat"]

# Bootstrap the difference
set.seed(99)
boot_diff <- numeric(500)
for (b in 1:500) {
  idx <- sample(nrow(df), nrow(df), replace = TRUE)
  d_b <- df[idx, ]
  q10 <- coef(rq(earnings ~ treat, tau = 0.10, data = d_b))["treat"]
  q90 <- coef(rq(earnings ~ treat, tau = 0.90, data = d_b))["treat"]
  boot_diff[b] <- q10 - q90
}

cat("\n=== QTE(0.10) - QTE(0.90) ===\n")
cat("Difference:", round(diff_qte, 0), "\n")
cat("Bootstrap SE:", round(sd(boot_diff), 0), "\n")
cat("t-stat:", round(diff_qte / sd(boot_diff), 2), "\n")
cat("p-value:", round(2 * pnorm(-abs(diff_qte / sd(boot_diff))), 4),
  "\n")
cat("\nIf significant, reject the null of constant treatment\n")
cat("effects across the distribution.\n")
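
The test above uses a normal approximation with a bootstrap standard error. An alternative is a percentile interval taken directly from the bootstrap draws, which does not require the sampling distribution of the difference to be symmetric. The following is a minimal sketch on hypothetical data, using base R quantiles rather than `rq`; the helper `qte_gap` and the data-generating process are assumptions for illustration.

```r
# Minimal sketch: percentile bootstrap interval for QTE(0.10) - QTE(0.90).
# Hypothetical DGP in which treatment compresses the distribution.
set.seed(4)
n <- 2000
d <- rbinom(n, 1, 0.5)
y0 <- rlnorm(n, 8.5, 0.8)
y <- ifelse(d == 1, 2000 + 0.6 * y0, y0)

# Difference between the low-quantile QTE and the high-quantile QTE
qte_gap <- function(y, d, lo = 0.10, hi = 0.90) {
  q1 <- quantile(y[d == 1], c(lo, hi))
  q0 <- quantile(y[d == 0], c(lo, hi))
  unname((q1[1] - q0[1]) - (q1[2] - q0[2]))
}

B <- 500
boot_gap <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)  # resample units with replacement
  qte_gap(y[idx], d[idx])
})

point_est <- qte_gap(y, d)
ci <- quantile(boot_gap, c(0.025, 0.975))
cat("Point estimate:", round(point_est), "\n")
cat("95% percentile interval: [", round(ci[1]), ",", round(ci[2]), "]\n")
```

An interval that excludes zero points to the same conclusion as a significant t-statistic: the QTE is not constant across the two quantiles.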

Step 5: RIF Regression for Unconditional Quantile Effects

An alternative to conditional quantile regression is Recentered Influence Function (RIF) regression (Firpo et al. 2009). RIF regression estimates the effect of covariates on unconditional quantiles, which has a more intuitive interpretation for policy analysis.

# RIF regression (manual implementation)
# The RIF for quantile tau is:
# RIF(y; q_tau) = q_tau + (tau - I(y <= q_tau)) / f(q_tau)

compute_rif <- function(y, tau) {
  q_tau <- unname(quantile(y, tau))
  # Kernel density estimate evaluated at q_tau
  f_q <- density(y, from = q_tau, to = q_tau, n = 1)$y
  rif <- q_tau + (tau - as.integer(y <= q_tau)) / f_q
  return(rif)
}

cat("=== RIF Regression (Unconditional Quantile Effects) ===\n")
cat(sprintf("%-6s %12s %8s %12s %8s\n",
  "Tau", "RIF-OLS", "SE", "Cond. QR", "SE"))
cat(strrep("-", 50), "\n")

for (tau in c(0.10, 0.25, 0.50, 0.75, 0.90)) {
  # RIF-OLS
  rif_y <- compute_rif(df$earnings, tau)
  rif_fit <- lm(rif_y ~ treat + age + educ_years + n_children,
                data = df)
  rif_coef <- coef(rif_fit)["treat"]
  rif_se <- summary(rif_fit)$coefficients["treat", 2]

  # Conditional QR for comparison
  qr_fit <- rq(earnings ~ treat + age + educ_years + n_children,
               tau = tau, data = df)
  qr_coef <- coef(qr_fit)["treat"]
  qr_se <- summary(qr_fit, se = "boot", R = 200)$coefficients["treat", 2]

  cat(sprintf("%-6.2f %12.0f %8.0f %12.0f %8.0f\n",
      tau, rif_coef, rif_se, qr_coef, qr_se))
}

cat("\nRIF-OLS estimates unconditional quantile effects.\n")
cat("Conditional QR estimates conditional quantile effects.\n")
cat("They can differ when covariates shift the distribution.\n")
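
Two properties of the RIF help in reading these estimates. First, the RIF averages to the quantile it is built around. Second, with a single binary regressor, the RIF-OLS coefficient equals the between-group gap in the share of observations at or below the pooled quantile, divided by the density at that quantile. The sketch below checks both on hypothetical continuous data. The data-generating process and variable names are illustrative assumptions, and the density is taken from a default `density()` fit by interpolation.

```r
# Minimal sketch: two properties of the RIF for a quantile (base R only).
# Hypothetical continuous DGP with no mass point at zero.
set.seed(5)
n <- 5000
d <- rbinom(n, 1, 0.5)
y0 <- rlnorm(n, 8.5, 0.8)
y <- ifelse(d == 1, 2000 + 0.6 * y0, y0)

tau <- 0.25
q_tau <- unname(quantile(y, tau))
dens <- density(y)                               # default Gaussian kernel
f_q <- approx(dens$x, dens$y, xout = q_tau)$y    # density at q_tau

rif <- q_tau + (tau - (y <= q_tau)) / f_q

# Property 1: the RIF averages to the quantile
cat("q_tau:", round(q_tau, 2), "  mean(RIF):", round(mean(rif), 2), "\n")

# Property 2: RIF-OLS on a dummy = gap in P(y <= q_tau), scaled by 1 / f_q
rif_coef <- unname(coef(lm(rif ~ d))["d"])
share_gap <- mean(y[d == 0] <= q_tau) - mean(y[d == 1] <= q_tau)
cat("RIF-OLS coef:", round(rif_coef, 2),
    "  share gap / f_q:", round(share_gap / f_q, 2), "\n")
```

Because the estimate divides by a density, it is sensitive to how that density is estimated. When earnings have a spike at zero, as in the simulated data, the density at a quantile equal to zero is not well defined, so RIF estimates at low quantiles deserve extra caution.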

Concept Check

What is the key advantage of RIF regression over standard conditional quantile regression for policy evaluation?


Step 6: Compare with Published Results

cat("==========================================================\n")
cat("COMPARISON: Our Replication vs. BGH (2006)\n")
cat("==========================================================\n")
cat(sprintf("%-40s %12s %12s\n", "Finding", "Published", "Ours"))
cat("----------------------------------------------------------\n")
cat(sprintf("%-40s %12s %12.0f\n", "ATE (mean effect)",
          "~$400", coef(m_ate)["treat"]))
cat(sprintf("%-40s %12s %12s\n", "ATE significant?",
          "Marginal", ifelse(summary(m_ate)$coefficients["treat", 4] < 0.05,
                             "Yes", "No/Marginal")))
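# "Ours" entries are computed from the Step 3 and Step 4 objects
# (qte_results, diff_qte, boot_diff) rather than hard-coded
yes_no <- function(x) ifelse(x, "Yes", "No")
qte_at <- function(t) qte_results$qte[which.min(abs(qte_results$tau - t))]
p_het <- 2 * pnorm(-abs(diff_qte / sd(boot_diff)))
cat(sprintf("%-40s %12s %12s\n", "QTE(0.10) positive?", "Yes",
          yes_no(qte_at(0.10) > 0)))
cat(sprintf("%-40s %12s %12s\n", "QTE(0.90) negative?", "Yes",
          yes_no(qte_at(0.90) < 0)))
cat(sprintf("%-40s %12s %12s\n", "Heterogeneity significant?",
          "Yes", yes_no(p_het < 0.05)))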
cat("----------------------------------------------------------\n")
cat("\nQualitative conclusions confirmed:\n")
cat("1. Small/insignificant ATE masks important heterogeneity\n")
cat("2. Zero effect at the very bottom (below tau = 0.10), positive at low-to-middle quantiles (employment entry)\n")
cat("3. Negative effects at the top (reduced work incentives)\n")

Error Detective

Read the analysis below carefully and identify the errors.

A researcher evaluates a job training program using experimental data (N = 2,000, randomly assigned). They estimate quantile treatment effects at tau = 0.25, 0.50, and 0.75 and find:

QTE(0.25) = $800 (p = 0.02), QTE(0.50) = $200 (p = 0.45), QTE(0.75) = -$500 (p = 0.08)

They interpret: "The program increases earnings by $800 for workers in the bottom quartile of skills, has no effect on median workers, and reduces earnings by $500 for workers in the top quartile. This shows the program helps low-skilled workers but hurts high-skilled workers. We recommend targeting the program to the bottom quartile."



Summary

Our replication confirms the central message of Bitler et al. (2006):

  1. Mean impacts miss important heterogeneity. The average treatment effect of the Jobs First program is small and marginally significant, yet the program had near-zero effects at the very bottom of the earnings distribution (where both groups have zero earnings), positive effects in the middle quantiles (drawing non-workers into employment), and negative effects at the top (where time limits reduce work incentives).

  2. QTEs reveal the full picture. The pattern of zero effects at the bottom, positive effects in the middle, and negative effects at the top is consistent with the program design: generous earnings disregards drew non-workers into employment, while time limits reduced work incentives for those already earning well.

  3. Interpretation requires care. QTEs describe how the treatment shifts the shape of the distribution. Without the rank invariance assumption, they cannot be interpreted as effects on identifiable individuals or subgroups.

  4. RIF regression provides unconditional effects. For policy purposes, unconditional quantile partial effects (via RIF regression) are often more directly relevant than conditional quantile effects.
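
The point about rank invariance in item 3 can be made concrete with a small constructed example. Suppose, hypothetically, that both potential outcomes were observable for five people and that treatment exactly reverses their ranks. The two marginal distributions are then identical, so every QTE is zero, even though four of the five individuals are affected. The vectors below are invented for illustration; in real data only one potential outcome per person is observed.

```r
# Minimal sketch: QTEs are not individual effects without rank invariance.
# Hypothetical potential outcomes for five people (never jointly observed
# in practice).
y0 <- c(0, 1000, 2000, 3000, 4000)   # earnings without treatment
y1 <- rev(y0)                        # treatment reverses everyone's rank

ind_effect <- y1 - y0                # 4000, 2000, 0, -2000, -4000
taus_demo <- c(0.10, 0.25, 0.50, 0.75, 0.90)
qte_demo <- unname(quantile(y1, taus_demo) - quantile(y0, taus_demo))

cat("Individual effects:", ind_effect, "\n")
cat("Mean effect:", mean(ind_effect), "\n")
cat("QTEs:", qte_demo, "\n")
```

The QTEs compare the two distributions and say nothing about who moved where. Reading a QTE at the 10th percentile as "the effect on low earners" requires assuming that people keep their rank under treatment.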


Extension Exercises

  1. Conditional quantile treatment effects. Estimate QTEs separately for subgroups defined by prior earnings or education. Does the distributional pattern differ?

  2. Distributional decomposition. Use the Firpo et al. (2009) decomposition to separate the composition effect from the structural effect.

  3. Counterfactual distributions. Construct the entire counterfactual earnings distribution under no treatment and compare with the observed treatment distribution.

  4. Causal forests for heterogeneity. Use a causal forest (Athey and Imbens 2019) to identify which observable characteristics predict treatment effect heterogeneity. Compare with the QTE approach.

  5. Power analysis. Given the estimated QTEs and their standard errors, compute the minimum detectable effect at each quantile for a study with N = 3,000 vs. N = 10,000.
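
As a starting point for the power analysis exercise, one common approximation is sketched here. It treats the minimum detectable effect (MDE) as a multiple of the standard error, with the multiplier set by the test size and target power, and assumes the standard error shrinks in proportion to 1/sqrt(N). The standard errors below are placeholder values, not results from this lab, and the helpers `mde` and `rescale_se` are hypothetical names. A normal sampling distribution and an unchanged treatment-control split are assumed.

```r
# Minimal sketch: minimum detectable effect from a standard error.
# MDE = (z_{1 - alpha/2} + z_{power}) * SE, under a normal approximation.
mde <- function(se, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * se
}

# Rescale an SE from one sample size to another, assuming SE ~ 1 / sqrt(N)
rescale_se <- function(se, n_old, n_new) se * sqrt(n_old / n_new)

# Placeholder SEs at three quantiles (hypothetical values for illustration)
se_pilot <- c("0.10" = 150, "0.50" = 250, "0.90" = 600)
n_pilot <- 6606

mde_table <- data.frame(
  tau = names(se_pilot),
  mde_n3000 = round(mde(rescale_se(se_pilot, n_pilot, 3000))),
  mde_n10000 = round(mde(rescale_se(se_pilot, n_pilot, 10000)))
)
print(mde_table, row.names = FALSE)
```

With 5% two-sided size and 80% power the multiplier is about 2.8, so quantiles with large standard errors (typically the upper tail of earnings) need much larger samples to detect a given effect.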