MethodAtlas
Lab · Replication · 120 minutes

Replication Lab: Returns to College with Marginal Treatment Effects

Replicate the MTE analysis from Carneiro, Heckman, and Vytlacil (2011). Estimate a probit first stage, trace the parametric MTE curve, test for essential heterogeneity, compare ATE/ATT/LATE, and compute the policy-relevant treatment effect (PRTE) for a college expansion.

Languages: Python, R, Stata
Dataset: Simulated NLSY-style data with college proximity instrument

Overview

In this replication lab, you will reproduce the key results from:

Carneiro, Pedro, James J. Heckman, and Edward J. Vytlacil. 2011. "Estimating Marginal Returns to Education." American Economic Review 101(6): 2754–2781. DOI: 10.1257/aer.101.6.2754

Carneiro et al. (2011) use the MTE framework to estimate how returns to college vary across individuals. Using college proximity as an instrument, they find a declining MTE curve — individuals most likely to attend college benefit the most — and demonstrate that conventional treatment effect parameters (ATE, ATT, LATE) differ substantially due to essential heterogeneity.

Why this paper matters: It provided the first empirical implementation of the MTE framework for estimating returns to education, showing that standard IV estimates (LATE) do not generalize to the population. The paper demonstrated that policy-relevant treatment effects (PRTE) for college expansions are lower than LATE, because the marginal students induced to attend by a new policy have lower returns than current compliers.

What you will do:

  • Simulate NLSY-style data with college proximity as an instrument
  • Estimate a probit first stage for college attendance
  • Compute the parametric MTE curve
  • Test for essential heterogeneity
  • Compare ATE, ATT, and LATE
  • Compute the PRTE for a hypothetical college expansion
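Before simulating data, it helps to fix the parametric logic the lab relies on. With a quadratic control function K(P) = b1·P + b2·P², the MTE is its derivative, MTE(u) = b1 + 2·b2·u, and the ATE integrates the MTE over u in [0, 1], giving b1 + b2 in closed form. A minimal Python sketch (the coefficients b1 and b2 are illustrative, not estimates from any data):

```python
import numpy as np

# Illustrative coefficients for K(P) = b1*P + b2*P^2 (assumed, not estimated)
b1, b2 = 0.60, -0.20

def mte(u):
    # MTE(u) = K'(u) = b1 + 2*b2*u
    return b1 + 2 * b2 * u

# ATE = integral of MTE(u) over [0, 1]; closed form is b1 + b2
u = np.linspace(0.0, 1.0, 101)
ate_grid = mte(u).mean()  # exact here because MTE is linear in u
print(round(ate_grid, 4), round(b1 + b2, 4))
```

The same derivative trick is what Step 3 implements with lm(): estimate K(P), then differentiate.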

Step 1: Simulate NLSY-Style Data

library(MASS)

set.seed(2011)
n <- 12000

# Background covariates (matching NLSY structure)
# AFQT score (cognitive ability)
afqt <- rnorm(n)
# Mother's education (years)
mom_educ <- pmin(pmax(round(12 + 2 * rnorm(n)), 6), 20)
# Family income (log)
log_fam_inc <- 10 + 0.3 * afqt + 0.1 * mom_educ + rnorm(n, 0, 0.5)
# Number of siblings
n_siblings <- rpois(n, 2)
# Urban residence
urban <- rbinom(n, 1, 0.7)

# Instrument: indicator for a four-year college nearby
# (binary: 1 = college in county)
# Exogenous after conditioning on covariates
college_nearby <- rbinom(n, 1, 0.6 + 0.1 * urban - 0.05 * n_siblings)

# Unobserved heterogeneity
# (U_S, U_1, U_0): U_S is resistance to college attendance.
# Cov(U_S, U_1 - U_0) < 0, so low-resistance individuals have the
# largest gains (selection on gains), generating a declining MTE
Sigma <- matrix(c( 1.0, -0.5, -0.2,
                  -0.5,  1.0,  0.3,
                  -0.2,  0.3,  1.0), 3, 3)
U <- mvrnorm(n, mu = c(0, 0, 0), Sigma = Sigma)
U_S <- U[, 1]  # Unobserved in selection equation
U_1 <- U[, 2]  # Unobserved in treated outcome
U_0 <- U[, 3]  # Unobserved in untreated outcome

# Selection equation (probit model)
# College attendance decision
gamma_S <- c(-1.9, 0.40, 0.05, 0.10, -0.05, 0.15, 0.50)
# Intercept, AFQT, mom_educ, log_fam_inc, n_siblings, urban, college_nearby
# (the intercept centers the index so the attendance rate is ~0.55)
latent_S <- gamma_S[1] + gamma_S[2]*afqt + gamma_S[3]*mom_educ +
  gamma_S[4]*log_fam_inc + gamma_S[5]*n_siblings +
  gamma_S[6]*urban + gamma_S[7]*college_nearby - U_S

D <- as.integer(latent_S > 0)

# Outcome equations (log hourly wages at age 30)
# Y(0): earnings without college
# Y(1): earnings with college
alpha_0 <- c(2.0, 0.15, 0.02, 0.10, -0.02, 0.05)
alpha_1 <- c(2.3, 0.20, 0.03, 0.08, -0.01, 0.08)

Y0 <- alpha_0[1] + alpha_0[2]*afqt + alpha_0[3]*mom_educ +
  alpha_0[4]*log_fam_inc + alpha_0[5]*n_siblings +
  alpha_0[6]*urban + 0.3*U_0

Y1 <- alpha_1[1] + alpha_1[2]*afqt + alpha_1[3]*mom_educ +
  alpha_1[4]*log_fam_inc + alpha_1[5]*n_siblings +
  alpha_1[6]*urban + 0.3*U_1

# Individual treatment effect
beta_i <- Y1 - Y0

# Observed outcome
Y <- D * Y1 + (1 - D) * Y0

df <- data.frame(Y, D, afqt, mom_educ, log_fam_inc, n_siblings,
               urban, college_nearby, beta_i)

cat("=== NLSY-Style Data Summary ===\n")
cat("N:", n, "\n")
cat("College attendance rate:", round(mean(D), 3), "\n")
cat("Mean log wage:", round(mean(Y), 3), "\n\n")

cat("True ATE:", round(mean(beta_i), 3), "\n")
cat("True ATT:", round(mean(beta_i[D == 1]), 3), "\n")
cat("True ATU:", round(mean(beta_i[D == 0]), 3), "\n\n")

cat("Published estimates (CHV 2011):\n")
cat("  ATE: ~0.04-0.07\n")
cat("  ATT: ~0.13-0.15\n")
cat("  LATE: ~0.09-0.12\n")
Requires: MASS

Expected output:

Statistic                  Value
N                          12,000
College attendance rate    ~0.55
Mean log wage              ~3.1
True ATE                   ~0.35–0.45
True ATT                   ~0.45–0.55
True ATU                   ~0.25–0.35
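If you prefer to sanity-check the selection-on-gains mechanism in Python before running the full R simulation, a stripped-down Roy model is enough. All parameters below are assumed for illustration (this is not the lab's DGP): resistance U_S is negatively correlated with the gain U_1 − U_0, so people who attend (low U_S) have above-average returns.

```python
import numpy as np

rng = np.random.default_rng(2011)
n = 200_000

# Assumed covariance: Cov(U_S, U_1 - U_0) = -0.5 - (-0.2) = -0.3 < 0
cov = np.array([[ 1.0, -0.5, -0.2],
                [-0.5,  1.0,  0.3],
                [-0.2,  0.3,  1.0]])
U_S, U_1, U_0 = rng.multivariate_normal(np.zeros(3), cov, size=n).T

Z = rng.binomial(1, 0.5, n)                # binary instrument
D = (0.1 + 0.5 * Z - U_S > 0).astype(int)  # attend iff resistance is low
beta = 0.30 + 0.3 * (U_1 - U_0)            # individual return

att = beta[D == 1].mean()  # pulled above 0.30 by selection on gains
atu = beta[D == 0].mean()  # pulled below 0.30
print(round(att, 2), round(atu, 2))
```

The ATT > ATU gap here comes entirely from unobservables; in the full simulation the covariates add to it.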

Step 2: Probit First Stage

# Probit model: college attendance on covariates + instrument
probit <- glm(D ~ afqt + mom_educ + log_fam_inc + n_siblings +
              urban + college_nearby,
            data = df, family = binomial(link = "probit"))

cat("=== Probit First Stage ===\n")
print(summary(probit)$coefficients)

df$phat <- predict(probit, type = "response")

cat("\nPropensity score summary:\n")
cat("  Min:", round(min(df$phat), 3), "\n")
cat("  Max:", round(max(df$phat), 3), "\n")
cat("  Mean:", round(mean(df$phat), 3), "\n")
cat("  Median:", round(median(df$phat), 3), "\n\n")

# Instrument relevance
cat("Instrument (college_nearby):\n")
cat("  Coefficient:", round(coef(probit)["college_nearby"], 4), "\n")
cat("  z-statistic:", round(summary(probit)$coefficients["college_nearby", "z value"], 2), "\n")
cat("  (Strong instrument: |z| >> 2)\n\n")

# Published first stage
cat("Published: College proximity increases attendance by ~12-15 pp\n")
# Average marginal effect (mean of the per-observation marginal effects)
marginal_eff <- dnorm(predict(probit, type = "link")) * coef(probit)["college_nearby"]
cat("Our average marginal effect:", round(mean(marginal_eff), 3), "\n")

Expected output:

Variable         Coefficient   SE      z-stat
afqt             ~0.35         ~0.02   ~17
mom_educ         ~0.04         ~0.01   ~4
log_fam_inc      ~0.08         ~0.03   ~3
n_siblings       ~-0.04        ~0.01   ~-3
urban            ~0.12         ~0.04   ~3
college_nearby   ~0.42         ~0.03   ~14

Statistic                      Value
Propensity score range         [~0.05, ~0.95]
Marginal effect of proximity   ~0.12–0.16
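The probit marginal-effect arithmetic is easy to verify by hand: the effect of the instrument on attendance is φ(x'γ)·γ_z per observation, averaged over the sample. A numpy sketch with assumed values (the index distribution and the 0.42 coefficient are stand-ins, not fitted quantities):

```python
import numpy as np

rng = np.random.default_rng(0)

gamma_z = 0.42                          # assumed instrument coefficient
xb = rng.normal(0.1, 0.8, size=10_000)  # stand-in for the fitted index x'gamma

phi = np.exp(-xb**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
ame = (phi * gamma_z).mean()                   # average marginal effect
print(round(ame, 3))
```

Because φ(·) peaks at zero, the marginal effect is largest for individuals on the margin of attending, exactly the group MTE analysis cares about.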

Step 3: Parametric MTE Estimation

# Parametric MTE: polynomial in P(Z)
# E[Y | X, P] = X'alpha + K(P)
# MTE(u | X) = K'(u)

df$phat2 <- df$phat^2
df$phat3 <- df$phat^3

# Quadratic specification (main)
mte_quad <- lm(Y ~ afqt + mom_educ + log_fam_inc + n_siblings +
               urban + phat + phat2, data = df)

cat("=== MTE Regression (Quadratic) ===\n")
cat("K(P) = ", round(coef(mte_quad)["phat"], 4), "* P + ",
  round(coef(mte_quad)["phat2"], 4), "* P^2\n\n")

# MTE(u) = dK/dp = beta1 + 2*beta2*u
b1 <- coef(mte_quad)["phat"]
b2 <- coef(mte_quad)["phat2"]

u_grid <- seq(0.05, 0.95, by = 0.05)
mte_vals <- b1 + 2 * b2 * u_grid

cat("=== Estimated MTE Curve ===\n")
cat(sprintf("%-8s %-12s\n", "u_D", "MTE(u)"))
for (i in seq_along(u_grid)) {
  cat(sprintf("%-8.2f %-12.4f\n", u_grid[i], mte_vals[i]))
}

# Compare polynomial orders
cat("\n=== Sensitivity to Polynomial Order ===\n")
for (order in 1:3) {
  formula <- "Y ~ afqt + mom_educ + log_fam_inc + n_siblings + urban"
  for (j in 1:order) {
    df[[paste0("p", j)]] <- df$phat^j
    formula <- paste0(formula, " + p", j)
  }
  m <- lm(as.formula(formula), data = df)

  # MTE at u = 0.5
  mte_mid <- 0
  for (j in 1:order) {
    mte_mid <- mte_mid + j * coef(m)[paste0("p", j)] * 0.5^(j - 1)
  }
  cat(sprintf("Order %d: MTE(0.5) = %.4f\n", order, mte_mid))
}

Expected output:

u_D    MTE (estimated)
0.10   ~0.55
0.30   ~0.45
0.50   ~0.38
0.70   ~0.30
0.90   ~0.22

Polynomial order   MTE(0.5)
Linear             ~0.38
Quadratic          ~0.38
Cubic              ~0.38

The MTE curve declines from approximately 0.55 at u = 0.10 to approximately 0.22 at u = 0.90, matching the qualitative pattern in the published paper. The MTE at u = 0.50 is stable across polynomial orders.
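The local-IV idea behind Step 3 (the MTE at u is the slope of E[Y | P = p] at p = u) can be reproduced in a few lines of Python. The "true" K(p) below is assumed for illustration; np.polyfit plays the role of the lm() fit and np.polyder does the differentiation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed truth: K(p) = 0.6*p - 0.2*p^2, so MTE(u) = 0.6 - 0.4*u
p = rng.uniform(0.05, 0.95, 20_000)
y = 0.6 * p - 0.2 * p**2 + rng.normal(0, 0.02, p.size)

k_hat = np.polyfit(p, y, deg=2)  # fit K(p) as a quadratic
mte_hat = np.polyder(k_hat)      # differentiate: MTE(u) = K'(u)

mte_low = np.polyval(mte_hat, 0.1)   # true value is 0.56
mte_high = np.polyval(mte_hat, 0.9)  # true value is 0.24
print(round(mte_low, 2), round(mte_high, 2))
```

The recovered curve declines in u, mirroring the pattern the lab's quadratic specification produces.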


Step 4: Test for Essential Heterogeneity

# F-test: is the coefficient on P^2 significant?
cat("=== Essential Heterogeneity Test ===\n")
cat("H0: MTE is constant (no essential heterogeneity)\n\n")

# Restricted model (linear in P)
restricted <- lm(Y ~ afqt + mom_educ + log_fam_inc + n_siblings +
                 urban + phat, data = df)
# Unrestricted model (quadratic in P)
unrestricted <- lm(Y ~ afqt + mom_educ + log_fam_inc + n_siblings +
                   urban + phat + phat2, data = df)

f_test <- anova(restricted, unrestricted)
cat("F-statistic:", round(f_test$F[2], 2), "\n")
cat("p-value:", round(f_test$"Pr(>F)"[2], 6), "\n")
cat("Decision:", ifelse(f_test$"Pr(>F)"[2] < 0.05,
  "REJECT H0: Essential heterogeneity is present",
  "Fail to reject H0"), "\n\n")

# Published: "We reject the hypothesis of no essential heterogeneity"
cat("Published result: Reject at 5% level\n")
cat("Implication: LATE != ATE, different IVs give different LATEs\n")

Expected output:

Test           F-statistic   p-value   Published
H0: flat MTE   ~10–40        < 0.001   Reject at 5%

The essential heterogeneity test strongly rejects the null, confirming that the MTE is non-flat and that conventional treatment effect parameters differ.
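The anova() call is simply the standard F-test for nested OLS models. To see the arithmetic, the statistic is F = [(RSS_r − RSS_u)/q] / [RSS_u/(n − k)], with q restrictions and k parameters in the unrestricted model. The RSS values below are made up for illustration:

```python
# Assumed residual sums of squares for the restricted and unrestricted models
rss_r, rss_u = 1104.0, 1100.0
n, k, q = 12_000, 8, 1  # observations, unrestricted params, restrictions

f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print(round(f_stat, 1))  # -> 43.6
```

With one restriction (the coefficient on P²), F is the square of the t-statistic on phat2 in the unrestricted regression.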

Concept Check

The essential heterogeneity test rejects (p < 0.001). What are the practical implications for a policy evaluation of college subsidies?


Step 5: ATE, ATT, LATE Comparison

# ATE: uniform weights over [0, 1]
# For quadratic MTE: ATE = beta1 + beta2
ate <- b1 + b2

# ATT: overweights low u (eager participants)
u_fine <- seq(0.001, 0.999, length.out = 1000)
mte_fine <- b1 + 2 * b2 * u_fine
p_vals <- df$phat

att_w <- sapply(u_fine, function(u) mean(p_vals > u)) / mean(p_vals)
att <- sum(mte_fine * att_w) / sum(att_w)

# ATU: overweights high u (reluctant non-participants)
atu_w <- sapply(u_fine, function(u) mean(p_vals <= u)) / mean(1 - p_vals)
atu <- sum(mte_fine * atu_w) / sum(atu_w)

# LATE: average MTE over complier region for proximity IV
p_no_prox <- mean(df$phat[df$college_nearby == 0])
p_prox <- mean(df$phat[df$college_nearby == 1])
late_u <- seq(p_no_prox, p_prox, length.out = 200)
late_mte <- b1 + 2 * b2 * late_u
late <- mean(late_mte)

cat("=== Treatment Effect Comparison ===\n")
cat(sprintf("%-25s %-12s %-12s\n", "Parameter", "Estimated", "Published"))
cat(sprintf("%-25s %-12.3f %-12s\n", "ATE", ate, "~0.04-0.07"))
cat(sprintf("%-25s %-12.3f %-12s\n", "ATT", att, "~0.13-0.15"))
cat(sprintf("%-25s %-12.3f %-12s\n", "ATU", atu, "~0.01-0.03"))
cat(sprintf("%-25s %-12.3f %-12s\n", "LATE (proximity)", late, "~0.09-0.12"))
cat("\nNote: Our simulation has larger effect sizes than the published\n")
cat("paper. The qualitative ordering ATT > LATE > ATE > ATU is the\n")
cat("key finding to replicate.\n\n")

# True values from the DGP
cat("=== True Values (known from DGP) ===\n")
cat("True ATE:", round(mean(df$beta_i), 3), "\n")
cat("True ATT:", round(mean(df$beta_i[df$D == 1]), 3), "\n")
cat("True ATU:", round(mean(df$beta_i[df$D == 0]), 3), "\n")

Expected output:

Parameter          Estimated   Published (CHV 2011)
ATE                ~0.40       ~0.04–0.07
ATT                ~0.50       ~0.13–0.15
ATU                ~0.30       ~0.01–0.03
LATE (proximity)   ~0.43       ~0.09–0.12

The qualitative ordering ATT > LATE > ATE > ATU is the key finding to replicate. The absolute magnitudes differ because our simulation uses a stylized DGP with larger treatment effects; the published results use real NLSY data with more covariates and a noisier outcome.
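The weighting logic behind this ordering can be checked in miniature. For any declining MTE, ATT weights (proportional to P(P > u)) pile mass on low u while ATU weights pile mass on high u; the MTE curve and propensity-score distribution below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

b1, b2 = 0.60, -0.10                # assumed declining MTE(u) = b1 + 2*b2*u
u = np.linspace(0.001, 0.999, 1000)
mte = b1 + 2 * b2 * u
phat = rng.beta(2, 2, 50_000)       # stand-in propensity score distribution

w_att = np.array([(phat > ui).mean() for ui in u])   # overweights low u
w_atu = np.array([(phat <= ui).mean() for ui in u])  # overweights high u

ate = mte.mean()                             # uniform weights over [0, 1]
att = (mte * w_att).sum() / w_att.sum()
atu = (mte * w_atu).sum() / w_atu.sum()
print(att > ate > atu)  # -> True
```

Swapping in an increasing MTE flips the ordering, which is why the sign of the P² coefficient matters so much in Step 3.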


Step 6: Policy-Relevant Treatment Effect (PRTE)

The PRTE answers: "What would be the average return to college for students induced to attend by a specific policy change?"

# PRTE: Effect of a hypothetical college expansion
# Suppose a new tuition subsidy increases the propensity score by 0.10
# for everyone (capped at 1)

df$phat_new <- pmin(df$phat + 0.10, 1)

# Change in treatment probability under the policy
# (use the propensity scores directly rather than a noisy simulated draw)
delta_D <- mean(df$phat_new) - mean(df$phat)

# PRTE weights: for MTE at u, weight = [P(phat_new > u) - P(phat > u)] / delta_D
prte_weights <- sapply(u_fine, function(u) {
  (mean(df$phat_new > u) - mean(df$phat > u)) / delta_D
})

prte <- sum(mte_fine * prte_weights) / sum(prte_weights)

# Alternatively, PRTE = avg MTE over new compliers
# These are individuals with phat < u < phat_new
cat("=== Policy-Relevant Treatment Effect ===\n")
cat("Policy: Tuition subsidy increasing P(Z) by 0.10 for all\n\n")
cat("PRTE:", round(prte, 3), "\n")
cat("ATT:", round(att, 3), "\n")
cat("LATE:", round(late, 3), "\n")
cat("ATE:", round(ate, 3), "\n\n")

cat("The PRTE (", round(prte, 3), ") is LOWER than LATE (",
  round(late, 3), ")\n")
cat("because the subsidy targets the margin of students who are\n")
cat("not currently attending — these marginal students have lower\n")
cat("returns than current compliers.\n\n")

cat("Published finding: PRTE for tuition subsidies is lower than LATE,\n")
cat("implying diminishing returns to expanding college access.\n")

# Compare PRTE for different policy sizes
cat("\n=== PRTE by Policy Size ===\n")
for (shift in c(0.05, 0.10, 0.15, 0.20)) {
  p_new <- pmin(df$phat + shift, 1)
  delta <- mean(p_new) - mean(df$phat)
  # Average MTE over the newly treated margin
  prte_w <- sapply(u_fine, function(u) {
    (mean(p_new > u) - mean(df$phat > u)) / delta
  })
  prte_s <- sum(mte_fine * prte_w) / sum(prte_w)
  cat(sprintf("Shift = +%.2f: PRTE = %.3f\n", shift, prte_s))
}
cat("(Larger expansions have lower PRTE: diminishing returns)\n")

Expected output:

Parameter              Estimate
PRTE (subsidy +0.10)   ~0.35
LATE (proximity IV)    ~0.43
ATT                    ~0.50
ATE                    ~0.40

Policy size (P shift)   PRTE
+0.05                   ~0.38
+0.10                   ~0.35
+0.15                   ~0.33
+0.20                   ~0.31

The PRTE is lower than LATE because the subsidy targets students on the margin of college attendance — those with higher unobserved resistance and lower returns. Larger expansions target increasingly reluctant students with even lower returns, producing diminishing marginal policy returns.
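The diminishing-returns pattern follows mechanically from the PRTE weights, which put mass exactly on the band of u each individual crosses under the policy. A Python sketch with an assumed MTE curve and propensity-score distribution (illustrative values, not the lab's estimates):

```python
import numpy as np

rng = np.random.default_rng(4)

b1, b2 = 0.60, -0.10                # assumed declining MTE(u) = b1 + 2*b2*u
u = np.linspace(0.001, 0.999, 1000)
mte = b1 + 2 * b2 * u
phat = rng.beta(2, 2, 50_000)       # stand-in propensity scores

def prte(shift):
    # Weight at u: change in P(propensity > u) induced by the policy
    p_new = np.minimum(phat + shift, 1.0)
    w = np.array([(p_new > ui).mean() - (phat > ui).mean() for ui in u])
    return (mte * w).sum() / w.sum()

print(round(prte(0.05), 3), round(prte(0.20), 3))
```

A larger shift widens each person's complier band toward higher u, so a declining MTE guarantees the bigger expansion has the lower PRTE.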

Concept Check

The PRTE declines from 0.38 to 0.31 as the subsidy becomes more generous. What does this imply about the marginal value of expanding college access?


Step 7: Error Detective


Read the analysis below carefully and identify the errors.

An education economist uses MTE to evaluate a proposed expansion of community college access. She estimates the propensity score using a probit with 15 covariates and a "presence of community college in county" instrument. She reports:

"The propensity score ranges from 0.35 to 0.65. We estimate a quadratic MTE curve. The MTE at u_D = 0.35 is 0.28, and at u_D = 0.65 it is 0.22. The essential heterogeneity test is insignificant (F = 1.8, p = 0.18). We compute ATE = 0.25, ATT = 0.27, LATE = 0.26, and PRTE = 0.19 for a nationwide community college expansion that would shift propensity scores by 0.20.

We conclude that the expansion would generate substantial returns (PRTE = 0.19) and recommend proceeding."

She does not discuss the limited propensity score support or the implications of the PRTE relying on extrapolation.

Select all errors you can find:


Summary

Our replication confirms the central findings of Carneiro et al. (2011):

  1. The MTE curve declines with unobserved resistance. Individuals who are most likely to attend college (low u_D) have the highest returns. This pattern is consistent with the Roy model of comparative advantage.

  2. Essential heterogeneity is present. The test strongly rejects a flat MTE, confirming that ATE, ATT, and LATE differ and that standard IV should not be naively extrapolated.

  3. ATT exceeds LATE, which exceeds ATE. This ordering follows from the declining MTE and the different weight functions. ATT overweights eager participants (high MTE), while ATE weights uniformly.

  4. The PRTE for college expansions is below LATE. Marginal students induced to attend by subsidies have lower returns than current compliers. Larger expansions produce diminishing marginal returns.

  5. The MTE framework provides a principled basis for policy extrapolation — something standard IV cannot do when treatment effects are heterogeneous.


Extension Exercises

  1. Semiparametric MTE. Instead of a polynomial in P, estimate the MTE using local polynomial smoothing of the relationship between Y and P(Z). Compare the semiparametric MTE curve to the parametric one.

  2. Bounds approach. Use the Mogstad et al. (2018) bounds approach to compute partial identification bounds on ATE when the propensity score support is limited.

  3. Multiple instruments. Add a second instrument (tuition level) to the simulation. Show that different instruments produce different LATE estimates when MTE is non-flat.

  4. Continuous instrument. Replace the binary proximity indicator with a continuous distance-to-college measure. How does a continuous instrument affect the propensity score range and MTE identification?

  5. Normal selection model. Estimate the MTE using the Heckman and Vytlacil (2005) normal selection model instead of the polynomial approach. Compare the two MTE curves.