Lab: Heckman Selection Model from Scratch
Implement the Heckman two-step correction for sample selection bias. Simulate a selection model with correlated errors, estimate the probit selection equation, compute the inverse Mills ratio, and compare corrected estimates with naive OLS.
Overview
Sample selection bias arises whenever we observe the outcome variable only for a non-random subsample. The classic example: we observe wages only for women who choose to work. If the factors driving the participation decision are correlated with unobserved determinants of wages, OLS on the selected sample is biased. Heckman's (1979) two-step estimator corrects for this bias using information from the selection equation.
What you will learn:
- Why OLS on a selected sample is biased and in which direction
- How to specify and estimate the probit selection equation
- How to compute the inverse Mills ratio (IMR) and what it represents
- How to run the Heckman two-step correction and interpret lambda
- Why the exclusion restriction matters for identification
- How to compare corrected estimates with naive OLS
Prerequisites: OLS regression, basic understanding of probit models and maximum likelihood.
Step 1: The Selection Problem
We want to estimate the wage equation for women. The problem: wages are observed only for women who participate in the labor force. If unobserved factors (e.g., motivation, household preferences) affect both the participation decision and wages, OLS on the working subsample is biased.
library(sampleSelection)
library(MASS)
set.seed(42)
n <- 5000
# Covariates
educ <- round(rnorm(n, mean = 12, sd = 3))
exper <- pmax(round(rnorm(n, mean = 15, sd = 8)), 0)
age <- 25 + round(exper * 0.8 + rnorm(n, sd = 3))
nchild <- rpois(n, lambda = 1.5) # Exclusion restriction variable
husb_inc <- pmax(rnorm(n, mean = 30, sd = 15), 0) # Exclusion restriction variable
# Correlated errors: rho = 0.6 (positive selection)
rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
errors <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
u_select <- errors[, 1] # Selection equation error
u_wage <- errors[, 2] # Wage equation error
# Selection equation (latent): participate if z*gamma + u_s > 0
z_gamma <- -1.5 + 0.15 * educ + 0.02 * exper - 0.3 * nchild - 0.02 * husb_inc
participate <- as.integer(z_gamma + u_select > 0)
cat("Participation rate:", mean(participate), "\n")
# Wage equation (observed only for participants)
true_beta_educ <- 0.08
true_beta_exper <- 0.03
log_wage <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * u_wage
log_wage_obs <- ifelse(participate == 1, log_wage, NA)
df <- data.frame(log_wage = log_wage_obs, educ, exper, age,
nchild, husb_inc, participate)
# Naive OLS on workers only (biased due to selection)
ols_biased <- lm(log_wage ~ educ + exper, data = df, subset = participate == 1)
cat("\nNaive OLS (selected sample):\n")
cat(" educ coeff:", coef(ols_biased)["educ"], "(true:", true_beta_educ, ")\n")
cat(" exper coeff:", coef(ols_biased)["exper"], "(true:", true_beta_exper, ")\n")

Expected output:
| Estimator | educ coefficient | exper coefficient | True educ | True exper |
|---|---|---|---|---|
| Naive OLS | ~0.065 | ~0.025 | 0.080 | 0.030 |
The naive OLS coefficients are biased downward. With positive selection (rho > 0), women who participate despite characteristics that predict non-participation (e.g., low education) must have high unobserved wage potential. In the selected sample the regression error is therefore negatively correlated with educ, which attenuates its coefficient.
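This mechanism can be checked directly in simulated data: with rho > 0, the wage-equation error is positive on average among the selected. A minimal base-R sketch (re-simulating a small bivariate-normal sample so the snippet is self-contained; the cutoff -0.5 is an illustrative stand-in for the participation rule):

```r
set.seed(1)
n <- 100000
rho <- 0.6
# Correlated errors without MASS: u_wage = rho*u_select + sqrt(1-rho^2)*noise
u_select <- rnorm(n)
u_wage <- rho * u_select + sqrt(1 - rho^2) * rnorm(n)
# Select on the first error, as the participation rule does
selected <- u_select > -0.5   # stand-in for z*gamma + u_s > 0 with z*gamma = 0.5
mean(u_wage[selected])        # positive: workers have above-average wage errors
mean(u_wage[!selected])       # negative (the overall mean is zero)
```

Truncating on u_select drags the mean of the correlated u_wage upward among the selected, which is exactly the omitted term that the inverse Mills ratio will absorb.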
Step 2: Estimate the Probit Selection Equation
The first step of the Heckman procedure is to estimate a probit model for the participation decision. This equation should include all variables in the wage equation plus at least one exclusion restriction — a variable that affects participation but not wages directly.
# Probit selection equation
# Exclusion restrictions: nchild, husb_inc
# These affect participation but not wages directly
probit <- glm(participate ~ educ + exper + nchild + husb_inc,
data = df, family = binomial(link = "probit"))
summary(probit)
cat("\nKey coefficients:\n")
cat(" nchild (excl. restr.):", coef(probit)["nchild"],
"- more children reduce participation\n")
cat(" husb_inc (excl. restr.):", coef(probit)["husb_inc"],
"- higher husband income reduces participation\n")

Expected output:
| Variable | Probit Coefficient | Interpretation |
|---|---|---|
| educ | ~0.15 | More education increases participation |
| exper | ~0.02 | More experience increases participation |
| nchild | ~-0.30 | More children reduce participation (exclusion restriction) |
| husb_inc | ~-0.02 | Higher husband income reduces participation (exclusion restriction) |
The exclusion restriction variables (nchild, husb_inc) are significant predictors of participation. These variables must plausibly have no direct effect on wages — they affect wages only through their effect on the participation decision.
Why is the exclusion restriction important in the Heckman model? What happens if the selection equation includes exactly the same variables as the wage equation?
Step 3: Compute the Inverse Mills Ratio
The inverse Mills ratio (IMR) is the key correction term. For each observation, it measures the expected value of the truncated selection error, conditional on being selected into the sample.
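Before applying the formula in code, it is worth verifying numerically that lambda(c) = phi(c)/Phi(c) really equals E[u | u > -c] for a standard normal u. A quick base-R check at an arbitrary cutoff:

```r
set.seed(1)
u <- rnorm(1e6)              # standard normal selection error
c_val <- 0.8                 # an arbitrary index value z*gamma
mean(u[u > -c_val])          # simulated mean of the error among the "selected"
dnorm(c_val) / pnorm(c_val)  # closed-form inverse Mills ratio
# The two agree up to simulation noise
```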
# Predicted index from the probit (X * gamma_hat)
xb <- predict(probit, type = "link")
# Inverse Mills ratio: lambda = phi(xb) / Phi(xb)
# phi = standard normal PDF, Phi = standard normal CDF
imr <- dnorm(xb) / pnorm(xb)
df$imr <- imr
df$xb <- xb
cat("IMR summary for participants:\n")
summary(imr[df$participate == 1])
cat("\nIMR summary for non-participants:\n")
summary(imr[df$participate == 0])
cat("\nIMR is larger for observations near the selection margin\n")
cat("(where participation probability is lower).\n")

Expected output:
The inverse Mills ratio is larger for observations near the participation margin (lower predicted probability of participation). For participants with very high predicted probability, the IMR is close to zero — there is little selection correction needed because these individuals would almost certainly participate regardless.
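This pattern can be seen without any estimation by evaluating the IMR at a few hypothetical index values:

```r
xb_vals <- c(-1, 0, 1, 2)    # hypothetical probit index values, low to high
data.frame(xb = xb_vals,
           prob_participate = round(pnorm(xb_vals), 3),
           imr = round(dnorm(xb_vals) / pnorm(xb_vals), 3))
```

The IMR falls monotonically, from roughly 1.53 at xb = -1 down to roughly 0.06 at xb = 2, so the correction term matters most for marginal participants.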
Step 4: Heckman Two-Step Estimation
Now run the second step: add the inverse Mills ratio as an additional regressor in the wage equation. The coefficient on the IMR (often called lambda) captures the selection bias.
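This regression is motivated by the standard conditional-mean result under bivariate normal errors:

```latex
E[\log w_i \mid X_i,\ \text{participate}_i = 1]
  = X_i \beta + \rho \, \sigma_\varepsilon \, \lambda(z_i \gamma),
\qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
```

so the IMR coefficient estimates rho times sigma; in this simulation that is 0.6 x 0.5 = 0.30.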
# Step 2: Add IMR to the wage equation (participants only)
heckman_manual <- lm(log_wage ~ educ + exper + imr,
data = df, subset = participate == 1)
cat("=== Heckman Two-Step (Manual) ===\n")
cat("educ coeff:", coef(heckman_manual)["educ"],
"(true:", true_beta_educ, ")\n")
cat("exper coeff:", coef(heckman_manual)["exper"],
"(true:", true_beta_exper, ")\n")
cat("lambda (IMR coeff):", coef(heckman_manual)["imr"], "\n")
cat("\nNote: SEs from manual two-step are incorrect.\n")
cat("Use the sampleSelection package for correct SEs.\n")
# Proper Heckman using sampleSelection package
heck <- selection(participate ~ educ + exper + nchild + husb_inc,
log_wage ~ educ + exper, data = df, method = "2step")
summary(heck)
cat("\n=== Comparison ===\n")
cat("Naive OLS educ:", coef(ols_biased)["educ"], "\n")
cat("Heckman educ: ", coef(heck, part = "outcome")["educ"], "\n")
cat("True educ:    ", true_beta_educ, "\n")

Expected output:
| Estimator | educ | exper | lambda (IMR) |
|---|---|---|---|
| Naive OLS | ~0.065 | ~0.025 | — |
| Heckman two-step | ~0.080 | ~0.030 | ~0.30 |
| True values | 0.080 | 0.030 | — |
The Heckman-corrected coefficients are closer to the true values. The positive and significant lambda indicates positive selection: women who choose to work have higher unobserved wage potential than non-workers.
The estimated lambda (coefficient on the inverse Mills ratio) is positive and statistically significant. What does this tell us about selection?
Step 5: Compare with Naive OLS
Let us put naive OLS, the Heckman correction, and a full-population OLS benchmark side by side, and understand when the correction matters most.
# Full comparison table
cat("========================================\n")
cat(" Naive OLS vs. Heckman \n")
cat("========================================\n")
cat(sprintf("%-15s %10s %10s %10s\n", "Parameter", "OLS", "Heckman", "True"))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "educ",
coef(ols_biased)["educ"],
coef(heck, part = "outcome")["educ"],
true_beta_educ))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "exper",
coef(ols_biased)["exper"],
coef(heck, part = "outcome")["exper"],
true_beta_exper))
# Extract rho and sigma from the Heckman model
cat("\nSelection parameters:\n")
cat(" rho (error correlation):", heck$rho, "\n")
cat(" sigma (wage error SD):", heck$sigma, "\n")
cat(" lambda = rho * sigma:", heck$rho * heck$sigma, "\n")
# OLS on the FULL sample (if wages were observed for everyone)
# This is the benchmark we would get without selection
ols_full <- lm(log_wage ~ educ + exper,
data = data.frame(log_wage = log_wage, educ, exper))
cat("\nOLS on full population (no selection):\n")
cat(" educ:", coef(ols_full)["educ"], "\n")
cat(" exper:", coef(ols_full)["exper"], "\n")

Expected output:
| Estimator | educ | exper | lambda |
|---|---|---|---|
| OLS full population | ~0.080 | ~0.030 | — |
| Naive OLS (workers only) | ~0.065 | ~0.025 | — |
| Heckman two-step | ~0.080 | ~0.030 | ~0.30 |
| True values | 0.080 | 0.030 | — |
The Heckman estimator recovers coefficients close to the true values and close to what we would get if we could observe wages for the entire population. The naive OLS on workers is biased because it ignores selection.
Step 6: Check Exclusion Restriction Strength
The exclusion restriction is what gives the Heckman model practical identification beyond functional form. Let us test what happens when the exclusion restrictions are weak.
# Test 1: Joint significance of exclusion restrictions in probit
probit_full <- glm(participate ~ educ + exper + nchild + husb_inc,
data = df, family = binomial(link = "probit"))
probit_restricted <- glm(participate ~ educ + exper,
data = df, family = binomial(link = "probit"))
lr_test <- anova(probit_restricted, probit_full, test = "Chisq")
cat("LR test for exclusion restrictions:\n")
print(lr_test)
# Test 2: What happens WITHOUT exclusion restrictions
cat("\n=== Heckman WITHOUT exclusion restrictions ===\n")
cat("(Identification from functional form only)\n")
heck_no_excl <- selection(participate ~ educ + exper,
log_wage ~ educ + exper, data = df, method = "2step")
cat("educ (no excl. restr.):",
coef(heck_no_excl, part = "outcome")["educ"], "\n")
cat("educ (with excl. restr.):",
coef(heck, part = "outcome")["educ"], "\n")
cat("educ (true):", true_beta_educ, "\n")
cat("\nWithout exclusion restrictions, estimates are fragile.\n")

Expected output:
| Scenario | educ estimate | Stable? |
|---|---|---|
| With exclusion restrictions | ~0.080 | Yes |
| Without exclusion restrictions | Unstable | No — identified only through functional form |
| True value | 0.080 | — |
The LR test should strongly reject the null that the exclusion restriction variables have zero coefficients in the selection equation. Without these variables, the Heckman model relies on functional form alone and produces fragile estimates.
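The fragility is at bottom a collinearity problem: when the probit index contains only the wage-equation variables, the IMR is a smooth function of those same variables and is nearly linear over the selected range. A self-contained base-R sketch (the simulated coefficients are illustrative, mirroring the lab's setup):

```r
set.seed(1)
n <- 2000
educ <- rnorm(n, mean = 12, sd = 3)
exper <- rnorm(n, mean = 15, sd = 8)
# Selection index built ONLY from the wage-equation variables
part <- as.integer(-1.5 + 0.15 * educ + 0.02 * exper + rnorm(n) > 0)
probit_noexcl <- glm(part ~ educ + exper, family = binomial(link = "probit"))
imr <- dnorm(predict(probit_noexcl)) / pnorm(predict(probit_noexcl))
# Among participants, a linear function of educ and exper almost
# reproduces the IMR -- severe collinearity in the second step
r2 <- summary(lm(imr ~ educ + exper, subset = part == 1))$r.squared
r2  # typically well above 0.9 here
```

With an exclusion restriction, the IMR varies independently of educ and exper, and this R-squared drops, giving the second-step regression genuine identifying variation.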
Step 7: Sensitivity Analysis
Check how the Heckman estimates change under different assumptions about the correlation between errors.
# Simulate data with different rho values
rho_values <- c(-0.5, 0, 0.3, 0.6, 0.9)
results <- data.frame(rho = numeric(), ols_educ = numeric(),
heck_educ = numeric(), lambda = numeric())
for (r in rho_values) {
Sig <- matrix(c(1, r, r, 1), 2, 2)
errs <- mvrnorm(n, mu = c(0, 0), Sigma = Sig)
part_r <- as.integer(z_gamma + errs[, 1] > 0)
lw_r <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * errs[, 2]
lw_obs <- ifelse(part_r == 1, lw_r, NA)
df_r <- data.frame(log_wage = lw_obs, educ, exper, nchild, husb_inc,
participate = part_r)
ols_r <- lm(log_wage ~ educ + exper, data = df_r, subset = participate == 1)
heck_r <- selection(participate ~ educ + exper + nchild + husb_inc,
log_wage ~ educ + exper, data = df_r, method = "2step")
results <- rbind(results, data.frame(
rho = r,
ols_educ = coef(ols_r)["educ"],
heck_educ = coef(heck_r, part = "outcome")["educ"],
lambda = heck_r$rho * heck_r$sigma
))
}
cat("Sensitivity to selection strength (rho):\n")
print(round(results, 4))
cat("\nTrue educ coefficient:", true_beta_educ, "\n")

Expected output:
| rho | OLS educ | Heckman educ | lambda |
|---|---|---|---|
| -0.5 | ~0.095 | ~0.080 | ~-0.25 |
| 0.0 | ~0.080 | ~0.080 | ~0.00 |
| 0.3 | ~0.072 | ~0.080 | ~0.15 |
| 0.6 | ~0.065 | ~0.080 | ~0.30 |
| 0.9 | ~0.055 | ~0.080 | ~0.45 |
When rho = 0 (no selection), OLS and Heckman give the same result. As rho increases, OLS bias grows, but the Heckman estimator consistently recovers the true coefficient.
Step 8: Exercises
Interpreting Lambda and Rho
You estimate a Heckman model of wages for a sample of employed individuals. The coefficient on the inverse Mills ratio (lambda) is 0.35 with a standard error of 0.08, and the estimated sigma (standard deviation of the wage equation error) is 0.50. Is the selection effect statistically significant? Using lambda = rho * sigma, what is the implied error correlation rho, and what does its sign say about who selects into employment?

Further Exercises

- Different exclusion restrictions. Replace nchild and husb_inc with other variables (e.g., local childcare availability, distance to nearest employer). How do the estimates change? What makes a good exclusion restriction?
- Heckman MLE vs. two-step. The maximum likelihood estimator jointly estimates the selection and outcome equations. Compare MLE and two-step estimates. MLE is more efficient but more sensitive to distributional assumptions.
- Non-normal errors. What happens if the errors are not bivariate normal? Try simulating with t-distributed errors and see how the Heckman estimator performs.
Key Takeaways
- Sample selection bias arises when the outcome is observed only for a non-random subsample, and the selection process is correlated with unobserved determinants of the outcome
- The Heckman two-step estimator corrects for selection bias by including the inverse Mills ratio as an additional regressor in the outcome equation
- The inverse Mills ratio captures the expected value of the truncated selection error, conditional on being selected
- Lambda (the coefficient on the IMR) equals rho times sigma, where rho is the correlation between selection and outcome errors
- A credible exclusion restriction is essential: without it, the model is identified only through fragile functional form assumptions
- The sign of lambda indicates the direction of selection: positive lambda means positive selection (selected individuals have higher unobserved outcome potential)
- Always report the exclusion restriction, its economic justification, and the significance of lambda