MethodAtlas
Lab tutorial · 90 minutes

Lab: Heckman Selection Model from Scratch

Implement the Heckman two-step correction for sample selection bias. Simulate a selection model with correlated errors, estimate the probit selection equation, compute the inverse Mills ratio, and compare corrected estimates with naive OLS.

Languages: Python, R, Stata
Dataset: Female labor force participation (simulated, modeled on Mroz 1987)

Overview

Sample selection bias arises whenever we observe the outcome variable only for a non-random subsample. The classic example: we observe wages only for women who choose to work. If the factors driving the participation decision are correlated with unobserved determinants of wages, OLS on the selected sample is biased. Heckman's (1979) two-step estimator corrects for this bias using information from the selection equation.
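Formally, under joint normality of the selection and outcome errors, the expected outcome in the selected sample picks up a selection term (a sketch of the standard result):

```latex
E[\log w_i \mid X_i,\ \text{participate}_i = 1]
  = X_i\beta + \rho\,\sigma\,\lambda(Z_i\gamma),
\qquad
\lambda(z) = \frac{\phi(z)}{\Phi(z)}
```

where phi and Phi are the standard normal pdf and cdf, rho is the correlation between the selection and wage errors, and sigma is the wage-error standard deviation. OLS on the selected sample omits the lambda term, so it is biased whenever rho is nonzero.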

What you will learn:

  • Why OLS on a selected sample is biased and in which direction
  • How to specify and estimate the probit selection equation
  • How to compute the inverse Mills ratio (IMR) and what it represents
  • How to run the Heckman two-step correction and interpret lambda
  • Why the exclusion restriction matters for identification
  • How to compare corrected estimates with naive OLS

Prerequisites: OLS regression, basic understanding of probit models and maximum likelihood.


Step 1: The Selection Problem

We want to estimate the wage equation for women. The problem: wages are observed only for women who participate in the labor force. If unobserved factors (e.g., motivation, household preferences) affect both the participation decision and wages, OLS on the working subsample is biased.

library(sampleSelection)
library(MASS)

set.seed(42)
n <- 5000

# Covariates
educ <- round(rnorm(n, mean = 12, sd = 3))
exper <- pmax(round(rnorm(n, mean = 15, sd = 8)), 0)
age <- 25 + round(exper * 0.8 + rnorm(n, sd = 3))
nchild <- rpois(n, lambda = 1.5)  # Exclusion restriction variable
husb_inc <- pmax(rnorm(n, mean = 30, sd = 15), 0)  # Exclusion restriction variable

# Correlated errors: rho = 0.6 (positive selection)
rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
errors <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
u_select <- errors[, 1]   # Selection equation error
u_wage   <- errors[, 2]   # Wage equation error

# Selection equation (latent): participate if z*gamma + u_s > 0
z_gamma <- -1.5 + 0.15 * educ + 0.02 * exper - 0.3 * nchild - 0.02 * husb_inc
participate <- as.integer(z_gamma + u_select > 0)
cat("Participation rate:", mean(participate), "\n")

# Wage equation (observed only for participants)
true_beta_educ  <- 0.08
true_beta_exper <- 0.03
log_wage <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * u_wage
log_wage_obs <- ifelse(participate == 1, log_wage, NA)

df <- data.frame(log_wage = log_wage_obs, educ, exper, age,
                 nchild, husb_inc, participate)

# Naive OLS on workers only (biased due to selection)
ols_biased <- lm(log_wage ~ educ + exper, data = df, subset = participate == 1)
cat("\nNaive OLS (selected sample):\n")
cat("  educ coeff:", coef(ols_biased)["educ"], "(true:", true_beta_educ, ")\n")
cat("  exper coeff:", coef(ols_biased)["exper"], "(true:", true_beta_exper, ")\n")
Requires: MASS, sampleSelection

Expected output:

Estimator | educ coefficient | exper coefficient | True educ | True exper
Naive OLS | ~0.065           | ~0.025            | 0.080     | 0.030

The naive OLS coefficients are biased. With positive selection (rho > 0), women who participate tend to have higher unobserved wage potential, creating a non-random sample. Because participation probability rises with education, the low-education women we do observe working are disproportionately those with high unobserved wage potential; this compresses observed wage differences across education levels and biases the educ coefficient downward.


Step 2: Estimate the Probit Selection Equation

The first step of the Heckman procedure is to estimate a probit model for the participation decision. This equation should include all variables in the wage equation plus at least one exclusion restriction — a variable that affects participation but not wages directly.

# Probit selection equation
# Exclusion restrictions: nchild, husb_inc
# These affect participation but not wages directly
probit <- glm(participate ~ educ + exper + nchild + husb_inc,
            data = df, family = binomial(link = "probit"))

summary(probit)

cat("\nKey coefficients:\n")
cat("  nchild (excl. restr.):", coef(probit)["nchild"],
  "- more children reduce participation\n")
cat("  husb_inc (excl. restr.):", coef(probit)["husb_inc"],
  "- higher husband income reduces participation\n")

Expected output:

Variable | Probit coefficient | Interpretation
educ     | ~0.15              | More education increases participation
exper    | ~0.02              | More experience increases participation
nchild   | ~-0.30             | More children reduce participation (exclusion restriction)
husb_inc | ~-0.02             | Higher husband income reduces participation (exclusion restriction)

The exclusion restriction variables (nchild, husb_inc) are significant predictors of participation. These variables must plausibly have no direct effect on wages — they affect wages only through their effect on the participation decision.

Concept Check

Why is the exclusion restriction important in the Heckman model? What happens if the selection equation includes exactly the same variables as the wage equation?


Step 3: Compute the Inverse Mills Ratio

The inverse Mills ratio (IMR) is the key correction term. For each observation, it measures the expected value of the truncated selection error, conditional on being selected into the sample.

# Predicted index from the probit (X * gamma_hat)
xb <- predict(probit, type = "link")

# Inverse Mills ratio: lambda = phi(xb) / Phi(xb)
# phi = standard normal PDF, Phi = standard normal CDF
imr <- dnorm(xb) / pnorm(xb)

df$imr <- imr
df$xb  <- xb

cat("IMR summary for participants:\n")
print(summary(imr[df$participate == 1]))
cat("\nIMR summary for non-participants:\n")
print(summary(imr[df$participate == 0]))
cat("\nIMR is larger for observations near the selection margin\n")
cat("(where participation probability is lower).\n")

Expected output:

The inverse Mills ratio is larger for observations near the participation margin (lower predicted probability of participation). For participants with very high predicted probability, the IMR is close to zero — there is little selection correction needed because these individuals would almost certainly participate regardless.
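As a quick cross-check in Python (one of the lab's listed languages; a minimal sketch assuming numpy and scipy), we can compute the IMR directly and verify by Monte Carlo that it really is the mean of the truncated selection error:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(xb):
    """IMR: lambda(xb) = phi(xb) / Phi(xb) for the standard normal."""
    xb = np.asarray(xb, dtype=float)
    return norm.pdf(xb) / norm.cdf(xb)

# Monte Carlo check of the claim: E[u | u > -c] = phi(c) / Phi(c)
rng = np.random.default_rng(42)
c = 0.5
u = rng.standard_normal(1_000_000)
truncated_mean = u[u > -c].mean()
print(truncated_mean, inverse_mills(c))  # both ~0.509
```

The simulated truncated mean and the analytic IMR agree, and evaluating `inverse_mills` over a grid confirms it shrinks toward zero as the index (participation probability) grows.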


Step 4: Heckman Two-Step Estimation

Now run the second step: add the inverse Mills ratio as an additional regressor in the wage equation. The coefficient on the IMR (often called lambda) captures the selection bias.

# Step 2: Add IMR to the wage equation (participants only)
heckman_manual <- lm(log_wage ~ educ + exper + imr,
                   data = df, subset = participate == 1)

cat("=== Heckman Two-Step (Manual) ===\n")
cat("educ coeff:", coef(heckman_manual)["educ"],
  "(true:", true_beta_educ, ")\n")
cat("exper coeff:", coef(heckman_manual)["exper"],
  "(true:", true_beta_exper, ")\n")
cat("lambda (IMR coeff):", coef(heckman_manual)["imr"], "\n")
cat("\nNote: SEs from manual two-step are incorrect.\n")
cat("Use the sampleSelection package for correct SEs.\n")

# Proper Heckman using sampleSelection package
heck <- selection(participate ~ educ + exper + nchild + husb_inc,
                log_wage ~ educ + exper, data = df)
summary(heck)

cat("\n=== Comparison ===\n")
cat("Naive OLS educ:", coef(ols_biased)["educ"], "\n")
cat("Heckman educ:  ", coef(heck, part = "outcome")["educ"], "\n")
cat("True educ:     ", true_beta_educ, "\n")

Expected output:

Estimator        | educ   | exper  | lambda (IMR)
Naive OLS        | ~0.065 | ~0.025 | -
Heckman two-step | ~0.080 | ~0.030 | ~0.30
True values      | 0.080  | 0.030  | -

The Heckman-corrected coefficients are closer to the true values. The positive and significant lambda indicates positive selection: women who choose to work have higher unobserved wage potential than non-workers.
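The IMR coefficient has a structural interpretation: lambda equals rho times sigma. In this simulation the errors were built with rho = 0.6 and a wage-error SD of 0.5, so the expected value is

```latex
\lambda = \rho\,\sigma = 0.6 \times 0.5 = 0.30
```

matching the ~0.30 in the table. Inverting the identity, rho-hat = lambda-hat / sigma-hat gives a quick diagnostic for the strength of selection.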

Concept Check

The estimated lambda (coefficient on the inverse Mills ratio) is positive and statistically significant. What does this tell us about selection?


Step 5: Compare with Naive OLS

Let us compare naive OLS and Heckman-corrected estimates side by side, and understand when the correction matters most.

# Full comparison table
cat("========================================\n")
cat("         Naive OLS vs. Heckman          \n")
cat("========================================\n")
cat(sprintf("%-15s %10s %10s %10s\n", "Parameter", "OLS", "Heckman", "True"))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "educ",
  coef(ols_biased)["educ"],
  coef(heck, part = "outcome")["educ"],
  true_beta_educ))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "exper",
  coef(ols_biased)["exper"],
  coef(heck, part = "outcome")["exper"],
  true_beta_exper))

# Extract rho and sigma from the Heckman model
# (with the default ML fit, sigma and rho are estimated coefficients)
cat("\nSelection parameters:\n")
cat("  rho (error correlation):", coef(heck)["rho"], "\n")
cat("  sigma (wage error SD):", coef(heck)["sigma"], "\n")
cat("  lambda = rho * sigma:", coef(heck)["rho"] * coef(heck)["sigma"], "\n")

# OLS on the FULL sample (if wages were observed for everyone)
# This is the benchmark we would get without selection
ols_full <- lm(log_wage ~ educ + exper,
             data = data.frame(log_wage = log_wage, educ, exper))
cat("\nOLS on full population (no selection):\n")
cat("  educ:", coef(ols_full)["educ"], "\n")
cat("  exper:", coef(ols_full)["exper"], "\n")

Expected output:

Estimator                | educ   | exper  | lambda
OLS full population      | ~0.080 | ~0.030 | -
Naive OLS (workers only) | ~0.065 | ~0.025 | -
Heckman two-step         | ~0.080 | ~0.030 | ~0.30
True values              | 0.080  | 0.030  | -

The Heckman estimator recovers coefficients close to the true values and close to what we would get if we could observe wages for the entire population. The naive OLS on workers is biased because it ignores selection.


Step 6: Check Exclusion Restriction Strength

The exclusion restriction is what gives the Heckman model practical identification beyond functional form. Let us test what happens when the exclusion restrictions are weak.

# Test 1: Joint significance of exclusion restrictions in probit
probit_full <- glm(participate ~ educ + exper + nchild + husb_inc,
                 data = df, family = binomial(link = "probit"))
probit_restricted <- glm(participate ~ educ + exper,
                       data = df, family = binomial(link = "probit"))

lr_test <- anova(probit_restricted, probit_full, test = "Chisq")
cat("LR test for exclusion restrictions:\n")
print(lr_test)

# Test 2: What happens WITHOUT exclusion restrictions
cat("\n=== Heckman WITHOUT exclusion restrictions ===\n")
cat("(Identification from functional form only)\n")
heck_no_excl <- selection(participate ~ educ + exper,
                        log_wage ~ educ + exper, data = df)
cat("educ (no excl. restr.):",
  coef(heck_no_excl, part = "outcome")["educ"], "\n")
cat("educ (with excl. restr.):",
  coef(heck, part = "outcome")["educ"], "\n")
cat("educ (true):", true_beta_educ, "\n")
cat("\nWithout exclusion restrictions, estimates are fragile.\n")

Expected output:

Scenario                       | educ estimate | Stable?
With exclusion restrictions    | ~0.080        | Yes
Without exclusion restrictions | unstable      | No (identified only through functional form)
True value                     | 0.080         | -

The LR test should strongly reject the null that the exclusion restriction variables have zero coefficients in the selection equation. Without these variables, the Heckman model relies on functional form alone and produces fragile estimates.
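One way to see why functional-form-only identification is fragile (a sketch in Python, assuming numpy and scipy): over the range of index values typical for participants, the IMR is close to linear in the index, so without an excluded variable the correction term is nearly collinear with the regressors already in the wage equation.

```python
import numpy as np
from scipy.stats import norm

# Index values spanning a range typical for participants
xb = np.linspace(0.0, 2.0, 200)
imr = norm.pdf(xb) / norm.cdf(xb)

# Near-linearity: the IMR is almost a linear function of the index itself,
# so identification must come from its mild curvature alone
corr = np.corrcoef(xb, imr)[0, 1]
print(f"corr(index, IMR) = {corr:.3f}")
```

With an exclusion restriction, the IMR varies independently of the wage regressors and this near-collinearity disappears.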


Step 7: Sensitivity Analysis

Check how the Heckman estimates change under different assumptions about the correlation between errors.

# Simulate data with different rho values
rho_values <- c(-0.5, 0, 0.3, 0.6, 0.9)
results <- data.frame(rho = numeric(), ols_educ = numeric(),
                    heck_educ = numeric(), lambda = numeric())

for (r in rho_values) {
  Sig <- matrix(c(1, r, r, 1), 2, 2)
  errs <- mvrnorm(n, mu = c(0, 0), Sigma = Sig)

  part_r <- as.integer(z_gamma + errs[, 1] > 0)
  lw_r <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * errs[, 2]
  lw_obs <- ifelse(part_r == 1, lw_r, NA)

  df_r <- data.frame(log_wage = lw_obs, educ, exper, nchild, husb_inc,
                     participate = part_r)

  ols_r <- lm(log_wage ~ educ + exper, data = df_r, subset = participate == 1)
  heck_r <- selection(participate ~ educ + exper + nchild + husb_inc,
                      log_wage ~ educ + exper, data = df_r)

  results <- rbind(results, data.frame(
    rho = r,
    ols_educ = coef(ols_r)["educ"],
    heck_educ = coef(heck_r, part = "outcome")["educ"],
    lambda = coef(heck_r)["rho"] * coef(heck_r)["sigma"]
  ))
}

cat("Sensitivity to selection strength (rho):\n")
print(round(results, 4))
cat("\nTrue educ coefficient:", true_beta_educ, "\n")

Expected output:

rho  | OLS educ | Heckman educ | lambda
-0.5 | ~0.095   | ~0.080       | ~-0.25
0.0  | ~0.080   | ~0.080       | ~0.00
0.3  | ~0.072   | ~0.080       | ~0.15
0.6  | ~0.065   | ~0.080       | ~0.30
0.9  | ~0.055   | ~0.080       | ~0.45

When rho = 0 (no selection), OLS and Heckman give the same result. As rho increases, OLS bias grows, but the Heckman estimator consistently recovers the true coefficient.


Step 8: Exercises

Guided Exercise

Interpreting Lambda and Rho

You estimate a Heckman model of wages for a sample of employed individuals. The coefficient on the inverse Mills ratio (lambda) is 0.35 with a standard error of 0.08. The estimated sigma (standard deviation of the wage equation error) is 0.50.

  1. What is the implied correlation (rho) between selection and wage errors? (Hint: rho = lambda / sigma.)

  2. Is the selection positive or negative?

  3. Is OLS on the selected sample biased for the population returns to education? (yes or no)

  1. Different exclusion restrictions. Replace nchild and husb_inc with other variables (e.g., local childcare availability, distance to nearest employer). How do the estimates change? What makes a good exclusion restriction?

  2. Heckman MLE vs. two-step. The maximum likelihood estimator jointly estimates the selection and outcome equations. Compare MLE and two-step estimates. MLE is more efficient but more sensitive to distributional assumptions.

  3. Non-normal errors. What happens if the errors are not bivariate normal? Try simulating with t-distributed errors and see how the Heckman estimator performs.
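As a starter sketch for the non-normal-errors exercise (assuming numpy; the variable names are illustrative), one way to generate correlated heavy-tailed errors is to divide a correlated Gaussian pair by a shared chi-square mixing factor, which yields a bivariate t draw that keeps the target correlation but violates the bivariate normality behind the IMR formula:

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho, df_t = 5000, 0.6, 8

# Correlated Gaussian pair with the target correlation via Cholesky
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
z = rng.standard_normal((n, 2)) @ L.T

# Shared chi-square mixing factor -> bivariate t with df_t degrees of freedom
w = rng.chisquare(df_t, size=n) / df_t
t_errors = z / np.sqrt(w)[:, None]

# Heavy-tailed but still correlated near rho; substitute these for the
# mvrnorm draws in the Step 1 simulation and re-run the Heckman estimator
print(np.corrcoef(t_errors.T)[0, 1])
```

Lower `df_t` for heavier tails; as the tails get heavier, the two-step estimates should drift further from the truth because the IMR is derived under normality.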


Key Takeaways

  • Sample selection bias arises when the outcome is observed only for a non-random subsample, and the selection process is correlated with unobserved determinants of the outcome
  • The Heckman two-step estimator corrects for selection bias by including the inverse Mills ratio as an additional regressor in the outcome equation
  • The inverse Mills ratio captures the expected value of the truncated selection error, conditional on being selected
  • Lambda (the coefficient on the IMR) equals rho times sigma, where rho is the correlation between selection and outcome errors
  • A credible exclusion restriction is essential: without it, the model is identified only through fragile functional form assumptions
  • The sign of lambda indicates the direction of selection: positive lambda means positive selection (selected individuals have higher unobserved outcome potential)
  • Always report the exclusion restriction, its economic justification, and the significance of lambda