MethodAtlas
Lab tutorial · 90 minutes

Lab: Heckman Selection Model from Scratch

Implement the Heckman two-step correction for sample selection bias. Simulate a selection model with correlated errors, estimate the probit selection equation, compute the inverse Mills ratio, and compare corrected estimates with naive OLS.

Languages: Python, R, Stata
Dataset: Female labor force participation (simulated, modeled on Mroz 1987)

Overview

Sample selection bias arises whenever we observe the outcome variable only for a non-random subsample. The classic example: we observe wages only for women who choose to work. If the factors driving the participation decision are correlated with unobserved determinants of wages, OLS on the selected sample is biased. Heckman's (1979) two-step estimator corrects for this bias using information from the selection equation.
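Formally, under joint normality of the selection and outcome errors, the expected outcome in the selected sample picks up a selection term (a sketch of the standard result):

```latex
E[\log w_i \mid X_i,\ \text{participate}_i = 1]
  = X_i\beta + \rho\,\sigma\,\lambda(Z_i\gamma),
\qquad
\lambda(z) = \frac{\phi(z)}{\Phi(z)}
```

where phi and Phi are the standard normal pdf and cdf, rho is the correlation between the selection and wage errors, and sigma is the wage-error standard deviation. OLS on the selected sample omits the lambda term, so it is biased whenever rho is nonzero.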

What you will learn:

  • Why OLS on a selected sample is biased and in which direction
  • How to specify and estimate the probit selection equation
  • How to compute the inverse Mills ratio (IMR) and what it represents
  • How to run the Heckman two-step correction and interpret lambda
  • Why the exclusion restriction matters for identification
  • How to compare corrected estimates with naive OLS

Prerequisites: OLS regression, basic understanding of probit models and maximum likelihood.


Step 1: The Selection Problem

We want to estimate the wage equation for women. The problem: wages are observed only for women who participate in the labor force. If unobserved factors (e.g., motivation, household preferences) affect both the participation decision and wages, OLS on the working subsample is biased.

library(sampleSelection)
library(MASS)

set.seed(42)
n <- 5000

# Covariates
educ <- round(rnorm(n, mean = 12, sd = 3))
exper <- pmax(round(rnorm(n, mean = 15, sd = 8)), 0)
age <- 25 + round(exper * 0.8 + rnorm(n, sd = 3))
nchild <- rpois(n, lambda = 1.5)  # Exclusion restriction variable
husb_inc <- pmax(rnorm(n, mean = 30, sd = 15), 0)  # Exclusion restriction variable

# Correlated errors: rho = 0.6 (positive selection)
rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
errors <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
u_select <- errors[, 1]   # Selection equation error
u_wage   <- errors[, 2]   # Wage equation error

# Selection equation (latent): participate if z*gamma + u_s > 0
z_gamma <- -1.5 + 0.15 * educ + 0.02 * exper - 0.3 * nchild - 0.02 * husb_inc
participate <- as.integer(z_gamma + u_select > 0)
cat("Participation rate:", mean(participate), "\n")

# Wage equation (observed only for participants)
true_beta_educ  <- 0.08
true_beta_exper <- 0.03
log_wage <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * u_wage
log_wage_obs <- ifelse(participate == 1, log_wage, NA)

df <- data.frame(log_wage = log_wage_obs, educ, exper, age,
                 nchild, husb_inc, participate)

# Naive OLS on workers only (biased due to selection)
ols_biased <- lm(log_wage ~ educ + exper, data = df, subset = participate == 1)
cat("\nNaive OLS (selected sample):\n")
cat("  educ coeff:", coef(ols_biased)["educ"], "(true:", true_beta_educ, ")\n")
cat("  exper coeff:", coef(ols_biased)["exper"], "(true:", true_beta_exper, ")\n")
Requires: MASS, sampleSelection

Expected output:

Estimator | educ coefficient | exper coefficient | True educ | True exper
Naive OLS | ~0.065           | ~0.025            | 0.080     | 0.030

The naive OLS coefficients are biased. With positive selection (rho > 0), women who participate tend to have higher unobserved wage potential, creating a non-random sample. Because participation probability rises with education, the low-education women we do observe working are disproportionately those with high unobserved wage potential; this compresses observed wage differences across education levels and biases the educ coefficient downward.


Step 2: Estimate the Probit Selection Equation

The first step of the Heckman procedure is to estimate a probit model for the participation decision. This equation should include all variables in the wage equation plus at least one exclusion restriction — a variable that affects participation but not wages directly.

# Probit selection equation
# Exclusion restrictions: nchild, husb_inc
# These affect participation but not wages directly
probit <- glm(participate ~ educ + exper + nchild + husb_inc,
            data = df, family = binomial(link = "probit"))

summary(probit)

cat("\nKey coefficients:\n")
cat("  nchild (excl. restr.):", coef(probit)["nchild"],
  "- more children reduce participation\n")
cat("  husb_inc (excl. restr.):", coef(probit)["husb_inc"],
  "- higher husband income reduces participation\n")

Expected output:

Variable | Probit coefficient | Interpretation
educ     | ~0.15              | More education increases participation
exper    | ~0.02              | More experience increases participation
nchild   | ~-0.30             | More children reduce participation (exclusion restriction)
husb_inc | ~-0.02             | Higher husband income reduces participation (exclusion restriction)

The exclusion restriction variables (nchild, husb_inc) are significant predictors of participation. These variables must plausibly have no direct effect on wages — they affect wages only through their effect on the participation decision.

Concept Check

Why is the exclusion restriction important in the Heckman model? What happens if the selection equation includes exactly the same variables as the wage equation?


Step 3: Compute the Inverse Mills Ratio

The inverse Mills ratio (IMR) is the key correction term. For each observation, it measures the expected value of the truncated selection error, conditional on being selected into the sample.

# Predicted index from the probit (X * gamma_hat)
xb <- predict(probit, type = "link")

# Inverse Mills ratio: lambda = phi(xb) / Phi(xb)
# phi = standard normal PDF, Phi = standard normal CDF
imr <- dnorm(xb) / pnorm(xb)

df$imr <- imr
df$xb  <- xb

cat("IMR summary for participants:\n")
print(summary(imr[df$participate == 1]))
cat("\nIMR summary for non-participants:\n")
print(summary(imr[df$participate == 0]))
cat("\nIMR is larger for observations near the selection margin\n")
cat("(where participation probability is lower).\n")

Expected output:

The inverse Mills ratio is larger for observations near the participation margin (lower predicted probability of participation). For participants with very high predicted probability, the IMR is close to zero — there is little selection correction needed because these individuals would almost certainly participate regardless.
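As a quick cross-check in Python (one of the lab's listed languages; a minimal sketch assuming numpy and scipy), we can compute the IMR directly and verify by Monte Carlo that it really is the mean of the truncated selection error:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(xb):
    """IMR: lambda(xb) = phi(xb) / Phi(xb) for the standard normal."""
    xb = np.asarray(xb, dtype=float)
    return norm.pdf(xb) / norm.cdf(xb)

# Monte Carlo check of the claim: E[u | u > -c] = phi(c) / Phi(c)
rng = np.random.default_rng(42)
c = 0.5
u = rng.standard_normal(1_000_000)
truncated_mean = u[u > -c].mean()
print(truncated_mean, inverse_mills(c))  # both ~0.509
```

The simulated truncated mean and the analytic IMR agree, and evaluating `inverse_mills` over a grid confirms it shrinks toward zero as the index (participation probability) grows.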


Step 4: Heckman Two-Step Estimation

Now run the second step: add the inverse Mills ratio as an additional regressor in the wage equation. The coefficient on the IMR (often called lambda) captures the selection bias.

# Step 2: Add IMR to the wage equation (participants only)
heckman_manual <- lm(log_wage ~ educ + exper + imr,
                   data = df, subset = participate == 1)

cat("=== Heckman Two-Step (Manual) ===\n")
cat("educ coeff:", coef(heckman_manual)["educ"],
  "(true:", true_beta_educ, ")\n")
cat("exper coeff:", coef(heckman_manual)["exper"],
  "(true:", true_beta_exper, ")\n")
cat("lambda (IMR coeff):", coef(heckman_manual)["imr"], "\n")
cat("\nNote: SEs from manual two-step are incorrect.\n")
cat("Use the sampleSelection package for correct SEs.\n")

# Proper Heckman using sampleSelection package
heck <- selection(participate ~ educ + exper + nchild + husb_inc,
                log_wage ~ educ + exper, data = df)
summary(heck)

cat("\n=== Comparison ===\n")
cat("Naive OLS educ:", coef(ols_biased)["educ"], "\n")
cat("Heckman educ:  ", coef(heck, part = "outcome")["educ"], "\n")
cat("True educ:     ", true_beta_educ, "\n")

Expected output:

Estimator        | educ   | exper  | lambda (IMR)
Naive OLS        | ~0.065 | ~0.025 | -
Heckman two-step | ~0.080 | ~0.030 | ~0.30
True values      | 0.080  | 0.030  | -

The Heckman-corrected coefficients are closer to the true values. The positive and significant lambda indicates positive selection: women who choose to work have higher unobserved wage potential than non-workers.
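The IMR coefficient has a structural interpretation: lambda equals rho times sigma. In this simulation the errors were built with rho = 0.6 and a wage-error SD of 0.5, so the expected value is

```latex
\lambda = \rho\,\sigma = 0.6 \times 0.5 = 0.30
```

matching the ~0.30 in the table. Inverting the identity, rho-hat = lambda-hat / sigma-hat gives a quick diagnostic for the strength of selection.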

Concept Check

The estimated lambda (coefficient on the inverse Mills ratio) is positive and statistically significant. What does this tell us about selection?


Step 5: Compare with Naive OLS

Let us compare naive OLS and Heckman-corrected estimates side by side, and understand when the correction matters most.

# Full comparison table
cat("========================================\n")
cat("         Naive OLS vs. Heckman          \n")
cat("========================================\n")
cat(sprintf("%-15s %10s %10s %10s\n", "Parameter", "OLS", "Heckman", "True"))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "educ",
  coef(ols_biased)["educ"],
  coef(heck, part = "outcome")["educ"],
  true_beta_educ))
cat(sprintf("%-15s %10.4f %10.4f %10.4f\n", "exper",
  coef(ols_biased)["exper"],
  coef(heck, part = "outcome")["exper"],
  true_beta_exper))

# Extract rho and sigma from the Heckman model
# (with the default ML fit, sigma and rho are estimated coefficients)
cat("\nSelection parameters:\n")
cat("  rho (error correlation):", coef(heck)["rho"], "\n")
cat("  sigma (wage error SD):", coef(heck)["sigma"], "\n")
cat("  lambda = rho * sigma:", coef(heck)["rho"] * coef(heck)["sigma"], "\n")

# OLS on the FULL sample (if wages were observed for everyone)
# This is the benchmark we would get without selection
ols_full <- lm(log_wage ~ educ + exper,
             data = data.frame(log_wage = log_wage, educ, exper))
cat("\nOLS on full population (no selection):\n")
cat("  educ:", coef(ols_full)["educ"], "\n")
cat("  exper:", coef(ols_full)["exper"], "\n")

Expected output:

Estimator                | educ   | exper  | lambda
OLS full population      | ~0.080 | ~0.030 | -
Naive OLS (workers only) | ~0.065 | ~0.025 | -
Heckman two-step         | ~0.080 | ~0.030 | ~0.30
True values              | 0.080  | 0.030  | -

The Heckman estimator recovers coefficients close to the true values and close to what we would get if we could observe wages for the entire population. The naive OLS on workers is biased because it ignores selection.


Step 6: Check Exclusion Restriction Strength

The exclusion restriction is what gives the Heckman model practical identification beyond functional form. Let us test what happens when the exclusion restrictions are weak.

# Test 1: Joint significance of exclusion restrictions in probit
probit_full <- glm(participate ~ educ + exper + nchild + husb_inc,
                 data = df, family = binomial(link = "probit"))
probit_restricted <- glm(participate ~ educ + exper,
                       data = df, family = binomial(link = "probit"))

lr_test <- anova(probit_restricted, probit_full, test = "Chisq")
cat("LR test for exclusion restrictions:\n")
print(lr_test)

# Test 2: What happens WITHOUT exclusion restrictions
cat("\n=== Heckman WITHOUT exclusion restrictions ===\n")
cat("(Identification from functional form only)\n")
heck_no_excl <- selection(participate ~ educ + exper,
                        log_wage ~ educ + exper, data = df)
cat("educ (no excl. restr.):",
  coef(heck_no_excl, part = "outcome")["educ"], "\n")
cat("educ (with excl. restr.):",
  coef(heck, part = "outcome")["educ"], "\n")
cat("educ (true):", true_beta_educ, "\n")
cat("\nWithout exclusion restrictions, estimates are fragile.\n")

Expected output:

Scenario                       | educ estimate | Stable?
With exclusion restrictions    | ~0.080        | Yes
Without exclusion restrictions | unstable      | No (identified only through functional form)
True value                     | 0.080         | -

The LR test should strongly reject the null that the exclusion restriction variables have zero coefficients in the selection equation. Without these variables, the Heckman model relies on functional form alone and produces fragile estimates.
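One way to see why functional-form-only identification is fragile (a sketch in Python, assuming numpy and scipy): over the range of index values typical for participants, the IMR is close to linear in the index, so without an excluded variable the correction term is nearly collinear with the regressors already in the wage equation.

```python
import numpy as np
from scipy.stats import norm

# Index values spanning a range typical for participants
xb = np.linspace(0.0, 2.0, 200)
imr = norm.pdf(xb) / norm.cdf(xb)

# Near-linearity: the IMR is almost a linear function of the index itself,
# so identification must come from its mild curvature alone
corr = np.corrcoef(xb, imr)[0, 1]
print(f"corr(index, IMR) = {corr:.3f}")
```

With an exclusion restriction, the IMR varies independently of the wage regressors and this near-collinearity disappears.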


Step 7: Sensitivity Analysis

Check how the Heckman estimates change under different assumptions about the correlation between errors.

# Simulate data with different rho values
rho_values <- c(-0.5, 0, 0.3, 0.6, 0.9)
results <- data.frame(rho = numeric(), ols_educ = numeric(),
                    heck_educ = numeric(), lambda = numeric())

for (r in rho_values) {
  Sig <- matrix(c(1, r, r, 1), 2, 2)
  errs <- mvrnorm(n, mu = c(0, 0), Sigma = Sig)

  part_r <- as.integer(z_gamma + errs[, 1] > 0)
  lw_r <- 1.0 + true_beta_educ * educ + true_beta_exper * exper + 0.5 * errs[, 2]
  lw_obs <- ifelse(part_r == 1, lw_r, NA)

  df_r <- data.frame(log_wage = lw_obs, educ, exper, nchild, husb_inc,
                     participate = part_r)

  ols_r <- lm(log_wage ~ educ + exper, data = df_r, subset = participate == 1)
  heck_r <- selection(participate ~ educ + exper + nchild + husb_inc,
                      log_wage ~ educ + exper, data = df_r)

  results <- rbind(results, data.frame(
    rho = r,
    ols_educ = coef(ols_r)["educ"],
    heck_educ = coef(heck_r, part = "outcome")["educ"],
    lambda = coef(heck_r)["rho"] * coef(heck_r)["sigma"]
  ))
}

cat("Sensitivity to selection strength (rho):\n")
print(round(results, 4))
cat("\nTrue educ coefficient:", true_beta_educ, "\n")

Expected output:

rho  | OLS educ | Heckman educ | lambda
-0.5 | ~0.095   | ~0.080       | ~-0.25
0.0  | ~0.080   | ~0.080       | ~0.00
0.3  | ~0.072   | ~0.080       | ~0.15
0.6  | ~0.065   | ~0.080       | ~0.30
0.9  | ~0.055   | ~0.080       | ~0.45

When rho = 0 (no selection), OLS and Heckman give the same result. As rho increases, OLS bias grows, but the Heckman estimator consistently recovers the true coefficient.


Step 8: Exercises

Guided Exercise

Interpreting Lambda and Rho

You estimate a Heckman model of wages for a sample of employed individuals. The coefficient on the inverse Mills ratio (lambda) is 0.35 with a standard error of 0.08. The estimated sigma (standard deviation of the wage equation error) is 0.50.

  1. What is the implied correlation (rho) between selection and wage errors? (Hint: rho = lambda / sigma.)

  2. Is the selection positive or negative?

  3. Is OLS on the selected sample biased for the population returns to education? (yes or no)

  1. Different exclusion restrictions. Replace nchild and husb_inc with other variables (e.g., local childcare availability, distance to nearest employer). How do the estimates change? What makes a good exclusion restriction?

  2. Heckman MLE vs. two-step. The maximum likelihood estimator jointly estimates the selection and outcome equations. Compare MLE and two-step estimates. MLE is more efficient but more sensitive to distributional assumptions.

  3. Non-normal errors. What happens if the errors are not bivariate normal? Try simulating with t-distributed errors and see how the Heckman estimator performs.
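As a starter sketch for the non-normal-errors exercise (assuming numpy; the variable names are illustrative), one way to generate correlated heavy-tailed errors is to divide a correlated Gaussian pair by a shared chi-square mixing factor, which yields a bivariate t draw that keeps the target correlation but violates the bivariate normality behind the IMR formula:

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho, df_t = 5000, 0.6, 8

# Correlated Gaussian pair with the target correlation via Cholesky
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
z = rng.standard_normal((n, 2)) @ L.T

# Shared chi-square mixing factor -> bivariate t with df_t degrees of freedom
w = rng.chisquare(df_t, size=n) / df_t
t_errors = z / np.sqrt(w)[:, None]

# Heavy-tailed but still correlated near rho; substitute these for the
# mvrnorm draws in the Step 1 simulation and re-run the Heckman estimator
print(np.corrcoef(t_errors.T)[0, 1])
```

Lower `df_t` for heavier tails; as the tails get heavier, the two-step estimates should drift further from the truth because the IMR is derived under normality.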


Key Takeaways

  • Sample selection bias arises when the outcome is observed only for a non-random subsample, and the selection process is correlated with unobserved determinants of the outcome
  • The Heckman two-step estimator corrects for selection bias by including the inverse Mills ratio as an additional regressor in the outcome equation
  • The inverse Mills ratio captures the expected value of the truncated selection error, conditional on being selected
  • Lambda (the coefficient on the IMR) equals rho times sigma, where rho is the correlation between selection and outcome errors
  • A credible exclusion restriction is essential: without it, the model is identified only through fragile functional form assumptions
  • The sign of lambda indicates the direction of selection: positive lambda means positive selection (selected individuals have higher unobserved outcome potential)
  • Always report the exclusion restriction, its economic justification, and the significance of lambda