Lab: Logit and Probit Models
Estimate binary choice models step by step. Learn to fit logit and probit models, compute and interpret marginal effects, predict probabilities, compare model specifications, and assess goodness of fit.
Overview
In this lab you will model labor force participation as a binary outcome using logit and probit regressions. You will learn why OLS (the linear probability model) has limitations for binary outcomes, how maximum likelihood estimation works conceptually, and — most importantly — how to compute and interpret marginal effects, which are what matter for substantive conclusions.
What you will learn:
- Why the linear probability model can produce impossible predictions
- How to estimate logit and probit models and read the output
- How to compute average marginal effects (AMEs) and marginal effects at the mean (MEMs)
- How to generate predicted probabilities for specific covariate profiles
- How to compare models using pseudo-R-squared, AIC, and classification accuracy
Prerequisites: Familiarity with OLS regression and the concept of probability.
Step 1: Simulate Labor Force Participation Data
We simulate 3,000 individuals with characteristics that determine whether they participate in the labor force.
library(estimatr)
library(margins)
library(modelsummary)
set.seed(42)
n <- 3000
age <- round(pmin(pmax(rnorm(n, 40, 12), 18), 65))
educ <- round(pmin(pmax(rnorm(n, 13, 3), 6), 22))
married <- rbinom(n, 1, 0.55)
children <- pmin(rpois(n, 1.2), 5)
spouse_income <- married * rexp(n, rate = 1/30000)
xb <- -3.5 + 0.04 * age - 0.0006 * age^2 + 0.15 * educ -
0.25 * children - 0.00001 * spouse_income + 0.3 * married
prob <- 1 / (1 + exp(-xb))
lfp <- rbinom(n, 1, prob)
df <- data.frame(lfp, age, age_sq = age^2, educ,
married, children, spouse_income)
cat("Participation rate:", mean(lfp), "\n")
summary(df)
Expected output:
Participation rate: 0.612
| Variable | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| lfp | 0.61 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| age | 40.12 | 11.52 | 18.00 | 31.00 | 40.00 | 49.00 | 65.00 |
| educ | 13.05 | 2.85 | 6.00 | 11.00 | 13.00 | 15.00 | 22.00 |
| married | 0.55 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| children | 1.18 | 1.05 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 |
| spouse_income | 16,425 | 22,510 | 0.00 | 0.00 | 0.00 | 28,150 | 152,000 |
About 61% of the sample participates in the labor force.
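A quick aside: the simulation builds probabilities with the logistic CDF written out by hand; base R's `plogis()` computes the same function and can make later code more readable. A minimal check (the `xb` values here are arbitrary illustration points, not the simulated data):

```r
xb <- c(-4, -1, 0, 1, 4)         # arbitrary linear-index values
manual <- 1 / (1 + exp(-xb))     # the formula used in Step 1
builtin <- plogis(xb)            # base R logistic CDF
all.equal(manual, builtin)       # TRUE
```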
Step 2: The Linear Probability Model (Baseline)
Start with OLS on the binary outcome to see its limitations.
# Linear Probability Model
lpm <- lm_robust(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, se_type = "HC2")
summary(lpm)
# Check for impossible predictions
pred_lpm <- predict(lpm, df)
cat("Predictions < 0:", sum(pred_lpm < 0), "\n")
cat("Predictions > 1:", sum(pred_lpm > 1), "\n")
cat("Range:", round(min(pred_lpm), 3), "to", round(max(pred_lpm), 3), "\n")
Expected output:
| Variable | Coeff | Robust SE | t | p |
|---|---|---|---|---|
| Intercept | -0.0845 | 0.089 | -0.95 | 0.342 |
| age | 0.0085 | 0.003 | 2.83 | 0.005 |
| age_sq | -0.0001 | 0.000 | -3.62 | 0.000 |
| educ | 0.0335 | 0.003 | 11.17 | 0.000 |
| married | 0.0652 | 0.022 | 2.96 | 0.003 |
| children | -0.0548 | 0.008 | -6.85 | 0.000 |
| spouse_income | -0.000002 | 0.000 | -3.45 | 0.001 |
Predictions < 0: 18
Predictions > 1: 12
Range: -0.085 to 1.062
The LPM produces 30 predictions outside the [0, 1] range, illustrating its key limitation for binary outcomes.
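The robust (HC2) standard errors above are not optional styling: with a binary outcome, the LPM error variance is p(x)(1 - p(x)) by construction, so it necessarily varies with the fitted value. A self-contained toy illustration (single regressor and made-up coefficients, not the lab's data):

```r
set.seed(4)
x <- runif(2000)
y <- rbinom(2000, 1, 0.2 + 0.6 * x)    # true probability rises with x
m <- lm(y ~ x)
f <- fitted(m)
# mean squared residuals should track f * (1 - f) across fitted-value bins
bins <- cut(f, quantile(f, 0:4 / 4), include.lowest = TRUE)
cbind(mean_sq_resid = tapply(resid(m)^2, bins, mean),
      theoretical   = tapply(f * (1 - f), bins, mean))
```

The middle bins (fitted values near 0.5) show the largest residual variance, which is exactly the heteroskedasticity pattern that motivates robust standard errors for the LPM.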
Step 3: Estimate Logit and Probit Models
# Logit
logit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, family = binomial(link = "logit"))
# Probit
probit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, family = binomial(link = "probit"))
summary(logit)
summary(probit)
# Coefficient ratio (should be ~1.6)
cat("\nCoefficient ratio (logit/probit):\n")
print(round(coef(logit) / coef(probit), 2))
Expected output:
=== Logit Coefficients ===
| Variable | Coeff | SE | z | p |
|---|---|---|---|---|
| Intercept | -3.4520 | 0.468 | -7.38 | 0.000 |
| age | 0.0412 | 0.017 | 2.42 | 0.015 |
| age_sq | -0.0006 | 0.000 | -3.15 | 0.002 |
| educ | 0.1535 | 0.017 | 9.03 | 0.000 |
| married | 0.2980 | 0.112 | 2.66 | 0.008 |
| children | -0.2485 | 0.042 | -5.92 | 0.000 |
| spouse_income | -0.000010 | 0.000 | -3.25 | 0.001 |
Pseudo R-squared: 0.0852
Log-likelihood: -1785.2
=== Coefficient Ratio (logit / probit) ===
~1.60 for all coefficients (confirming the logit/probit scaling rule)
| Variable | Logit Coeff | Probit Coeff | Ratio |
|---|---|---|---|
| educ | 0.1535 | 0.0948 | 1.62 |
| children | -0.2485 | -0.1538 | 1.62 |
| married | 0.2980 | 0.1842 | 1.62 |
The logit/probit coefficient ratio is approximately 1.6, as theory predicts.
The logit coefficient on education is 0.15. How do you interpret this number?
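One way to start answering that question: a logit coefficient is a change in log-odds, so exponentiating it gives an odds ratio. A quick sketch (0.1535 is the coefficient from the table above):

```r
b_educ <- 0.1535   # logit coefficient on educ from Step 3
exp(b_educ)        # odds ratio, approximately 1.17
# Each additional year of education multiplies the odds of participation
# by about 1.17 (a roughly 17% increase in the odds). Note this is NOT
# a 17-percentage-point change in probability -- for that, see the
# marginal effects in Step 4.
```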
Step 4: Compute Marginal Effects
Marginal effects translate logit/probit coefficients into interpretable probability changes.
# Average Marginal Effects using margins package
ame_logit <- margins(logit)
summary(ame_logit)
ame_probit <- margins(probit)
summary(ame_probit)
# Compare AMEs across models
cat("\n=== Comparison of AMEs on educ ===\n")
cat("LPM:", coef(lpm)["educ"], "\n")
cat("Logit AME:", summary(ame_logit)[summary(ame_logit)$factor == "educ", "AME"], "\n")
cat("Probit AME:", summary(ame_probit)[summary(ame_probit)$factor == "educ", "AME"], "\n")
Expected output:
=== Average Marginal Effects: Logit vs. Probit vs. LPM ===
| Variable | AME (Logit) | AME (Probit) | LPM Coeff |
|---|---|---|---|
| age | 0.0088 | 0.0087 | 0.0085 |
| age_sq | -0.0001 | -0.0001 | -0.0001 |
| educ | 0.0332 | 0.0330 | 0.0335 |
| married | 0.0645 | 0.0640 | 0.0652 |
| children | -0.0538 | -0.0535 | -0.0548 |
| spouse_income | -0.000002 | -0.000002 | -0.000002 |
=== Comparison of AMEs ===
The AMEs from logit, probit, and LPM are nearly identical (within 0.005 of each other). An additional year of education raises participation probability by about 3.3 percentage points. An additional child reduces it by about 5.4 percentage points.
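To demystify what `margins()` computes for a continuous regressor in a logit, here is a hand-rolled sketch: by the chain rule, the marginal effect at each observation is the logistic density at that observation's linear index times the coefficient, and the AME is the sample average. The snippet uses its own toy data so it stands alone, and cross-checks against a numerical derivative:

```r
set.seed(1)
x <- rnorm(500)
z <- rbinom(500, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(z ~ x, family = binomial)
b  <- coef(fit)["x"]
xb <- predict(fit)                      # linear index (log-odds scale)
ame_manual <- mean(dlogis(xb) * b)      # chain rule: f(xb) * beta, averaged
# numerical check: nudge x slightly and recompute the average probability
eps <- 1e-6
ame_num <- mean((plogis(xb + b * eps) - plogis(xb)) / eps)
c(ame_manual, ame_num)                  # the two should agree closely
```

The MEM would instead evaluate `dlogis()` once, at the mean of the covariates, rather than averaging over observations.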
Step 5: Predicted Probabilities
Generating predicted probabilities for specific profiles is essential for communicating your results.
# Predicted probability for specific profiles
profiles <- data.frame(
age = c(25, 25, 45, 45),
age_sq = c(625, 625, 2025, 2025),
educ = c(12, 16, 12, 16),
married = c(0, 0, 1, 1),
children = c(0, 0, 2, 2),
spouse_income = c(0, 0, 40000, 40000)
)
profiles$pred_logit <- predict(logit, profiles, type = "response")
profiles$pred_probit <- predict(probit, profiles, type = "response")
print(profiles[, c("age", "educ", "married", "children", "pred_logit", "pred_probit")])
# Plot predicted probability vs education
educ_range <- 6:21
pred_df <- data.frame(age = 40, age_sq = 1600, educ = educ_range,
married = 0, children = 1, spouse_income = 0)
pred_df$prob <- predict(logit, pred_df, type = "response")
plot(educ_range, pred_df$prob, type = "l", lwd = 2, col = "blue",
xlab = "Years of Education", ylab = "Predicted Probability",
main = "Predicted LFP Probability by Education")
Expected output:
Predicted Probabilities:
| Age | Educ | Married | Children | Pred (Logit) | Pred (Probit) |
|---|---|---|---|---|---|
| 25 | 12 | 0 | 0 | 0.542 | 0.540 |
| 25 | 16 | 0 | 0 | 0.718 | 0.715 |
| 45 | 12 | 1 | 2 | 0.535 | 0.532 |
| 45 | 16 | 1 | 2 | 0.708 | 0.705 |
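Point predictions are more persuasive with uncertainty attached. One common approach (a sketch, not the only method): get standard errors on the link scale with `predict(..., se.fit = TRUE)`, form a normal-approximation interval there, then map through `plogis()` so the interval stays inside [0, 1]. A toy refit keeps the snippet self-contained:

```r
set.seed(2)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(0.3 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)
new <- data.frame(x = c(-1, 0, 1))
lp  <- predict(fit, newdata = new, se.fit = TRUE)  # log-odds scale
data.frame(
  prob  = plogis(lp$fit),
  lower = plogis(lp$fit - 1.96 * lp$se.fit),       # 95% CI, transformed
  upper = plogis(lp$fit + 1.96 * lp$se.fit)
)
```

Building the interval on the link scale and transforming afterward respects the nonlinearity; an interval built directly on the probability scale could spill outside [0, 1].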
Step 6: Goodness of Fit
# Pseudo R-squared (McFadden)
null_loglik <- logLik(glm(lfp ~ 1, data = df, family = binomial))
cat("McFadden Pseudo R-squared:",
    1 - as.numeric(logLik(logit)) / as.numeric(null_loglik), "\n")
# AIC / BIC
cat("Logit AIC:", AIC(logit), " BIC:", BIC(logit), "\n")
cat("Probit AIC:", AIC(probit), " BIC:", BIC(probit), "\n")
# Classification accuracy
pred_class <- as.integer(predict(logit, type = "response") > 0.5)
accuracy <- mean(pred_class == df$lfp)
cat("\nClassification accuracy:", accuracy, "\n")
# Confusion matrix
table(Actual = df$lfp, Predicted = pred_class)
# ROC AUC (requires pROC)
# library(pROC)
# roc_obj <- roc(df$lfp, predict(logit, type = "response"))
# auc(roc_obj)
Expected output:
Logit pseudo R-squared: 0.0852
Probit pseudo R-squared: 0.0848
Logit AIC: 3584.4 BIC: 3627.5
Probit AIC: 3585.1 BIC: 3628.2
Classification accuracy (logit): 0.682
Confusion Matrix:
Pred 0 Pred 1
Actual 0 652 514
Actual 1 442 1392
ROC AUC: 0.7215
| Metric | Logit | Probit |
|---|---|---|
| Pseudo R-sq | 0.085 | 0.085 |
| AIC | 3584.4 | 3585.1 |
| BIC | 3627.5 | 3628.2 |
| Accuracy | 68.2% | 68.0% |
| ROC AUC | 0.722 | 0.720 |
| | Pred 0 | Pred 1 |
|---|---|---|
| Actual 0 | 652 (TN) | 514 (FP) |
| Actual 1 | 442 (FN) | 1,392 (TP) |
Logit and probit yield virtually identical fit. The model correctly classifies about 68% of observations, and the AUC of ~0.72 indicates acceptable, though not outstanding, discrimination.
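If you prefer to avoid the pROC dependency, the AUC can be computed in base R: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one (the Wilcoxon/Mann-Whitney identity). A sketch on toy scores:

```r
set.seed(3)
score <- runif(300)                # stand-ins for predicted probabilities
y <- rbinom(300, 1, score)         # outcome correlated with the score
r  <- rank(score)                  # mid-ranks would handle ties
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc                                # 0.5 = chance, 1 = perfect ranking
```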
Your logit model has a pseudo R-squared of 0.15. Is this a 'bad' model?
Step 7: Exercises
Try these on your own:
- Odds ratios. Exponentiate the logit coefficients to obtain odds ratios. Interpret the odds ratio on `children` in plain language.
- Interaction effects. Add an interaction between `married` and `children`. Compute the marginal effect of an additional child for married vs. unmarried individuals. (Warning: interaction effects in nonlinear models require careful treatment — read Ai and Norton 2003.)
- Multinomial logit. Extend the model to three outcomes: not in labor force, employed part-time, employed full-time. Use `statsmodels.discrete.discrete_model.MNLogit` (Python), `nnet::multinom` (R), or `mlogit` (Stata).
- Out-of-sample prediction. Split the data 80/20 and evaluate the model's out-of-sample AUC. Does overfitting appear to be a problem?
Summary
In this lab you learned:
- Logit and probit model binary outcomes using S-shaped link functions that keep predicted probabilities in [0,1]
- The raw coefficients are in log-odds (logit) or z-score (probit) units — in most cases, marginal effects are reported instead
- Average marginal effects (AMEs) are generally preferred over marginal effects at the mean (MEMs)
- Logit and probit typically produce very similar marginal effects in practice
- Pseudo R-squared values are not comparable to OLS R-squared; focus on meaningful marginal effects and adequate discrimination (AUC)
- The linear probability model remains a useful quick check but has known limitations at the tails