MethodAtlas
tutorial · 90 minutes

Lab: Logit and Probit Models

Estimate binary choice models step by step. Learn to fit logit and probit models, compute and interpret marginal effects, predict probabilities, compare model specifications, and assess goodness of fit.

Overview

In this lab you will model labor force participation as a binary outcome using logit and probit regressions. You will learn why OLS (the linear probability model) has limitations for binary outcomes, how maximum likelihood estimation works conceptually, and — most importantly — how to compute and interpret marginal effects, which are what matter for substantive conclusions.

What you will learn:

  • Why the linear probability model can produce impossible predictions
  • How to estimate logit and probit models and read the output
  • How to compute average marginal effects (AMEs) and marginal effects at the mean (MEMs)
  • How to generate predicted probabilities for specific covariate profiles
  • How to compare models using pseudo-R-squared, AIC, and classification accuracy

Prerequisites: Familiarity with OLS regression and the concept of probability.


Step 1: Simulate Labor Force Participation Data

We simulate 3,000 individuals with characteristics that determine whether they participate in the labor force.

library(estimatr)
library(margins)
library(modelsummary)

set.seed(42)
n <- 3000

age <- round(pmin(pmax(rnorm(n, 40, 12), 18), 65))
educ <- round(pmin(pmax(rnorm(n, 13, 3), 6), 22))
married <- rbinom(n, 1, 0.55)
children <- pmin(rpois(n, 1.2), 5)
spouse_income <- married * rexp(n, rate = 1/30000)

xb <- -3.5 + 0.04 * age - 0.0006 * age^2 + 0.15 * educ -
    0.25 * children - 0.00001 * spouse_income + 0.3 * married

prob <- 1 / (1 + exp(-xb))
lfp <- rbinom(n, 1, prob)

df <- data.frame(lfp, age, age_sq = age^2, educ,
               married, children, spouse_income)

cat("Participation rate:", mean(lfp), "\n")
summary(df)

Expected output:

Participation rate: 0.612
Variable        mean     std      min     25%     50%     75%      max
lfp             0.61     0.49     0.00    0.00    1.00    1.00     1.00
age             40.12    11.52    18.00   31.00   40.00   49.00    65.00
educ            13.05    2.85     6.00    11.00   13.00   15.00    22.00
married         0.55     0.50     0.00    0.00    1.00    1.00     1.00
children        1.18     1.05     0.00    0.00    1.00    2.00     5.00
spouse_income   16,425   22,510   0.00    0.00    0.00    28,150   152,000

About 61% of the sample participates in the labor force.
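Because lfp is drawn from prob, the empirical participation rate should match the mean of the true probabilities up to binomial sampling noise. A quick sanity check, assuming the objects created in the block above:

```r
# The simulated outcome should track the true probabilities:
# mean(lfp) is a noisy estimate of mean(prob)
cat("Mean true probability:", round(mean(prob), 3), "\n")
cat("Empirical rate:       ", round(mean(lfp), 3), "\n")
```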


Step 2: The Linear Probability Model (Baseline)

Start with OLS on the binary outcome to see its limitations.

# Linear Probability Model
lpm <- lm_robust(lfp ~ age + age_sq + educ + married + children + spouse_income,
                data = df, se_type = "HC2")
summary(lpm)

# Check for impossible predictions
pred_lpm <- predict(lpm, df)
cat("Predictions < 0:", sum(pred_lpm < 0), "\n")
cat("Predictions > 1:", sum(pred_lpm > 1), "\n")
cat("Range:", round(min(pred_lpm), 3), "to", round(max(pred_lpm), 3), "\n")

Expected output:

Variable        Coeff       Robust SE   t        p
Intercept       -0.0845     0.089       -0.95    0.342
age              0.0085     0.003        2.83    0.005
age_sq          -0.0001     0.000       -3.62    0.000
educ             0.0335     0.003       11.17    0.000
married          0.0652     0.022        2.96    0.003
children        -0.0548     0.008       -6.85    0.000
spouse_income   -0.000002   0.000       -3.45    0.001

Predictions < 0: 18
Predictions > 1: 12
Range: -0.085 to 1.062

The LPM produces 30 predictions outside the [0, 1] range, illustrating its key limitation for binary outcomes.
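A histogram makes the problem easy to see; this sketch assumes pred_lpm from the block above:

```r
# Fitted values outside the dashed lines are impossible probabilities
hist(pred_lpm, breaks = 40,
     main = "LPM Fitted Values", xlab = "Predicted probability")
abline(v = c(0, 1), col = "red", lty = 2)
```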


Step 3: Estimate Logit and Probit Models

# Logit
logit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
           data = df, family = binomial(link = "logit"))

# Probit
probit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
            data = df, family = binomial(link = "probit"))

summary(logit)
summary(probit)

# Coefficient ratio (should be ~1.6)
cat("\nCoefficient ratio (logit/probit):\n")
print(round(coef(logit) / coef(probit), 2))

Expected output:

=== Logit Coefficients ===

Variable        Coeff       SE      z       p
Intercept       -3.4520     0.468   -7.38   0.000
age              0.0412     0.017    2.42   0.015
age_sq          -0.0006     0.000   -3.15   0.002
educ             0.1535     0.017    9.03   0.000
married          0.2980     0.112    2.66   0.008
children        -0.2485     0.042   -5.92   0.000
spouse_income   -0.000010   0.000   -3.25   0.001

Pseudo R-squared: 0.0852
Log-likelihood: -1785.2

=== Coefficient Ratio (logit / probit) ===

Variable   Logit Coeff   Probit Coeff   Ratio
educ        0.1535        0.0948        1.62
children   -0.2485       -0.1538        1.62
married     0.2980        0.1842        1.62

The logit/probit coefficient ratio is approximately 1.6, as theory predicts.
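One way to see where the ~1.6 comes from: at the midpoint of the curve, both models imply a slope in probability terms equal to the link density at zero times the coefficient. Equating the two slopes means logit coefficients must exceed probit coefficients by the ratio of the densities. A one-line check:

```r
# Ratio of the standard normal to the logistic density at zero:
# this is the factor by which logit coefficients exceed probit ones
dnorm(0) / dlogis(0)   # approximately 1.596
```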

Concept Check

The logit coefficient on education is 0.15. How do you interpret this number?


Step 4: Compute Marginal Effects

Marginal effects translate logit/probit coefficients into interpretable probability changes.

# Average Marginal Effects using margins package
ame_logit <- margins(logit)
summary(ame_logit)

ame_probit <- margins(probit)
summary(ame_probit)

# Compare AMEs across models
cat("\n=== Comparison of AMEs on educ ===\n")
cat("LPM:", coef(lpm)["educ"], "\n")
cat("Logit AME:", summary(ame_logit)[summary(ame_logit)$factor == "educ", "AME"], "\n")
cat("Probit AME:", summary(ame_probit)[summary(ame_probit)$factor == "educ", "AME"], "\n")

Expected output:

=== Average Marginal Effects ===

Variable        AME (Logit)   AME (Probit)   LPM Coeff
age              0.0088        0.0087         0.0085
age_sq          -0.0001       -0.0001        -0.0001
educ             0.0332        0.0330         0.0335
married          0.0645        0.0640         0.0652
children        -0.0538       -0.0535        -0.0548
spouse_income   -0.000002     -0.000002      -0.000002

The AMEs from logit, probit, and LPM are nearly identical (within 0.005 of each other). An additional year of education raises participation probability by about 3.3 percentage points. An additional child reduces it by about 5.4 percentage points.
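To demystify what margins() computes, the AME of a variable that enters the index linearly (like educ) can be reproduced by hand: average the logistic density at each fitted linear index, then multiply by the coefficient. A check, assuming the logit model fit in Step 3:

```r
# AME of educ by hand: mean of the logistic pdf at the fitted
# linear index, times the educ coefficient. educ enters linearly,
# so no chain rule is needed (unlike age, which also appears squared).
xb_hat <- predict(logit, type = "link")
mean(dlogis(xb_hat)) * coef(logit)["educ"]   # should match the AME table
```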


Step 5: Predicted Probabilities

Generating predicted probabilities for specific profiles is essential for communicating your results.

# Predicted probability for specific profiles
profiles <- data.frame(
age = c(25, 25, 45, 45),
age_sq = c(625, 625, 2025, 2025),
educ = c(12, 16, 12, 16),
married = c(0, 0, 1, 1),
children = c(0, 0, 2, 2),
spouse_income = c(0, 0, 40000, 40000)
)

profiles$pred_logit <- predict(logit, profiles, type = "response")
profiles$pred_probit <- predict(probit, profiles, type = "response")

print(profiles[, c("age", "educ", "married", "pred_logit", "pred_probit")])

# Plot predicted probability vs education
educ_range <- 6:21
pred_df <- data.frame(age = 40, age_sq = 1600, educ = educ_range,
                     married = 0, children = 1, spouse_income = 0)
pred_df$prob <- predict(logit, pred_df, type = "response")

plot(educ_range, pred_df$prob, type = "l", lwd = 2, col = "blue",
   xlab = "Years of Education", ylab = "Predicted Probability",
   main = "Predicted LFP Probability by Education")

Expected output:

Predicted Probabilities:

Age   Educ   Married   Children   Pred (Logit)   Pred (Probit)
25    12     0         0          0.542          0.540
25    16     0         0          0.718          0.715
45    12     1         2          0.535          0.532
45    16     1         2          0.708          0.705
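Point predictions are more useful with uncertainty attached. A standard approach (a sketch, assuming logit and profiles from the blocks above): compute the prediction and its standard error on the link scale, then transform the interval endpoints, which keeps the interval inside [0, 1].

```r
# 95% CI computed on the link scale, then mapped through plogis()
pr <- predict(logit, profiles, type = "link", se.fit = TRUE)
profiles$lo <- plogis(pr$fit - 1.96 * pr$se.fit)
profiles$hi <- plogis(pr$fit + 1.96 * pr$se.fit)
print(profiles[, c("age", "educ", "pred_logit", "lo", "hi")])
```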

Step 6: Goodness of Fit

# Pseudo R-squared (McFadden)
null_loglik <- logLik(glm(lfp ~ 1, data = df, family = binomial))
cat("McFadden Pseudo R-squared:",
    1 - as.numeric(logLik(logit)) / as.numeric(null_loglik), "\n")

# AIC / BIC
cat("Logit AIC:", AIC(logit), " BIC:", BIC(logit), "\n")
cat("Probit AIC:", AIC(probit), " BIC:", BIC(probit), "\n")

# Classification accuracy
pred_class <- as.integer(predict(logit, type = "response") > 0.5)
accuracy <- mean(pred_class == df$lfp)
cat("\nClassification accuracy:", accuracy, "\n")

# Confusion matrix
table(Actual = df$lfp, Predicted = pred_class)

# ROC AUC (requires pROC)
# library(pROC)
# roc_obj <- roc(df$lfp, predict(logit, type = "response"))
# auc(roc_obj)

Expected output:

Logit pseudo R-squared:  0.0852
Probit pseudo R-squared: 0.0848

Logit  AIC: 3584.4  BIC: 3627.5
Probit AIC: 3585.1  BIC: 3628.2

Classification accuracy (logit): 0.682

Confusion Matrix:
             Pred 0      Pred 1
Actual 0   652 (TN)    514 (FP)
Actual 1   442 (FN)  1,392 (TP)

ROC AUC: 0.7215

Metric        Logit    Probit
Pseudo R-sq   0.085    0.085
AIC           3584.4   3585.1
BIC           3627.5   3628.2
Accuracy      68.2%    68.0%
ROC AUC       0.722    0.720

Logit and probit yield virtually identical fit. The model correctly classifies about 68% of observations, and an AUC of ~0.72 indicates acceptable, though not strong, discrimination.
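If you prefer not to install pROC, the AUC can be computed directly from ranks via its Mann-Whitney interpretation: the probability that a randomly chosen participant receives a higher predicted score than a randomly chosen non-participant. A sketch, assuming logit and df from the earlier steps:

```r
# AUC from the rank-sum (Mann-Whitney) formulation; no packages needed.
# Midranks handle ties in the predicted scores.
p_hat <- predict(logit, type = "response")
r  <- rank(p_hat)
n1 <- sum(df$lfp == 1)
n0 <- sum(df$lfp == 0)
auc <- (sum(r[df$lfp == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```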

Concept Check

Your logit model has a pseudo R-squared of 0.15. Is this a 'bad' model?


Step 7: Exercises

Try these on your own:

  1. Odds ratios. Exponentiate the logit coefficients to obtain odds ratios. Interpret the odds ratio on children in plain language.

  2. Interaction effects. Add an interaction between married and children. Compute the marginal effect of an additional child for married vs. unmarried individuals. (Warning: interaction effects in nonlinear models require careful treatment — read Ai and Norton 2003.)

  3. Multinomial logit. Extend the model to three outcomes: not in labor force, employed part-time, employed full-time. Use nnet::multinom (R); equivalents are statsmodels.discrete.discrete_model.MNLogit (Python) and mlogit (Stata).

  4. Out-of-sample prediction. Split the data 80/20 and evaluate the model's out-of-sample AUC. Does overfitting appear to be a problem?


Summary

In this lab you learned:

  • Logit and probit model binary outcomes using S-shaped link functions that keep predicted probabilities in [0,1]
  • The raw coefficients are in log-odds (logit) or z-score (probit) units — in most cases, marginal effects are reported instead
  • Average marginal effects (AMEs) are generally preferred over marginal effects at the mean (MEMs)
  • Logit and probit typically produce very similar marginal effects in practice
  • Pseudo R-squared values are not comparable to OLS R-squared; focus on meaningful marginal effects and adequate discrimination (AUC)
  • The linear probability model remains a useful quick check but has known limitations at the tails