Lab: Logit and Probit Models
Estimate binary choice models step by step. Learn to fit logit and probit models, compute and interpret marginal effects, predict probabilities, compare model specifications, and assess goodness of fit.
Overview
In this lab you will model labor force participation as a binary outcome using logit and probit regressions. You will learn why OLS (the linear probability model) has limitations for binary outcomes, how maximum likelihood estimation works conceptually, and — most importantly — how to compute and interpret marginal effects, which are what matter for substantive conclusions.
What you will learn:
- Why the linear probability model can produce impossible predictions
- How to estimate logit and probit models and read the output
- How to compute average marginal effects (AMEs) and marginal effects at the mean (MEMs)
- How to generate predicted probabilities for specific covariate profiles
- How to compare models using pseudo-R-squared, AIC, and classification accuracy
Prerequisites: Familiarity with OLS regression and the concept of probability.
Step 1: Simulate Labor Force Participation Data
We simulate 3,000 individuals with characteristics that determine whether they participate in the labor force.
library(estimatr)
library(margins)
library(modelsummary)
set.seed(42)
n <- 3000
age <- round(pmin(pmax(rnorm(n, 40, 12), 18), 65))
educ <- round(pmin(pmax(rnorm(n, 13, 3), 6), 22))
married <- rbinom(n, 1, 0.55)
children <- pmin(rpois(n, 1.2), 5)
spouse_income <- married * rexp(n, rate = 1/30000)
xb <- -3.5 + 0.04 * age - 0.0006 * age^2 + 0.15 * educ -
0.25 * children - 0.00001 * spouse_income + 0.3 * married
prob <- 1 / (1 + exp(-xb))
lfp <- rbinom(n, 1, prob)
df <- data.frame(lfp, age, age_sq = age^2, educ,
married, children, spouse_income)
cat("Participation rate:", mean(lfp), "\n")
summary(df)
Expected output:
Participation rate: 0.612
| Variable | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| lfp | 0.61 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| age | 40.12 | 11.52 | 18.00 | 31.00 | 40.00 | 49.00 | 65.00 |
| educ | 13.05 | 2.85 | 6.00 | 11.00 | 13.00 | 15.00 | 22.00 |
| married | 0.55 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| children | 1.18 | 1.05 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 |
| spouse_income | 16,425 | 22,510 | 0.00 | 0.00 | 0.00 | 28,150 | 152,000 |
About 61% of the sample participates in the labor force.
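A quick aside: the simulation builds probabilities with the logistic CDF written out by hand; base R's `plogis()` computes the same function and can make later code more readable. A minimal check (the `xb` values here are arbitrary illustration points, not the simulated data):

```r
xb <- c(-4, -1, 0, 1, 4)         # arbitrary linear-index values
manual <- 1 / (1 + exp(-xb))     # the formula used in Step 1
builtin <- plogis(xb)            # base R logistic CDF
all.equal(manual, builtin)       # TRUE
```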
Step 2: The Linear Probability Model (Baseline)
Start with OLS on the binary outcome to see its limitations.
# Linear Probability Model
lpm <- lm_robust(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, se_type = "HC2")
summary(lpm)
# Check for impossible predictions
pred_lpm <- predict(lpm, df)
cat("Predictions < 0:", sum(pred_lpm < 0), "\n")
cat("Predictions > 1:", sum(pred_lpm > 1), "\n")
cat("Range:", round(min(pred_lpm), 3), "to", round(max(pred_lpm), 3), "\n")
Expected output:
| Variable | Coeff | Robust SE | t | p |
|---|---|---|---|---|
| Intercept | -0.0845 | 0.089 | -0.95 | 0.342 |
| age | 0.0085 | 0.003 | 2.83 | 0.005 |
| age_sq | -0.0001 | 0.000 | -3.62 | 0.000 |
| educ | 0.0335 | 0.003 | 11.17 | 0.000 |
| married | 0.0652 | 0.022 | 2.96 | 0.003 |
| children | -0.0548 | 0.008 | -6.85 | 0.000 |
| spouse_income | -0.000002 | 0.000 | -3.45 | 0.001 |
Predictions < 0: 18
Predictions > 1: 12
Range: -0.085 to 1.062
The LPM produces 30 predictions outside the [0, 1] range, illustrating its key limitation for binary outcomes.
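The robust (HC2) standard errors above are not optional styling: with a binary outcome, the LPM error variance is p(x)(1 - p(x)) by construction, so it necessarily varies with the fitted value. A self-contained toy illustration (single regressor and made-up coefficients, not the lab's data):

```r
set.seed(4)
x <- runif(2000)
y <- rbinom(2000, 1, 0.2 + 0.6 * x)    # true probability rises with x
m <- lm(y ~ x)
f <- fitted(m)
# mean squared residuals should track f * (1 - f) across fitted-value bins
bins <- cut(f, quantile(f, 0:4 / 4), include.lowest = TRUE)
cbind(mean_sq_resid = tapply(resid(m)^2, bins, mean),
      theoretical   = tapply(f * (1 - f), bins, mean))
```

The middle bins (fitted values near 0.5) show the largest residual variance, which is exactly the heteroskedasticity pattern that motivates robust standard errors for the LPM.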
Step 3: Estimate Logit and Probit Models
# Logit
logit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, family = binomial(link = "logit"))
# Probit
probit <- glm(lfp ~ age + age_sq + educ + married + children + spouse_income,
data = df, family = binomial(link = "probit"))
summary(logit)
summary(probit)
# Coefficient ratio (should be ~1.6)
cat("\nCoefficient ratio (logit/probit):\n")
print(round(coef(logit) / coef(probit), 2))
Expected output:
=== Logit Coefficients ===
| Variable | Coeff | SE | z | p |
|---|---|---|---|---|
| Intercept | -3.4520 | 0.468 | -7.38 | 0.000 |
| age | 0.0412 | 0.017 | 2.42 | 0.015 |
| age_sq | -0.0006 | 0.000 | -3.15 | 0.002 |
| educ | 0.1535 | 0.017 | 9.03 | 0.000 |
| married | 0.2980 | 0.112 | 2.66 | 0.008 |
| children | -0.2485 | 0.042 | -5.92 | 0.000 |
| spouse_income | -0.000010 | 0.000 | -3.25 | 0.001 |
Pseudo R-squared: 0.0852
Log-likelihood: -1785.2
=== Coefficient Ratio (logit / probit) ===
~1.60 for all coefficients (confirming the logit/probit scaling rule)
| Variable | Logit Coeff | Probit Coeff | Ratio |
|---|---|---|---|
| educ | 0.1535 | 0.0948 | 1.62 |
| children | -0.2485 | -0.1538 | 1.62 |
| married | 0.2980 | 0.1842 | 1.62 |
The logit/probit coefficient ratio is approximately 1.6, as theory predicts.
The logit coefficient on education is 0.15. How do you interpret this number?
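One way to start answering that question: a logit coefficient is a change in log-odds, so exponentiating it gives an odds ratio. A quick sketch (0.1535 is the coefficient from the table above):

```r
b_educ <- 0.1535   # logit coefficient on educ from Step 3
exp(b_educ)        # odds ratio, approximately 1.17
# Each additional year of education multiplies the odds of participation
# by about 1.17 (a roughly 17% increase in the odds). Note this is NOT
# a 17-percentage-point change in probability -- for that, see the
# marginal effects in Step 4.
```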
Step 4: Compute Marginal Effects
Marginal effects translate logit/probit coefficients into interpretable probability changes.
# Average Marginal Effects using margins package
ame_logit <- margins(logit)
summary(ame_logit)
ame_probit <- margins(probit)
summary(ame_probit)
# Compare AMEs across models
cat("\n=== Comparison of AMEs on educ ===\n")
cat("LPM:", coef(lpm)["educ"], "\n")
cat("Logit AME:", summary(ame_logit)[summary(ame_logit)$factor == "educ", "AME"], "\n")
cat("Probit AME:", summary(ame_probit)[summary(ame_probit)$factor == "educ", "AME"], "\n")
Expected output:
=== Average Marginal Effects: Logit vs. Probit vs. LPM ===
| Variable | AME (Logit) | AME (Probit) | LPM Coeff |
|---|---|---|---|
| age | 0.0088 | 0.0087 | 0.0085 |
| age_sq | -0.0001 | -0.0001 | -0.0001 |
| educ | 0.0332 | 0.0330 | 0.0335 |
| married | 0.0645 | 0.0640 | 0.0652 |
| children | -0.0538 | -0.0535 | -0.0548 |
| spouse_income | -0.000002 | -0.000002 | -0.000002 |
=== Comparison of AMEs ===
The AMEs from logit, probit, and LPM are nearly identical (within 0.005 of each other). An additional year of education raises participation probability by about 3.3 percentage points. An additional child reduces it by about 5.4 percentage points.
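To demystify what `margins()` computes for a continuous regressor in a logit, here is a hand-rolled sketch: by the chain rule, the marginal effect at each observation is the logistic density at that observation's linear index times the coefficient, and the AME is the sample average. The snippet uses its own toy data so it stands alone, and cross-checks against a numerical derivative:

```r
set.seed(1)
x <- rnorm(500)
z <- rbinom(500, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(z ~ x, family = binomial)
b  <- coef(fit)["x"]
xb <- predict(fit)                      # linear index (log-odds scale)
ame_manual <- mean(dlogis(xb) * b)      # chain rule: f(xb) * beta, averaged
# numerical check: nudge x slightly and recompute the average probability
eps <- 1e-6
ame_num <- mean((plogis(xb + b * eps) - plogis(xb)) / eps)
c(ame_manual, ame_num)                  # the two should agree closely
```

The MEM would instead evaluate `dlogis()` once, at the mean of the covariates, rather than averaging over observations.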
Step 5: Predicted Probabilities
Generating predicted probabilities for specific profiles is essential for communicating your results.
# Predicted probability for specific profiles
profiles <- data.frame(
age = c(25, 25, 45, 45),
age_sq = c(625, 625, 2025, 2025),
educ = c(12, 16, 12, 16),
married = c(0, 0, 1, 1),
children = c(0, 0, 2, 2),
spouse_income = c(0, 0, 40000, 40000)
)
profiles$pred_logit <- predict(logit, profiles, type = "response")
profiles$pred_probit <- predict(probit, profiles, type = "response")
print(profiles[, c("age", "educ", "married", "children", "pred_logit", "pred_probit")])
# Plot predicted probability vs education
educ_range <- 6:21
pred_df <- data.frame(age = 40, age_sq = 1600, educ = educ_range,
married = 0, children = 1, spouse_income = 0)
pred_df$prob <- predict(logit, pred_df, type = "response")
plot(educ_range, pred_df$prob, type = "l", lwd = 2, col = "blue",
xlab = "Years of Education", ylab = "Predicted Probability",
main = "Predicted LFP Probability by Education")
Expected output:
Predicted Probabilities:
| Age | Educ | Married | Children | Pred (Logit) | Pred (Probit) |
|---|---|---|---|---|---|
| 25 | 12 | 0 | 0 | 0.542 | 0.540 |
| 25 | 16 | 0 | 0 | 0.718 | 0.715 |
| 45 | 12 | 1 | 2 | 0.535 | 0.532 |
| 45 | 16 | 1 | 2 | 0.708 | 0.705 |
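Point predictions are more persuasive with uncertainty attached. One common approach (a sketch, not the only method): get standard errors on the link scale with `predict(..., se.fit = TRUE)`, form a normal-approximation interval there, then map through `plogis()` so the interval stays inside [0, 1]. A toy refit keeps the snippet self-contained:

```r
set.seed(2)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(0.3 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)
new <- data.frame(x = c(-1, 0, 1))
lp  <- predict(fit, newdata = new, se.fit = TRUE)  # log-odds scale
data.frame(
  prob  = plogis(lp$fit),
  lower = plogis(lp$fit - 1.96 * lp$se.fit),       # 95% CI, transformed
  upper = plogis(lp$fit + 1.96 * lp$se.fit)
)
```

Building the interval on the link scale and transforming afterward respects the nonlinearity; an interval built directly on the probability scale could spill outside [0, 1].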
Step 6: Goodness of Fit
# Pseudo R-squared (McFadden)
null_loglik <- logLik(glm(lfp ~ 1, data = df, family = binomial))
cat("McFadden Pseudo R-squared:",
    1 - as.numeric(logLik(logit)) / as.numeric(null_loglik), "\n")
# AIC / BIC
cat("Logit AIC:", AIC(logit), " BIC:", BIC(logit), "\n")
cat("Probit AIC:", AIC(probit), " BIC:", BIC(probit), "\n")
# Classification accuracy
pred_class <- as.integer(predict(logit, type = "response") > 0.5)
accuracy <- mean(pred_class == df$lfp)
cat("\nClassification accuracy:", accuracy, "\n")
# Confusion matrix
table(Actual = df$lfp, Predicted = pred_class)
# ROC AUC (requires pROC)
# library(pROC)
# roc_obj <- roc(df$lfp, predict(logit, type = "response"))
# auc(roc_obj)
Expected output:
Logit pseudo R-squared: 0.0852
Probit pseudo R-squared: 0.0848
Logit AIC: 3584.4 BIC: 3627.5
Probit AIC: 3585.1 BIC: 3628.2
Classification accuracy (logit): 0.682
Confusion Matrix:
Pred 0 Pred 1
Actual 0 652 514
Actual 1 442 1392
ROC AUC: 0.7215
| Metric | Logit | Probit |
|---|---|---|
| Pseudo R-sq | 0.085 | 0.085 |
| AIC | 3584.4 | 3585.1 |
| BIC | 3627.5 | 3628.2 |
| Accuracy | 68.2% | 68.0% |
| ROC AUC | 0.722 | 0.720 |
| | Pred 0 | Pred 1 |
|---|---|---|
| Actual 0 | 652 (TN) | 514 (FP) |
| Actual 1 | 442 (FN) | 1,392 (TP) |
Logit and probit yield virtually identical fit. The model correctly classifies about 68% of observations, and the AUC of ~0.72 indicates acceptable, though not outstanding, discrimination.
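If you prefer to avoid the pROC dependency, the AUC can be computed in base R: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one (the Wilcoxon/Mann-Whitney identity). A sketch on toy scores:

```r
set.seed(3)
score <- runif(300)                # stand-ins for predicted probabilities
y <- rbinom(300, 1, score)         # outcome correlated with the score
r  <- rank(score)                  # mid-ranks would handle ties
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc                                # 0.5 = chance, 1 = perfect ranking
```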
Your logit model has a pseudo R-squared of 0.15. Is this a 'bad' model?
Step 7: Exercises
Try these on your own:
- Odds ratios. Exponentiate the logit coefficients to obtain odds ratios. Interpret the odds ratio on `children` in plain language.
- Interaction effects. Add an interaction between `married` and `children`. Compute the marginal effect of an additional child for married vs. unmarried individuals. (Warning: interaction effects in nonlinear models require careful treatment — read Ai and Norton 2003.)
- Multinomial logit. Extend the model to three outcomes: not in labor force, employed part-time, employed full-time. Use `statsmodels.discrete.discrete_model.MNLogit` (Python), `nnet::multinom` (R), or `mlogit` (Stata).
- Out-of-sample prediction. Split the data 80/20 and evaluate the model's out-of-sample AUC. Does overfitting appear to be a problem?
Summary
In this lab you learned:
- Logit and probit model binary outcomes using S-shaped link functions that keep predicted probabilities in [0,1]
- The raw coefficients are in log-odds (logit) or z-score (probit) units — in most cases, marginal effects are reported instead
- Average marginal effects (AMEs) are generally preferred over marginal effects at the mean (MEMs)
- Logit and probit typically produce very similar marginal effects in practice
- Pseudo R-squared values are not comparable to OLS R-squared; focus on meaningful marginal effects and adequate discrimination (AUC)
- The linear probability model remains a useful quick check but has known limitations at the tails