MethodAtlas
Discrete Choice · Established

Logit / Probit

Models for binary outcomes — when your dependent variable is yes/no, pass/fail, or adopt/don't adopt.

Quick Reference

When to Use
When your outcome variable is binary (0/1, yes/no, adopt/don't adopt) and the linear probability model is inadequate, especially when predicted probabilities near 0 or 1 matter.
Key Assumption
Correct specification of the link function (logistic or normal CDF). For causal interpretation, the same exogeneity condition as OLS applies.
Common Mistake
Reporting logit coefficients as if they were marginal effects — logit coefficients are in log-odds units, not probability units. Computing and reporting average marginal effects is standard practice.
Estimated Time
2.5 hours

One-Line Implementation

Stata: logit y x1 x2, vce(robust)
R: margins::margins(glm(y ~ x1 + x2, family = binomial(link = 'logit'), data = df))
Python: smf.logit('y ~ x1 + x2', data=df).fit().get_margeff().summary()


Motivating Example: Firm Adoption of a New Technology

Imagine you are studying why some firms adopt a new manufacturing technology and others do not. Your outcome variable is binary: Y_i = 1 if firm i adopts, Y_i = 0 otherwise. You want to know how firm size, R&D spending, and industry competition affect the probability of adoption.

You could try running OLS — regressing the 0/1 outcome on your covariates. This approach is called the linear probability model (LPM), and it is a reasonable starting point. But it has problems. The predicted probabilities can fall outside [0, 1], the error term is necessarily heteroskedastic, and the marginal effect of a covariate is assumed to be constant regardless of where you are on the probability scale.

Logit and probit models address these problems by modeling the probability through a nonlinear link function that keeps predictions bounded between 0 and 1.


A. Overview: Binary Outcome Models

The Problem with OLS on Binary Outcomes

When you run Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i with Y_i \in \{0, 1\}, you are modeling:

E[Y_i | X_i] = P(Y_i = 1 | X_i) = \beta_0 + \beta_1 X_i

This equation is the LPM. It works surprisingly well in many cases, especially near the center of the data. But at the extremes, it can predict probabilities below 0 or above 1, which is nonsensical.
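A quick simulation makes the boundary problem concrete. This is an illustrative sketch (the DGP, sample size, and variable names are invented here, not taken from the chapter's example): fit OLS to a binary outcome generated from a logistic DGP and count fitted "probabilities" that escape [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)                    # predictor
p_true = 1 / (1 + np.exp(-1.5 * x))          # true P(Y=1|X), logistic link
y = rng.binomial(1, p_true)                  # observed binary outcome

# Linear probability model: OLS of y on a constant and x
X = np.column_stack([np.ones(n), x])
beta_lpm, *_ = np.linalg.lstsq(X, y, rcond=None)
p_hat = X @ beta_lpm                         # LPM fitted "probabilities"

n_outside = int(np.sum((p_hat < 0) | (p_hat > 1)))
print(f"LPM slope: {beta_lpm[1]:.3f}")
print(f"Fitted values outside [0, 1]: {n_outside} of {n}")
```

Near the center of the x range the LPM fit tracks the true probabilities well; the out-of-bounds predictions all occur at the extremes, which is exactly where the linear approximation breaks down.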

Link Functions: The Core Idea

Both logit and probit model the probability through a nonlinear transformation:

P(Y_i = 1 | X_i) = G(X_i'\beta)

where G(\cdot) is a function that maps any real number to the (0, 1) interval.

  • Logit uses the logistic function: G(z) = \frac{e^z}{1 + e^z} = \Lambda(z)
  • Probit uses the standard normal CDF: G(z) = \Phi(z)

Both are S-shaped curves. They are nearly identical in practice — probit is slightly steeper at the center and slightly thinner at the tails. In most applications, they give very similar results.
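The similarity is easy to check numerically. A common rule of thumb is that logit coefficients are roughly 1.6 to 1.8 times the corresponding probit coefficients, because the logistic CDF at z approximately equals the normal CDF at z/1.7. A minimal sketch (stdlib only, using the error function for the normal CDF):

```python
import math

def logistic_cdf(z):
    # Lambda(z) = e^z / (1 + e^z)
    return 1 / (1 + math.exp(-z))

def normal_cdf(z):
    # Phi(z) via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print("  z   Lambda(z)   Phi(z)   Phi(z/1.7)")
for z in [-4, -2, -1, 0, 1, 2, 4]:
    print(f"{z:4d}  {logistic_cdf(z):9.3f}  {normal_cdf(z):7.3f}  {normal_cdf(z/1.7):9.3f}")
```

Both links cross 0.5 at z = 0, and after rescaling by 1.7 the two curves differ by less than about 0.01 everywhere, which is why the choice rarely matters in practice.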

When Does It Matter Which You Choose?

In most applications, it does not. The choice between logit and probit rarely changes substantive conclusions. Logit is more common in epidemiology and management because of the convenient odds ratio interpretation. Probit is more common in economics, partly by convention and partly because it connects naturally to latent variable models.




B. Identification

The identification strategy for logit/probit is the same as for OLS: you need exogeneity of the regressors. The logit/probit framework does not solve identification problems; it only handles the functional form for binary outcomes.

E[\varepsilon_i | X_i] = 0

If your regressors are endogenous, you need an identification strategy (IV, DiD, matching, etc.) combined with the appropriate binary outcome model. For IV with binary outcomes, see the bivariate probit or IV-probit approach. It is also advisable to consider sensitivity analysis to assess how robust your estimates are to potential unobserved confounders.

The Latent Variable Interpretation

Both models can be motivated by a latent variable Y_i^*:

Y_i^* = X_i'\beta + \varepsilon_i, \quad Y_i = \mathbf{1}(Y_i^* > 0)

If \varepsilon_i follows a logistic distribution, you get logit. If \varepsilon_i follows a standard normal, you get probit. The firm adopts the technology when the latent net benefit exceeds zero.
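The latent-variable story can be verified by simulation. This sketch uses an arbitrary index value of 1.0 (a hypothetical firm, not one from the text): draw logistic errors, apply the threshold rule, and the empirical adoption rate should match Λ(X'β).

```python
import numpy as np

rng = np.random.default_rng(42)
index = 1.0                                        # X_i'beta for a hypothetical firm
eps = rng.logistic(loc=0, scale=1, size=200_000)   # logistic errors -> logit model

y_star = index + eps                # latent net benefit of adopting
adopt = (y_star > 0)                # observed decision: adopt iff Y* > 0

empirical = adopt.mean()
implied = 1 / (1 + np.exp(-index))  # Lambda(index), the logit probability
print(f"empirical adoption rate: {empirical:.3f}")
print(f"Lambda(index):           {implied:.3f}")
```

Swapping `rng.logistic` for `rng.standard_normal` and `Lambda` for the normal CDF gives the probit version of the same check.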


C. Visual Intuition

Think of the probability curve as a hill. At the bottom (low probability of adoption), even a large change in firm size barely moves the probability — you are pushing against inertia. At the top (high probability), the same is true — most firms have already adopted. The steepest part of the hill is in the middle, around 50% probability. This middle region is where a change in X has the biggest effect on the probability.

This nonlinearity is why marginal effects depend on where you evaluate them. A one-unit increase in firm size might raise adoption probability by 8 percentage points for a mid-sized firm (on the steep part of the curve) but only 2 percentage points for a very large firm (on the flat part).

Interactive Simulation

Logit Marginal Effects

The marginal effect of X on P(Y=1) is not constant in logit: it peaks near the 50% baseline probability and shrinks toward zero at the extremes, unlike OLS where the marginal effect equals the coefficient everywhere.


Computed Results

  • Baseline probability P(Y=1): 0.500
  • Marginal effect (AME at baseline): 0.375
  • Peak ME (at p = 0.5, equal to β/4): 0.375
Interactive Simulation

Why Logit / Probit?

Binary DGP: P(Y=1|X) = sigmoid(0.0 + 1.5 · X). N = 200. Comparing average marginal effects (AMEs) across estimators. LPM produces 34 predictions outside [0, 1].

[Figure: fitted P(Y = 1 | X) against the predictor X for the LPM, logit, and probit, overlaid on the true DGP curve]

Estimation Results

| Estimator | β̂ (AME) | SE | 95% CI | Bias |
| --- | --- | --- | --- | --- |
| LPM (closest) | 0.138 | 0.010 | [0.12, 0.16] | -0.000 |
| Logit | 0.136 | 0.019 | [0.10, 0.17] | -0.002 |
| Probit | 0.135 | 0.016 | [0.10, 0.17] | -0.003 |
| True β | 0.138 | | | |
Parameters

  • N = 200 (number of observations)
  • β = 1.5 (coefficient in the latent index; steeper = more extreme probabilities)
  • Intercept = 0.0 (shifts the probability curve left/right)

Why the difference?

The Linear Probability Model predicts outside [0, 1] for 34 of 200 observations (18 below 0, 16 above 1). These nonsensical probabilities are a fundamental problem with applying OLS to binary outcomes. Both logit and probit correctly bound predictions to [0, 1] and model the inherent nonlinearity of binary outcomes. The table compares average marginal effects (AMEs) rather than raw coefficients, since the logit slope (log-odds), probit slope (latent index), and LPM slope (linear probability) are not on the same scale; AMEs express each estimator's effect as the average change in P(Y=1) for a unit increase in X. On that scale, all three estimators land close to the truth in this draw, a reminder that the LPM often approximates the true AME well even when its pointwise predictions are nonsensical.


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: We find the coefficients that make the observed data most likely, by maximizing the probability of seeing the 1s and 0s we actually observe.

For a binary outcome Y_i \in \{0, 1\} with probability p_i = P(Y_i = 1 | X_i) = \Lambda(X_i'\beta), the likelihood for observation i is:

L_i(\beta) = p_i^{Y_i} (1 - p_i)^{1 - Y_i}

The log-likelihood for the full sample is:

\ell(\beta) = \sum_{i=1}^{n} \left[ Y_i \ln(\Lambda(X_i'\beta)) + (1 - Y_i) \ln(1 - \Lambda(X_i'\beta)) \right]

Taking the derivative and using the fact that \Lambda'(z) = \Lambda(z)(1 - \Lambda(z)):

\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} (Y_i - \Lambda(X_i'\beta)) X_i = 0

These first-order conditions have no closed-form solution; the likelihood must be maximized numerically via Newton-Raphson or iteratively reweighted least squares (IRLS).
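The Newton-Raphson update fits in a few lines. This is a sketch on simulated data (the sample size and true coefficients here are invented): iterate β ← β + (X'WX)⁻¹ X'(y − p), where W = diag(p(1 − p)) comes from the negative Hessian, until the score X'(y − p) is numerically zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)                      # start from zero
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))     # Lambda(X'beta)
    score = X.T @ (y - p)               # gradient of the log-likelihood
    W = p * (1 - p)                     # Lambda'(X'beta): the IRLS weights
    hessian = X.T @ (X * W[:, None])    # negative Hessian: X' W X
    beta = beta + np.linalg.solve(hessian, score)
    if np.max(np.abs(score)) < 1e-8:    # score ~ 0 at the maximum
        break

print("Logit MLE:", beta.round(3))
```

Convergence is typically reached in a handful of iterations; `glm` in R and `smf.logit` in Python run essentially this loop (as IRLS) under the hood.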

Marginal effects: The partial effect of X_j on the probability is:

\frac{\partial P(Y_i = 1 | X_i)}{\partial X_{ij}} = \Lambda(X_i'\beta)(1 - \Lambda(X_i'\beta)) \cdot \beta_j = \lambda(X_i'\beta) \cdot \beta_j

This expression depends on X_i, which is why you must evaluate it at specific values or average it across the sample.
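In code, the average marginal effect simply averages λ(X_i'β)·β_j over the sample. A sketch with an assumed coefficient vector (these are made-up numbers, not estimates from real data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.0, 1.5])           # assumed logit coefficients

lam_cdf = 1 / (1 + np.exp(-X @ beta))  # Lambda(X_i'beta)
density = lam_cdf * (1 - lam_cdf)      # lambda(X_i'beta), peaks at 0.25

me_i = density * beta[1]               # observation-level marginal effects
ame = me_i.mean()                      # average marginal effect

print(f"AME of X on P(Y=1):        {ame:.3f}")
print(f"Upper bound (beta/4):      {beta[1] / 4:.3f}")
```

Because the logistic density never exceeds 1/4, the AME is always below β/4; the gap between the two is a quick gauge of how much of the sample sits away from the steep middle of the curve.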


E. Implementation

library(margins)

# Logit
logit_fit <- glm(adopt ~ firm_size + rd_spending + competition,
                 family = binomial(link = "logit"), data = df)
summary(logit_fit)

# Average marginal effects
ame <- margins(logit_fit)
summary(ame)

# Odds ratios
exp(coef(logit_fit))
exp(confint(logit_fit))

# Probit
probit_fit <- glm(adopt ~ firm_size + rd_spending + competition,
                  family = binomial(link = "probit"), data = df)
summary(margins(probit_fit))

Requires the margins package.

F. Diagnostics and Model Fit

Pseudo R-Squared

There is no true R^2 for logit/probit. McFadden's pseudo-R^2 compares the log-likelihood of your model to a null model (intercept only):

\text{Pseudo-}R^2 = 1 - \frac{\ell(\hat{\beta})}{\ell(\hat{\beta}_0)}
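Computing it requires only two numbers: the fitted model's log-likelihood and the intercept-only log-likelihood, which is determined entirely by the sample share of 1s. The values below are hypothetical, chosen for illustration:

```python
import numpy as np

# Hypothetical fitted-model log-likelihood (assumed for illustration)
ll_model = -412.7
n, p_bar = 1000, 0.35   # assumed sample size and share of ones

# Null (intercept-only) log-likelihood: every observation gets p_bar
ll_null = n * (p_bar * np.log(p_bar) + (1 - p_bar) * np.log(1 - p_bar))

pseudo_r2 = 1 - ll_model / ll_null
print(f"ll_null = {ll_null:.1f}")
print(f"McFadden pseudo-R2 = {pseudo_r2:.3f}")
```

Note that both log-likelihoods are negative, so the ratio is positive and the pseudo-R^2 lies between 0 (no improvement over the null) and 1.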

McFadden suggested that values between 0.2 and 0.4 indicate an excellent fit. Do not compare pseudo-R^2 values across different link functions.

Classification Table

Predict \hat{Y}_i = 1 if \hat{p}_i > c (usually c = 0.5) and compute the confusion matrix. Report sensitivity (true positive rate), specificity (true negative rate), and overall accuracy. But be cautious: classification accuracy is sensitive to class imbalance.
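A minimal classification-table sketch, using toy predicted probabilities and outcomes (not output from any model above):

```python
import numpy as np

# Toy predicted probabilities and true outcomes
p_hat = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
y     = np.array([1,   1,   0,   1,   0,   1,   0,   0])

c = 0.5                              # classification cutoff
y_pred = (p_hat > c).astype(int)

tp = int(np.sum((y_pred == 1) & (y == 1)))   # true positives
tn = int(np.sum((y_pred == 0) & (y == 0)))   # true negatives
fp = int(np.sum((y_pred == 1) & (y == 0)))   # false positives
fn = int(np.sum((y_pred == 0) & (y == 1)))   # false negatives

sensitivity = tp / (tp + fn)         # true positive rate
specificity = tn / (tn + fp)         # true negative rate
accuracy = (tp + tn) / len(y)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```

The class-imbalance warning in the text is easy to see here: if 95% of observations were 0, predicting 0 for everyone would score 95% accuracy with zero sensitivity.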

Hosmer-Lemeshow Test

Groups observations into deciles of predicted probability and tests whether observed frequencies match predicted frequencies. A significant test suggests poor calibration, but the test has low power and is sensitive to the number of groups.


Interpreting Results

Three Ways to Report Logit Results

  1. Log-odds coefficients — the raw output. Hard to interpret; mainly useful for checking sign and significance.
  2. Odds ratios — e^{\beta_j}. "A one-unit increase in X multiplies the odds of Y=1 by e^{\beta_j}." Common in epidemiology and management.
  3. Marginal effects — the change in probability. Most intuitive. Preferred in economics.
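The three scales are easy to reconcile numerically. The coefficient below (β = 0.7) is made up for illustration and does not come from any model in this chapter:

```python
import math

beta = 0.7   # hypothetical log-odds coefficient

# Scale 1: log-odds. A one-unit increase in X raises the log-odds by 0.7.
# Scale 2: odds ratio. The odds of Y=1 are multiplied by exp(beta).
odds_ratio = math.exp(beta)

# Scale 3: marginal effect at baseline probability p is beta * p * (1 - p).
def marginal_effect(beta, p):
    return beta * p * (1 - p)

print(f"odds ratio:   {odds_ratio:.2f}")
print(f"ME at p=0.5:  {marginal_effect(beta, 0.5):.3f}")   # the maximum, beta/4
print(f"ME at p=0.9:  {marginal_effect(beta, 0.9):.3f}")   # much smaller near the boundary
```

The same coefficient thus reads as "log-odds up 0.7", "odds roughly doubled", or "probability up 17.5 points for a firm at 50% baseline, but only 6.3 points at 90%", which is why stating the scale explicitly matters.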

G. What Can Go Wrong

| Problem | What It Does | How to Fix It |
| --- | --- | --- |
| Reporting coefficients as marginal effects | Overstates/understates the effect | Compute and report AMEs |
| Perfect separation | MLE does not converge; coefficients explode to infinity | Drop the problematic variable, use penalized likelihood (Firth logit), or combine categories |
| Rare events | Finite-sample bias in coefficient estimates when Y=1 is very rare (under 5%) | Use rare-events logit (King & Zeng, 2001) or exact logit |
| Ignoring heteroskedasticity | Standard errors are wrong | Use robust SEs |
| Comparing coefficients across models | Logit coefficients are not comparable across models with different covariates (rescaling problem; Allison, 1999) | Compare marginal effects instead |
Assumption Failure Demo

Interpreting Logit Coefficients as Marginal Effects

Researcher computes average marginal effects after logit estimation

AME of firm size on adoption probability: 0.05 (SE = 0.017). A one-unit increase in firm size raises the probability of adoption by about 5 percentage points on average.

Assumption Failure Demo

Perfect Separation

All covariate values have some variation in the outcome — both 0s and 1s appear at every level of X

Logit converges normally. Coefficient on industry dummy: 1.8 (SE = 0.4). MLE is well-defined and standard errors are reliable.

Assumption Failure Demo

Comparing Logit Coefficients Across Models

Researcher compares average marginal effects across a baseline model and a model with additional controls

AME of R&D on adoption: 0.08 (baseline model) vs. 0.06 (with controls). The 2 percentage point decrease suggests modest confounding by the added covariates.

Concept Check

A logit regression of firm adoption on firm size produces a coefficient of 0.3 with robust SE 0.1. The average marginal effect is 0.05. How do you interpret the result?


H. Practice

Concept Check

A researcher runs a logit model and a probit model on the same data. The logit coefficient on firm size is 0.48 and the probit coefficient is 0.28. She concludes that the logit model estimates a much larger effect. Is she correct?

Concept Check

A logit model predicting loan default produces a coefficient of -0.8 on credit score (standardized). The odds ratio is exp(-0.8) = 0.45. A manager asks: 'So a one-SD increase in credit score cuts the default probability in half?' Is the manager correct?

Concept Check

A colleague says: 'I always use logit for binary outcomes because OLS can predict probabilities outside [0,1].' When might the linear probability model (LPM) actually be a reasonable choice?

Concept Check

You add an interaction term (firm_size * rd_spending) to a logit model. The coefficient on the interaction is 0.15 (p = 0.03). A reviewer says you cannot interpret the interaction effect by looking at this coefficient alone. Why?

Guided Exercise

Interpreting Logit Results: Loan Default Prediction

A bank analyst runs a logit regression to predict whether a small business loan will default. The dependent variable is Default (1 = defaulted, 0 = repaid). The key predictor is Years_in_business (continuous). The estimated logit coefficient is -0.4 and the average marginal effect is -0.06. The baseline default probability in the sample is 20%.

In what units is the logit coefficient (-0.4) expressed?

How do you interpret the average marginal effect of -0.06?

If a colleague says the odds of default decrease by 40% per additional year, are they correct?

Why can you not interpret the logit coefficient directly as a probability change?

Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies whether receiving venture capital funding affects the probability that a startup goes public (IPO). They run a logit regression of IPO (0/1) on VC_funded (0/1), controlling for firm age, industry, and founder experience. They report: 'The coefficient on VC_funded is 1.2 (p < 0.01), meaning that VC funding increases the probability of IPO by 120 percentage points.'

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors study whether firms with female CEOs are more likely to adopt environmental sustainability practices. Using a cross-section of 3,200 publicly traded firms, they run a logit regression of sustainability adoption (0/1) on a female CEO dummy, controlling for firm size (log revenue), industry dummies, ROA, and firm age. They report that the odds ratio on female CEO is 1.85 (p = 0.002) and conclude that female leadership causes firms to be 85% more likely to adopt sustainability practices.

Key Table

| Variable | Odds Ratio | Robust SE | p-value |
| --- | --- | --- | --- |
| Female CEO | 1.85 | 0.35 | 0.002 |
| Log(Revenue) | 1.42 | 0.08 | 0.000 |
| ROA | 1.10 | 0.22 | 0.640 |
| Firm age | 1.01 | 0.003 | 0.001 |
| Industry FE | Yes | | |
| Pseudo R-squared | 0.18 | | |
| N | 3,200 | | |

Authors' Identification Claim

By controlling for firm size, profitability, firm age, and industry, we isolate the independent effect of CEO gender on sustainability adoption.


I. Swap-In: When to Use Something Else

  • Linear Probability Model (LPM): If your probabilities are between 0.2 and 0.8 for most observations, the LPM with robust SEs often gives nearly identical average marginal effects. Easier to interpret and to combine with FE or IV.
  • Conditional logit (fixed effects logit): For panel data with unit fixed effects. Only uses within-unit variation. See Chamberlain (1980).
  • Multinomial logit: When the outcome has more than two unordered categories.
  • Ordered logit/probit: When the outcome has ordered categories (e.g., strongly disagree to strongly agree).
  • Count models: When the outcome is a non-negative integer (number of events), see Poisson / Negative Binomial instead.

J. Reviewer Checklist



Paper Library

Foundational (6)

McFadden, D. (1974). Conditional Logit Analysis of Qualitative Choice Behavior.

Frontiers in Econometrics

McFadden developed the conditional logit model grounded in random utility theory, showing how discrete choices among alternatives can be modeled by assuming individuals maximize utility with an extreme-value distributed error. This work earned him the 2000 Nobel Prize and remains the foundation of discrete choice analysis.

Amemiya, T. (1981). Qualitative Response Models: A Survey.

Journal of Economic Literature

Amemiya provided a comprehensive survey of qualitative response models including logit, probit, and tobit. This survey organized the theoretical properties, estimation methods, and specification tests for binary and multinomial choice models and became a standard reference for applied researchers.

Hausman, J., & McFadden, D. (1984). Specification Tests for the Multinomial Logit Model.

Econometrica. DOI: 10.2307/1910997

This paper developed a specification test for the independence of irrelevant alternatives (IIA) assumption in multinomial logit. The test allows researchers to assess whether the logit model's restrictive substitution patterns are appropriate for their data, which is critical for applied work with multiple choice categories.

Ai, C., & Norton, E. C. (2003). Interaction Terms in Logit and Probit Models.

Economics Letters. DOI: 10.1016/S0165-1765(03)00032-6

Ai and Norton showed that the interpretation of interaction terms in nonlinear models like logit and probit is much more complicated than in linear models. The marginal effect of an interaction is not simply the coefficient on the interaction term, a mistake that was widespread in applied research.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.

Wooldridge's graduate textbook provides a comprehensive and rigorous treatment of logit, probit, and other discrete choice models in both cross-sectional and panel data settings. Chapters 15–16 cover binary response models, multinomial models, and the econometric issues specific to nonlinear estimation with unobserved heterogeneity.

Chamberlain, G. (1980). Analysis of Covariance with Qualitative Data.

Review of Economic Studies. DOI: 10.2307/2297110

Chamberlain showed how to incorporate fixed effects into logit models via conditional maximum likelihood, which is essential for panel data applications where unobserved unit-level heterogeneity must be controlled for.

Application (4)

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.

Princeton University Press. DOI: 10.1515/9781400829828

Angrist and Pischke argue that for causal inference purposes, the linear probability model (OLS on a binary outcome) is often preferable to logit or probit because it avoids functional form assumptions and yields easily interpretable coefficients. This influential perspective has shifted many applied researchers toward LPM.

Hoetker, G. (2007). The Use of Logit and Probit Models in Strategic Management Research: Critical Issues.

Strategic Management Journal. DOI: 10.1002/smj.582

Hoetker reviewed how strategy researchers use logit and probit models and identified common pitfalls, including misinterpretation of coefficients across groups and incorrect use of interaction terms. This paper provided concrete guidance for improving practice in management journals.

Zelner, B. A. (2009). Using Simulation to Interpret Results from Logit, Probit, and Other Nonlinear Models.

Strategic Management Journal. DOI: 10.1002/smj.783

Zelner advocated using simulation-based approaches to interpret and present results from nonlinear models in management research. By computing predicted probabilities and marginal effects via simulation, researchers can convey substantive significance more clearly than raw coefficients.

Palepu, K. G. (1986). Predicting Takeover Targets: A Methodological and Empirical Analysis.

Journal of Accounting and Economics. DOI: 10.1016/0165-4101(86)90008-X

Palepu used logit models to predict which firms would become takeover targets based on financial and market characteristics. This influential paper demonstrated the practical application of binary choice models to corporate strategy and governance questions.

Survey (3)

Train, K. E. (2009). Discrete Choice Methods with Simulation.

Cambridge University Press. DOI: 10.1017/CBO9780511805271

Train's textbook provides a comprehensive and accessible treatment of logit, probit, mixed logit, and other discrete choice models. It covers both theory and practical simulation-based estimation methods and is widely used in economics, marketing, and transportation research.

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications.

Cambridge University Press

Chapters 14–15 offer comprehensive coverage of binary and multinomial choice models, with detailed discussion of estimation and specification testing.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables.

Sage Publications

A widely used reference for applied researchers working with binary, ordinal, multinomial, and count outcome models, with clear exposition of interpretation and software implementation.

Tags

discrete-choice · binary-outcome · cross-sectional