Instrumental Variables / 2SLS
Uses an external source of variation (instrument) that affects treatment but not the outcome directly.
Quick Reference
- When to Use
- When your key regressor is endogenous (correlated with the error term) and you have an instrument — a variable that affects the treatment but has no direct effect on the outcome.
- Key Assumption
- Relevance (instrument predicts the endogenous regressor, first-stage F > 10), exogeneity (instrument uncorrelated with the error), and the exclusion restriction (instrument affects the outcome only through the endogenous regressor). The exclusion restriction is untestable with a single instrument.
- Common Mistake
- Using a weak instrument (first-stage F < 10) without acknowledging the resulting bias toward OLS, or not reporting the first-stage F-statistic. Also, not recognizing that IV estimates LATE (the effect for compliers), not the ATE.
- Estimated Time
- 3.5 hours
One-Line Implementation
- Stata: `ivregress 2sls y x1 (treatment = instrument), first vce(robust)`
- R (fixest): `feols(y ~ x1 | 0 | treatment ~ instrument, data = df, vcov = "HC1")`
- Python (linearmodels): `IV2SLS(dependent=df['y'], exog=df[['const','x1']], endog=df['treatment'], instruments=df['instrument']).fit(cov_type='robust')`

Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: Colonial Origins of Comparative Development
Why are some countries rich and others poor? Acemoglu et al. (2001) proposed that institutions — the rules governing economic activity — are the key driver. But institutions are endogenous: rich countries invest in better institutions, creating a classic chicken-and-egg problem.
Their solution was an instrument: settler mortality in the colonial era. The argument runs as follows:
- In places where European settlers faced high mortality (tropical diseases), colonizers set up extractive institutions designed to transfer wealth to the metropole.
- In places where settlers could survive (temperate climates), they created inclusive institutions with property rights and rule of law.
- These institutional differences persisted and shaped modern economic outcomes.
- Critically, settler mortality from centuries ago affects current GDP only through its effect on institutions — it has no direct effect on economic performance today.
This claim is the exclusion restriction: the instrument (settler mortality) affects the outcome (GDP) only through the endogenous variable (institutions). If this holds, 2SLS can recover the causal effect of institutions on development. A related strategy that builds on this IV logic is the shift-share (Bartik) instrument, which interacts local exposure shares with national-level shocks to generate cross-sectional variation.
Whether the exclusion restriction actually holds in this case has been debated for two decades (see Albouy, 2012). That debate is itself a masterclass in IV methodology.
This strategy is the fundamental logic of instrumental variables: find an external source of variation that shifts the endogenous regressor without directly affecting the outcome, and use that variation to recover causal effects.
A. Overview
The Endogeneity Problem
Consider the regression:

$$Y_i = \alpha + \beta D_i + \varepsilon_i$$

If $\mathrm{Cov}(D_i, \varepsilon_i) \neq 0$ — the treatment or key regressor is correlated with the error — then OLS is biased and inconsistent. This endogeneity arises from omitted variables (confounding), reverse causality, or measurement error. Standard sensitivity analysis techniques can quantify how severe confounding must be to explain the estimated relationship, but when confounding is clearly present, OLS adjustment alone is insufficient.
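The bias is easy to see in simulation. A minimal numpy sketch (the parameter values are hypothetical, chosen to mirror the interactive demos later on this page):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

U = rng.normal(size=n)                      # unobserved confounder
D = 0.5 * U + rng.normal(size=n)            # treatment picks up U: endogenous
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)  # true causal effect of D is 2.0

# OLS slope of Y on D is Cov(D, Y) / Var(D); the Cov(D, U) channel inflates it
beta_ols = np.cov(D, Y, ddof=1)[0, 1] / np.var(D, ddof=1)
print(round(beta_ols, 2))  # well above the true value of 2.0
```

With these values the OLS probability limit is $2 + 2 \cdot 0.5 / 1.25 = 2.8$: no amount of data fixes the bias.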
The IV Solution
An instrumental variable $Z$ solves this endogeneity problem by isolating the part of the variation in $D$ that is uncorrelated with the error $\varepsilon$. The instrument must satisfy three conditions for consistent estimation, plus a fourth for the LATE interpretation:
- Relevance: $\mathrm{Cov}(Z, D) \neq 0$ — the instrument must actually affect the endogenous variable. This condition is testable.
- Independence (Exogeneity): $\mathrm{Cov}(Z, \varepsilon) = 0$ — the instrument must be uncorrelated with the error term. This condition is not directly testable with a single instrument.
- Exclusion Restriction: $Z$ affects $Y$ only through $D$ — there is no direct effect. This restriction is a maintained assumption that must be argued substantively.
- Monotonicity (for LATE interpretation): The instrument affects treatment status in only one direction for all units — there are no "defiers." This assumption is required for the IV estimate to be interpretable as the average effect for compliers.
Two-Stage Least Squares (2SLS)
The estimation proceeds in two stages:
Stage 1: Regress the endogenous variable on the instrument(s) and controls:

$$D_i = \pi_0 + \pi_1 Z_i + \gamma' X_i + \nu_i$$

Stage 2: Regress the outcome on the predicted values from Stage 1 and the controls:

$$Y_i = \alpha + \beta \hat{D}_i + \delta' X_i + u_i$$

The coefficient $\hat{\beta}$ uses only the variation in $D$ that is driven by $Z$ — purging the endogenous component.
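The two stages can be sketched directly in numpy (hypothetical DGP values; note that the standard errors from a manual second stage are wrong, as Section D explains):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

Z = rng.normal(size=n)                       # instrument
U = rng.normal(size=n)                       # unobserved confounder
D = 0.8 * Z + 0.5 * U + rng.normal(size=n)   # first stage: Z shifts D
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true causal effect is 2.0

ones = np.ones(n)

# Stage 1: regress D on Z (and a constant), keep the fitted values D_hat
X1 = np.column_stack([ones, Z])
pi_hat = np.linalg.lstsq(X1, D, rcond=None)[0]
D_hat = X1 @ pi_hat

# Stage 2: regress Y on D_hat (and a constant); the slope is the 2SLS estimate
X2 = np.column_stack([ones, D_hat])
beta_2sls = np.linalg.lstsq(X2, Y, rcond=None)[0][1]
print(round(beta_2sls, 2))  # close to the true effect of 2.0
```

OLS on the same data would land near 2.8; projecting $D$ onto $Z$ first strips out the confounded variation.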
IV Estimates as LATE
When treatment effects are heterogeneous, the IV estimator does not recover the average treatment effect (ATE) for the entire population. Instead, it recovers the Local Average Treatment Effect (LATE) — the average effect for compliers, i.e., units whose treatment status is affected by the instrument.
This locality means the IV estimate may differ from the OLS estimate even if OLS were unbiased, because they estimate effects for different subpopulations (Imbens & Angrist, 1994). For settings where treatment effects are heterogeneous and researchers want to understand the full distribution of effects, methods like causal forests provide complementary tools.
B. Identification
Formal Conditions
For the model $Y_i = \beta D_i + \varepsilon_i$ (suppressing controls for clarity), the IV estimator is consistent if:
- $\mathrm{Cov}(Z_i, \varepsilon_i) = 0$ (exogeneity / exclusion)
- $\mathrm{Cov}(Z_i, D_i) \neq 0$ (relevance)
Under these conditions:

$$\beta_{IV} = \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, D)}$$
The LATE Theorem
With a binary instrument $Z_i$ and binary treatment $D_i$, there are four types of units (Angrist et al., 1996):
- Compliers: $D_i(1) = 1,\; D_i(0) = 0$ — take treatment when encouraged, do not when not
- Always-takers: $D_i(1) = D_i(0) = 1$ — always take treatment
- Never-takers: $D_i(1) = D_i(0) = 0$ — never take treatment
- Defiers: $D_i(1) = 0,\; D_i(0) = 1$ — do the opposite of encouragement
Under the monotonicity assumption (no defiers), the Wald estimator

$$\hat{\beta}_{Wald} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}$$

identifies the LATE: the average treatment effect for compliers.
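A sketch of the Wald estimator with a randomized binary encouragement and one-sided non-compliance (hypothetical DGP; the treatment effect is constant here, so the LATE coincides with the ATE):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

Z = rng.integers(0, 2, size=n)              # randomized binary encouragement
U = rng.normal(size=n)                      # unobserved factor driving selection

always_taker = U > 1.0                      # ~16% take treatment regardless of Z
D = ((Z == 1) | always_taker).astype(float)
Y = 2.0 * D + 1.5 * U + rng.normal(size=n)  # constant effect of 2.0

num = Y[Z == 1].mean() - Y[Z == 0].mean()   # ITT: reduced-form effect of Z on Y
den = D[Z == 1].mean() - D[Z == 0].mean()   # first stage: the complier share
wald = num / den
print(round(wald, 2))  # recovers the causal effect despite U driving selection
```

A naive comparison of treated vs. untreated units would be contaminated by the always-takers' high $U$; scaling the ITT by the complier share removes that selection.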
C. Visual Intuition
Picture a scatter plot of $Y$ vs. $D$ where $D$ is endogenous. The OLS line through this cloud is biased because unobserved factors push both $D$ and $Y$ in the same direction.
Now imagine the instrument $Z$ as a lever that shifts $D$ left or right. Some units get a "push" (high $Z$) and others do not (low $Z$). The IV estimator looks at how much $Y$ changes per unit of $Z$-induced change in $D$. It ignores the endogenous part of $D$ entirely.
Think of it as plumbing. The ordinary relationship between $D$ and $Y$ is contaminated — dirty water (endogeneity) flows in. The instrument provides a clean source of variation in $D$. By isolating the variation in $D$ that comes from $Z$ (the clean water), you can estimate the effect of $D$ on $Y$ without contamination.
When confounding is zero, IV and OLS agree. As confounding increases, OLS drifts away from the truth while IV stays on target — but with wider confidence intervals. When instrument strength approaches zero, IV becomes erratic (the weak instrument problem).
Instrument Strength and Bias
IV recovers the true causal effect when the instrument is strong and the exclusion restriction holds. A weak instrument (low first-stage F) amplifies any small exclusion restriction violation into large bias.
Computed Results
- IV Estimate: 2.00
- OLS Estimate (biased): 3.50
- IV Bias from Violation: 0.000
Instrumental Variables (IV / 2SLS)
Explore how IV/2SLS corrects for endogeneity when an unobserved confounder U biases OLS. The DGP is D = 0.8·Z + 0.50·U + ν and Y = 2.0·D + 2·U + ε.
Regression Results
| Estimator | β̂ | Bias | 1st-stage F |
|---|---|---|---|
| OLS (biased) | 2.563 | +0.563 | — |
| IV / 2SLS | 2.180 | +0.180 | 137.7 |
| True β | 2.000 | — | — |
IV corrects the bias. OLS is biased by +0.56 due to the unobserved confounder, while the IV estimate (2.18) is much closer to the true β = 2.0.
Why Instruments? Isolating Exogenous Variation
IV DGP: Y = 2.0·D + 2·U + ε, where D = 0.7·Z + 1.2·U + ν. Confounding strength = 0.6. Exclusion violation = 0.0.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| OLS | 2.866 | 0.056 | [2.76, 2.98] | +0.866 |
| IV / 2SLS | 2.179 | 0.330 | [1.53, 2.83] | +0.179 |
| True β | 2.000 | — | — | — |
Why the difference?
OLS is biased (+0.87) because D is endogenous: the confounder U pushes both D and Y in the same direction, inflating the estimated relationship. With confounding strength = 0.6, OLS attributes to D the effect that actually comes from U. IV isolates the exogenous variation in D driven by instrument Z (π̂ = 0.77, F = 97.4). The Wald ratio Cov(Z,Y)/Cov(Z,D) = 2.179 removes the confounding bias, yielding an estimate much closer to the truth.
D. Mathematical Derivation
Don't worry about the notation yet; in words, 2SLS projects the endogenous regressors onto the instrument space, then runs OLS on the projected values:

$$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$$
Let $Z$ ($n \times \ell$) be the matrix of instruments (and exogenous regressors), $X$ ($n \times k$) the matrix of all regressors (including endogenous ones), and $Y$ the outcome vector.

Step 1: Project X onto instrument space.

$$\hat{X} = P_Z X, \qquad P_Z = Z(Z'Z)^{-1}Z'$$

Step 2: OLS of Y on projected X.

$$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y = (X'P_Z X)^{-1} X'P_Z Y$$

Consistency: Substitute $Y = X\beta + \varepsilon$:

$$\hat{\beta}_{IV} = \beta + (X'P_Z X)^{-1} X'P_Z \varepsilon$$

Under $E[Z'\varepsilon] = 0$ and relevance ($\mathrm{rank}\, E[Z'X] = k$):

$$\hat{\beta}_{IV} \xrightarrow{p} \beta$$

Variance (robust):

$$\widehat{\mathrm{Var}}(\hat{\beta}_{IV}) = (X'P_Z X)^{-1} X'P_Z \hat{\Omega} P_Z X (X'P_Z X)^{-1}, \qquad \hat{\Omega} = \mathrm{diag}(\hat{\varepsilon}_i^2)$$
Important: Standard errors in Stage 2 must be computed using the original $X$, not $\hat{X}$. Running two separate OLS regressions manually gives incorrect SEs. Always use a dedicated 2SLS command.
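To see why the residual choice matters, here is a minimal numpy sketch (hypothetical DGP; homoskedastic SEs only) that computes 2SLS in matrix form and contrasts the correct SE with the forbidden-regression SE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

z = rng.normal(size=n)
U = rng.normal(size=n)
D = 0.8 * z + 0.5 * U + rng.normal(size=n)
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true beta = 2.0

ones = np.ones(n)
Z = np.column_stack([ones, z])    # instrument matrix (constant instruments itself)
X = np.column_stack([ones, D])

# 2SLS: beta = (X' Pz X)^{-1} X' Pz Y, computing Pz X without forming Pz
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
XtPzX = PzX.T @ X                 # equals X' Pz X (Pz is symmetric idempotent)
beta = np.linalg.solve(XtPzX, PzX.T @ Y)

# homoskedastic SEs: sigma^2 (X' Pz X)^{-1}; sigma^2 must use the ORIGINAL X
e_correct = Y - X @ beta          # right: residuals from the original regressors
e_wrong = Y - PzX @ beta          # "forbidden regression": residuals from X-hat

cov_inv = np.linalg.inv(XtPzX)
se_correct = np.sqrt(e_correct @ e_correct / (n - 2) * np.diag(cov_inv))[1]
se_wrong = np.sqrt(e_wrong @ e_wrong / (n - 2) * np.diag(cov_inv))[1]
print(round(beta[1], 2), round(se_correct, 3), round(se_wrong, 3))
```

The point estimate is identical either way; only the residual variance differs. In this DGP the manual SE is too large, but in other DGPs it can be too small — it is simply wrong.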
Bias of 2SLS in finite samples:

$$E[\hat{\beta}_{2SLS} - \beta] \approx \frac{E[\hat{\beta}_{OLS} - \beta]}{F}$$

where $F$ is the first-stage F-statistic. In the just-identified case (one instrument), the relative bias (as a fraction of the OLS bias) is approximately $1/F$. This follows because the concentration parameter satisfies $\mu^2 \approx \ell(F - 1)$, and the Nagar (1959) approximation gives relative bias of order $\ell/\mu^2$. When $F$ is large, bias vanishes. When $F$ is small, 2SLS inherits much of the OLS bias. Note that with many instruments, the standard F-statistic can be misleadingly large; the effective F-statistic of Montiel Olea and Pflueger (2013) is the appropriate measure in that case.
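A small Monte Carlo sketch (hypothetical parameter values; three deliberately weak instruments) illustrates the pattern: when the first-stage F is far below 10, the 2SLS median lands between the truth and OLS:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L, reps = 100, 3, 2000
beta_true, pi = 2.0, 0.1        # tiny first-stage coefficients: weak instruments

b_ols, b_iv, Fs = [], [], []
for _ in range(reps):
    Z = rng.normal(size=(n, L))
    U = rng.normal(size=n)
    D = Z @ np.full(L, pi) + 0.5 * U + rng.normal(size=n)
    Y = beta_true * D + 2.0 * U + rng.normal(size=n)

    Zc = Z - Z.mean(axis=0)
    Dc, Yc = D - D.mean(), Y - Y.mean()

    g = np.linalg.lstsq(Zc, Dc, rcond=None)[0]      # first-stage coefficients
    rss1 = np.sum((Dc - Zc @ g) ** 2)
    Fs.append(((np.sum(Dc ** 2) - rss1) / L) / (rss1 / (n - L - 1)))

    Dhat = Zc @ g
    b_iv.append((Dhat @ Yc) / (Dhat @ Dc))          # 2SLS slope
    b_ols.append((Dc @ Yc) / (Dc @ Dc))             # OLS slope

print(f"mean F = {np.mean(Fs):.1f}, "
      f"median OLS = {np.median(b_ols):.2f}, median 2SLS = {np.median(b_iv):.2f}")
```

With a mean first-stage F around 2, the 2SLS distribution is pulled noticeably toward the OLS value instead of centering on the truth.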
LIML (Limited Information Maximum Likelihood): An alternative to 2SLS that is less biased with weak instruments. In just-identified models (one instrument per endogenous variable), LIML and 2SLS are numerically identical.
E. Implementation
library(fixest)
library(ivreg)
# 2SLS with fixest
iv_fit <- feols(gdp_pc ~ controls | 0 | institutions ~ settler_mortality,
data = df, vcov = "HC1")
summary(iv_fit)
# First-stage diagnostics
fitstat(iv_fit, type = "ivf") # First-stage F
fitstat(iv_fit, type = "sargan") # Over-ID test (if applicable)
# 2SLS with ivreg (more diagnostics)
iv_fit2 <- ivreg(gdp_pc ~ institutions + controls | settler_mortality + controls,
data = df)
summary(iv_fit2, diagnostics = TRUE)
# LIML (via the ivmodel package)
library(ivmodel)
iv_mod <- ivmodel(Y = df$gdp_pc, D = df$institutions, Z = df$settler_mortality,
X = as.matrix(df[, "controls", drop = FALSE]))
LIML(iv_mod)

F. Diagnostics
First-Stage F-Statistic
A central diagnostic for IV. The rule of thumb from Staiger and Stock (1997) is $F > 10$.
More recent guidance from Lee et al. (2022) shows that the standard first-stage F must exceed 104.7 for the conventional $t$-ratio critical value of 1.96 to control size at 5% in the just-identified case. Below that threshold, the authors provide an adjusted critical-value function (the tF procedure) that remains valid. In practice, F-statistics between 10 and 104.7 warrant the use of weak-instrument-robust inference methods such as the Anderson-Rubin test or the tF procedure.
Reduced Form
Always report the reduced-form regression: regress $Y$ directly on $Z$ (and controls). If the reduced form is insignificant, IV will be imprecise (even if the first stage is strong). The reduced form is the ITT analog in the IV framework.
Over-Identification Test (Sargan-Hansen J-Test)
When you have more instruments than endogenous variables, the J-test checks whether the instruments are consistent with each other. A rejection suggests at least one instrument violates the exclusion restriction. But beware: the test has low power when all instruments are invalid in the same direction.
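As a sketch of the mechanics (hypothetical two-instrument DGP; both instruments are valid here, so the test should not reject), the Sargan statistic is $n R^2$ from regressing the 2SLS residuals on the instruments:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

Z = rng.normal(size=(n, 2))               # two instruments, both valid
U = rng.normal(size=n)
D = Z @ np.array([0.6, 0.4]) + 0.5 * U + rng.normal(size=n)
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)

Zc = Z - Z.mean(axis=0)
Dc, Yc = D - D.mean(), Y - Y.mean()

# 2SLS slope, then Sargan: n * R^2 of 2SLS residuals regressed on Z
g = np.linalg.lstsq(Zc, Dc, rcond=None)[0]
Dhat = Zc @ g
beta = (Dhat @ Yc) / (Dhat @ Dc)
e = Yc - beta * Dc                        # residuals use the ORIGINAL D

h = np.linalg.lstsq(Zc, e, rcond=None)[0]
J = n * (1 - np.sum((e - Zc @ h) ** 2) / np.sum(e ** 2))
print(round(J, 2))  # compare to chi-square with (2 - 1) = 1 degree of freedom
```

With valid instruments, $J$ is distributed $\chi^2_{\ell - k}$; a large value flags that at least one instrument fails exclusion (or that effects are heterogeneous across instruments).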
Weak Instrument Robust Inference
When the first-stage F is low, use the Anderson-Rubin (AR) test, which is valid regardless of instrument strength. The AR confidence set inverts a test of the reduced-form null and has correct size even with arbitrarily weak instruments.
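The AR test is simple to compute by hand. A sketch under a hypothetical DGP: for a candidate value $\beta_0$, regress $Y - \beta_0 D$ on $Z$ and F-test the instrument coefficients:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

Z = rng.normal(size=n)
U = rng.normal(size=n)
D = 0.2 * Z + 0.5 * U + rng.normal(size=n)   # deliberately weak-ish first stage
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true beta = 2.0

def ar_stat(b0):
    # F-test of Z in the regression of (Y - b0*D) on Z; valid at any strength
    r = Y - b0 * D
    rc, zc = r - r.mean(), Z - Z.mean()
    slope = (zc @ rc) / (zc @ zc)
    rss1 = np.sum((rc - slope * zc) ** 2)
    return (np.sum(rc ** 2) - rss1) / (rss1 / (n - 2))

print(round(ar_stat(2.0), 1), round(ar_stat(3.5), 1))
# the statistic at the true value is far smaller than at a distant value
```

Collecting all $\beta_0$ values whose AR statistic falls below the relevant F critical value gives the AR confidence set; with a very weak instrument that set can be wide or even unbounded, which is the honest answer.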
Hausman/Durbin-Wu-Hausman Test (OLS vs. IV)
Compare OLS and IV estimates formally. Under the null that OLS is consistent, OLS and IV should give similar estimates. If they differ significantly, OLS is likely biased.
Interpreting Results
- Sign and magnitude: IV estimates are often larger than OLS (in absolute value). Two common explanations: (1) measurement error in $D$ attenuates OLS estimates, and IV corrects this attenuation; (2) IV identifies the LATE for compliers, who may respond more strongly to treatment than the average person. These explanations are not mutually exclusive, and distinguishing between them requires additional analysis.
- Precision: IV estimates are typically much less precise than OLS. Wide confidence intervals are the norm, not the exception.
- Compliers: Think carefully about who the compliers are. In the Acemoglu et al. example, compliers are countries whose institutional quality was determined by settler mortality. In the Angrist and Krueger (1991) quarter-of-birth example, compliers are people who would have gotten more education if not for compulsory schooling laws.
G. What Can Go Wrong
| Problem | What It Does | How to Fix It |
|---|---|---|
| Weak instruments (first-stage $F < 10$) | IV is biased toward OLS; confidence intervals have wrong coverage | Use Anderson-Rubin test; find a stronger instrument; use LIML |
| Exclusion restriction violated | IV is biased in an unknown direction; bias may be worse than OLS | Argue substantively; sensitivity analysis (Conley et al., 2012) |
| Forbidden regression | Running 2SLS manually with two OLS steps gives wrong SEs | Use dedicated IV commands |
| Many weak instruments | Bias toward OLS increases with number of instruments | Use LIML estimator or JIVE; reduce instrument count |
| Heterogeneous effects ignored | Interpreting LATE as ATE when they differ | Discuss complier characteristics; present OLS alongside |
| Instrument not excludable | Direct effect of Z on Y biases the IV estimate | Argue exclusion carefully; sensitivity analysis |
Weak Instrument Bias (F < 10)
Strong instrument: settler mortality has a first-stage F-statistic of 22.9
IV estimate of effect of institutions on log GDP: 0.94 (SE = 0.16). Bias relative to OLS is approximately 1/22.9 = 4%. The IV estimate is reliable.
Exclusion Restriction Violation
Rainfall affects civil conflict only through its effect on economic growth (the endogenous variable)
IV estimate of growth on conflict: -0.12 (SE = 0.04). If rainfall has no direct effect on conflict except through the economy, the exclusion restriction holds and the estimate is consistent.
The Forbidden Regression (Manual Two-Step OLS)
Use a dedicated 2SLS command that computes standard errors correctly
ivregress 2sls y (D = Z), vce(robust). SE = 0.16. Correct inference because the command uses the original D (not fitted D-hat) in the variance formula.
H. Practice
IV Validity: Instrumenting Military Service with Draft Lottery Numbers
Angrist (1990) estimates the effect of Vietnam-era military service on long-run earnings. The problem is that who serves is not random — men from disadvantaged backgrounds are more likely to enlist. The famous solution is to use draft lottery numbers (randomly assigned by birth date) as an instrument for military service.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
A researcher instruments 'years of schooling' with 'quarter of birth' to estimate the return to education. The first-stage F-statistic is 4.2. What is the main concern?
You estimate the effect of institutions on GDP using IV (settler mortality instrument) and OLS. The OLS estimate is 0.52 (SE = 0.06) and the IV estimate is 0.94 (SE = 0.16). Both are statistically significant. Why might the IV estimate be nearly twice as large?
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
A study examines whether R&D spending affects firm revenue. The authors instrument R&D spending with 'industry-average R&D spending' (the average R&D of all OTHER firms in the same industry). Using a panel of 5,000 firms over 10 years, they find a first-stage F-statistic of 28 and estimate that a \$1M increase in R&D raises revenue by \$4.2M (p < 0.01). They include firm and year fixed effects.
Key Table
| Variable | Coefficient | Robust SE | p-value |
|---|---|---|---|
| R&D (instrumented) | 4.200 | 1.100 | 0.000 |
| Firm Size (log) | 0.015 | 0.006 | 0.012 |
| Firm FE | Yes | ||
| Year FE | Yes | ||
| First-stage F | 28 | ||
| N | 50,000 |
Authors' Identification Claim
Industry-average R&D (excluding the focal firm) is correlated with the focal firm's R&D through technology spillovers but is uncorrelated with firm-specific revenue shocks.
I. Swap-In: When to Use Something Else
- OLS with controls: When conditional exogeneity (selection on observables) is more credible than the exclusion restriction, and a rich set of covariates is available.
- Regression discontinuity: When treatment is assigned by a threshold on a running variable — RDD provides a more transparent and locally randomized design.
- Difference-in-differences: When a policy change provides before/after and treated/untreated variation without requiring an instrument.
- Matching: When selection into treatment is primarily on observables and the overlap condition is satisfied.
- Reduced form only: When the instrument is valid but weak (F < 10), reporting the reduced-form effect of the instrument on the outcome avoids the bias amplification of 2SLS.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (9)
Angrist, J. D., & Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings?.
Angrist and Krueger used quarter of birth as an instrument for years of schooling, exploiting the fact that compulsory schooling laws interact with birth timing. This paper is one of the most-taught examples of instrumental variables in economics and also sparked important debates about weak instruments.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.
This paper clarified what IV actually estimates: the Local Average Treatment Effect (LATE), which is the causal effect for 'compliers'—people whose treatment status is changed by the instrument. This reinterpretation fundamentally changed how researchers think about IV estimates and their external validity.
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression.
Stock and Yogo developed critical values for testing whether instruments are 'weak'—that is, only weakly correlated with the endogenous variable. Their rule of thumb that the first-stage F-statistic should exceed 10 is probably the most widely used diagnostic in applied IV research.
Staiger, D., & Stock, J. H. (1997). Instrumental Variables Regression with Weak Instruments.
Staiger and Stock showed formally that when instruments are weak, 2SLS estimates are biased toward OLS and standard inference breaks down. This paper established the theoretical foundations for the weak instruments problem that Stock and Yogo (2005) later provided practical tests for.
Lee, D. S., McCrary, J., Moreira, M. J., & Porter, J. (2022). Valid t-Ratio Inference for IV.
Lee, McCrary, Moreira, and Porter showed that the conventional t-ratio in IV regression has correct size when the first-stage F-statistic exceeds 104.7, far above the traditional Stock-Yogo threshold of 10. This paper fundamentally raised the bar for what constitutes a sufficiently strong instrument and has prompted researchers to reconsider previously accepted IV results.
Imbens, G. W., & Angrist, J. D. (1994). Identification and Estimation of Local Average Treatment Effects.
The foundational paper on LATE. Showed that IV identifies the average causal effect for compliers -- the subpopulation whose treatment status is changed by the instrument -- under the monotonicity assumption. This reinterpretation fundamentally changed how researchers understand what IV estimates.
Montiel Olea, J. L., & Pflueger, C. (2013). A Robust Test for Weak Instruments.
Proposes an effective F-statistic for testing weak instruments that is robust to heteroscedasticity, serial correlation, and clustering — unlike the conventional first-stage F. The effective F is now the standard diagnostic for instrument strength in applied IV research.
Angrist, J. D. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.
A landmark application of instrumental variables using the Vietnam-era draft lottery as a natural experiment. Angrist showed that randomly assigned lottery numbers provide an instrument for military service, allowing causal estimation of the earnings effect of military service.
Manski, C. F. (1993). Identification of Endogenous Social Effects: The Reflection Problem.
Formalized the reflection problem: when individual outcomes depend on group averages, the group average is simultaneously determined by its members, making it impossible to distinguish true social (endogenous) effects from correlated effects without additional structure.
Application (7)
Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The Colonial Origins of Comparative Development: An Empirical Investigation.
This celebrated paper used historical settler mortality as an instrument for institutional quality to estimate the causal effect of institutions on economic development. It is one of the most influential IV applications in economics and demonstrates the creativity required to find a plausible instrument.
Levitt, S. D. (1997). Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime.
Levitt used the timing of mayoral and gubernatorial elections as an instrument for police hiring to estimate the causal effect of police on crime. The paper illustrates the IV approach in a policy-relevant setting where the key concern is reverse causality (more crime leads to more police).
Bloom, N., & Van Reenen, J. (2007). Measuring and Explaining Management Practices Across Firms and Countries.
Bloom and Van Reenen developed a survey-based measure of management practices and used IV strategies (including firm age and governance rules) to study the causal relationship between management quality and firm productivity. This paper is a prominent IV application in management and organizational economics.
Semadeni, M., Withers, M. C., & Certo, S. T. (2014). The Perils of Endogeneity and Instrumental Variables in Strategy Research: Understanding through Simulations.
This paper used Monte Carlo simulations to demonstrate the dangers of using weak or invalid instruments in strategy research. It provides practical guidance for management scholars on when and how to use IV, and when it may do more harm than good.
Albouy, D. Y. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Comment.
Albouy critically re-examined the settler mortality instrument used in Acemoglu et al. (2001), showing that the original results are sensitive to data coding decisions and the sample of countries included. This comment is a cautionary tale about instrument validity and the fragility of influential IV estimates.
Miguel, E., Satyanath, S., & Sergenti, E. (2004). Economic Shocks and Civil Conflict: An Instrumental Variables Approach.
Instruments for economic growth using rainfall variation to estimate the causal effect of economic shocks on civil conflict in Sub-Saharan Africa. A clean and widely cited example of using weather as an instrumental variable, illustrating both the power and the exclusion restriction challenges of weather-based instruments.
Young, A. (2022). Consistency Without Inference: Instrumental Variables in Practical Application.
A provocative assessment showing that many published IV applications have first-stage F-statistics too weak for reliable inference when examined under modern standards. Highlights the gap between theoretical requirements for valid IV and actual practice in published research.
Survey (6)
Andrews, I., Stock, J. H., & Sun, L. (2019). Weak Instruments in Instrumental Variables Regression: Theory and Practice.
This survey provides an up-to-date review of the weak instruments problem, covering modern diagnostic tests, robust inference procedures, and practical recommendations. It is an excellent starting point for understanding the current best practices in IV estimation.
Stock, J. H., Wright, J. H., & Yogo, M. (2002). A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments.
A comprehensive treatment of weak instruments and their consequences for inference in IV and GMM settings. Covers the theoretical foundations of the weak instrument problem and practical diagnostic tools.
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.
Chapter 4 provides an accessible yet rigorous treatment of instrumental variables, two-stage least squares, and the LATE framework. The go-to textbook reference for understanding IV estimation in the context of modern applied econometrics.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Chapter 5 offers a comprehensive graduate-level treatment of IV estimation, including GMM, tests for overidentification, and the relationship between IV and control function approaches. The standard graduate econometrics textbook reference for IV methods.
Angrist, J. D., & Krueger, A. B. (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments.
A historical survey tracing the evolution of IV from its origins in supply-and-demand estimation to modern natural experiments. Provides valuable context for understanding how IV methodology developed and why it became central to applied economics.
Murray, M. P. (2006). Avoiding Invalid Instruments and Coping with Weak Instruments.
Practical guidance on evaluating instrument validity and dealing with weak instruments in applied work. Written in an accessible style, it helps applied researchers think critically about their instrument choices and provides concrete strategies for addressing common IV pitfalls.