MethodAtlas
Method · Intermediate · 15 min read
Design-Based · Established

Instrumental Variables / 2SLS

Uses an external source of variation (instrument) that affects treatment but not the outcome directly.

When to Use: When your key regressor is endogenous (correlated with the error term) and you have an instrument — a variable that affects the treatment but has no direct effect on the outcome.
Assumptions: Relevance (the instrument predicts the endogenous regressor; first-stage F > 10), exogeneity (the instrument is uncorrelated with the error), and the exclusion restriction (the instrument affects the outcome only through the endogenous regressor). The exclusion restriction is untestable with a single instrument.
Common Mistake: Using a weak instrument (first-stage F < 10) without acknowledging the resulting bias toward OLS, or failing to report the first-stage F-statistic. Also, not recognizing that IV estimates the LATE (the effect for compliers), not the ATE.
Reading Time: ~15 min read · 11 sections · 6 interactive exercises

One-Line Implementation

R: feols(y ~ x1 | 0 | treatment ~ instrument, data = df, vcov = 'HC1')
Stata: ivregress 2sls y x1 (treatment = instrument), first vce(robust)
Python: IV2SLS(dependent=df['y'], exog=df[['const','x1']], endog=df['treatment'], instruments=df['instrument']).fit(cov_type='robust')


Motivating Example: Colonial Origins of Comparative Development

Why are some countries rich and others poor? Acemoglu et al. (2001) proposed that institutions — the rules governing economic activity — are the key driver. But institutions are endogenous: rich countries invest in better institutions, creating a classic chicken-and-egg problem.

Their solution was an instrument: settler mortality in the colonial era. The argument runs as follows:

  1. In places where European settlers faced high mortality (tropical diseases), colonizers set up extractive institutions designed to transfer wealth to the metropole.
  2. In places where settlers could survive (temperate climates), they created inclusive institutions with property rights and rule of law.
  3. These institutional differences persisted and shaped modern economic outcomes.
  4. For the design to be valid, one must assume that settler mortality from centuries ago affects current GDP only through its effect on institutions — i.e., it has no direct effect on economic performance today.

This assumption is the exclusion restriction: the instrument (settler mortality) affects the outcome (GDP) only through the endogenous variable (institutions). If this restriction holds, 2SLS can recover the causal effect of institutions on development. A related strategy that builds on this IV logic is the shift-share (Bartik) instrument, which interacts local exposure shares with national-level shocks to generate cross-sectional variation.

Whether the exclusion restriction actually holds in this case has been debated for two decades (Albouy, 2012). That debate is itself a masterclass in IV methodology.

This strategy is the fundamental logic of instrumental variables: find an external source of variation that shifts the endogenous regressor without directly affecting the outcome, and use that variation to recover causal effects.


A. Overview

The Endogeneity Problem

Consider the regression:

Y_i = \beta_0 + \beta_1 D_i + X_i'\gamma + \varepsilon_i

If \text{Cov}(D_i, \varepsilon_i) \neq 0 — the treatment or key regressor is correlated with the error — then OLS is biased and inconsistent. This endogeneity arises from omitted variables, simultaneity (reverse causality), or measurement error. Standard sensitivity analysis techniques can quantify how severe confounding must be to explain the estimated relationship, but when confounding is clearly present, OLS adjustment alone is insufficient.

The IV Solution

An instrumental variable Z_i solves this endogeneity problem by isolating the part of D_i that is uncorrelated with \varepsilon_i. The instrument must satisfy three conditions for consistent estimation, plus a fourth for the LATE interpretation:

  1. Relevance: \text{Cov}(Z_i, D_i) \neq 0 — the instrument must actually affect the endogenous variable. This condition is testable.
  2. Independence (Exogeneity): \text{Cov}(Z_i, \varepsilon_i) = 0 — the instrument must be uncorrelated with the error term. This condition is not directly testable with a single instrument.
  3. Exclusion Restriction: Z_i affects Y_i only through D_i — there is no direct effect. This restriction is a maintained assumption that must be argued substantively.
  4. Monotonicity (for LATE interpretation): The instrument affects treatment status in only one direction for all units — there are no "defiers." This assumption is required for the IV estimate to be interpretable as the average effect for compliers.

Two-Stage Least Squares (2SLS)

The estimation proceeds in two stages:

Stage 1: Regress the endogenous variable on the instrument(s) and controls:

D_i = \pi_0 + \pi_1 Z_i + X_i'\delta + v_i

Stage 2: Regress the outcome on the predicted values \hat{D}_i from Stage 1 and the controls:

Y_i = \beta_0 + \beta_1 \hat{D}_i + X_i'\gamma + \varepsilon_i

The coefficient \hat{\beta}_1 uses only the variation in D_i that is driven by Z_i — purging the endogenous component.
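The two stages can be sketched on simulated data. Everything here is a hypothetical data-generating process (a true effect of 2.0, a first-stage slope of 0.8, and a shared confounder are all invented for illustration), not an estimate from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 2.0                       # assumed true causal effect for this simulation

u = rng.normal(size=n)                # unobserved confounder: enters both D and Y
z = rng.normal(size=n)                # instrument: shifts D, excluded from Y
d = 0.8 * z + u + rng.normal(size=n)  # endogenous treatment (first-stage slope 0.8)
y = beta_true * d + u + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])

# Stage 1: regress D on the instrument, form fitted values D-hat
pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
d_hat = Z @ pi_hat

# Stage 2: regress Y on D-hat (point estimate only; standard errors from a
# manual second stage are invalid -- use a dedicated 2SLS routine for inference)
Xhat = np.column_stack([np.ones(n), d_hat])
beta_2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0][1]

# Naive OLS for comparison: biased upward because Cov(D, u) > 0
X = np.column_stack([np.ones(n), d])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]
```

With this DGP, `beta_ols` settles near 2.38 (the confounder pushes it up) while `beta_2sls` recovers a value close to the true 2.0.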

IV Estimates as LATE

When treatment effects are heterogeneous, the IV estimator does not recover the average treatment effect (ATE) for the entire population. Instead, it recovers the Local Average Treatment Effect (LATE) — the average effect for compliers, i.e., units whose treatment status is affected by the instrument (Imbens & Angrist, 1994).

This locality means the IV estimate may differ from the OLS estimate even if OLS were unbiased, because they estimate effects for different subpopulations. For settings where treatment effects are heterogeneous and researchers want to understand the full distribution of effects, methods like causal forests provide complementary tools.




B. Identification

Formal Conditions

For the model Y_i = \beta D_i + \varepsilon_i (suppressing controls for clarity), the IV estimator is consistent if:

  1. E[Z_i \varepsilon_i] = 0 (exogeneity / exclusion)
  2. E[Z_i D_i] \neq 0 (relevance)

Under these conditions:

\hat{\beta}_{IV} = \frac{\widehat{\text{Cov}}(Z, Y)}{\widehat{\text{Cov}}(Z, D)} \xrightarrow{p} \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, D)} = \frac{\beta \cdot \text{Cov}(Z, D)}{\text{Cov}(Z, D)} = \beta
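This probability limit can be checked numerically with the sample covariance ratio on simulated data (a hypothetical DGP with true \beta = 2, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.normal(size=n)                # confounder
z = rng.normal(size=n)                # instrument
d = 0.8 * z + u + rng.normal(size=n)  # endogenous regressor
y = 2.0 * d + u + rng.normal(size=n)  # true beta = 2

# Sample analog of Cov(Z, Y) / Cov(Z, D)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
```

As n grows, `beta_iv` converges to the true coefficient even though D is correlated with the error.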

The LATE Theorem

With a binary instrument Z \in \{0,1\} and binary treatment D \in \{0,1\}, there are four types of units:

  • Compliers: D(1) = 1, D(0) = 0 — take treatment when encouraged, do not when not
  • Always-takers: D(1) = D(0) = 1 — always take treatment
  • Never-takers: D(1) = D(0) = 0 — never take treatment
  • Defiers: D(1) = 0, D(0) = 1 — do the opposite of encouragement

Under the monotonicity assumption (no defiers), the Wald estimator

\hat{\beta}_{IV} = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}

identifies the LATE: the average treatment effect for compliers (Imbens & Angrist, 1994).
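A small simulation makes the theorem concrete. The type shares and effect sizes here are invented for illustration (compliers' effect 1.5, always-takers' 0.5); the Wald ratio recovers the complier effect, not a population average:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# Unit types under monotonicity: no defiers
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.integers(0, 2, size=n)                      # randomized binary instrument
d = np.where(types == "always", 1,
             np.where(types == "never", 0, z))      # only compliers follow Z
tau = np.where(types == "complier", 1.5, 0.5)       # heterogeneous effects (invented)
y = tau * d + rng.normal(size=n)

# Wald estimator: reduced form over first stage
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
```

`wald` lands near 1.5 — the compliers' effect — because always-takers and never-takers contribute identically to both arms and cancel out of the numerator.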


C. Visual Intuition

Picture a scatter plot of Y vs. D where D is endogenous. The OLS line through this cloud is biased because unobserved factors push both D and Y in the same direction.

Now imagine the instrument Z as a lever that shifts D left or right. Some units get a "push" (high Z) and others do not (low Z). The IV estimator looks at how much Y changes per unit of Z-induced change in D. It ignores the endogenous part of D entirely.

Think of it as plumbing. The ordinary relationship between D and Y is contaminated — dirty water (endogeneity) flows in. The instrument Z provides a clean source of variation in D. By isolating the variation in D that comes from Z (the clean water), you can estimate the effect of D on Y without contamination.

When confounding is zero, IV and OLS agree. As confounding increases, OLS drifts away from the truth while IV stays on target — but with wider confidence intervals. When instrument strength approaches zero, IV becomes erratic (the weak instrument problem).



D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: 2SLS projects the endogenous regressors onto the instrument space, then runs OLS on the projected values. The formula is beta-hat-IV equals (X'Pz X) inverse times (X'Pz Y).

Let \mathbf{Z} be the matrix of instruments (and exogenous regressors), \mathbf{X} be the matrix of all regressors (including endogenous ones), and \mathbf{Y} the outcome vector.

Step 1: Project X onto instrument space.

P_Z = Z(Z'Z)^{-1}Z', \qquad \hat{X} = P_Z X

Step 2: OLS of Y on projected X.

\hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y = (X'P_Z X)^{-1} X'P_Z Y

Consistency: Substitute Y = X\beta + u:

\hat{\beta}_{2SLS} = \beta + (X'P_Z X)^{-1} X'P_Z u

Under E[Z'u] = 0 and relevance (\text{rank}(E[Z'X]) = k):

\hat{\beta}_{2SLS} \xrightarrow{p} \beta

Variance (robust):

\hat{V}_{2SLS} = (X'P_Z X)^{-1} \left(\sum_i \hat{\varepsilon}_i^2 \hat{X}_i \hat{X}_i'\right) (X'P_Z X)^{-1}

Important: Standard errors in Stage 2 must be computed using the original X, not \hat{X}. Running two separate OLS regressions manually gives incorrect SEs; always use a dedicated 2SLS command.
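A sketch of these matrix formulas, including the robust sandwich variance with residuals built from the original X (the data are simulated; the true \beta = 2 and first-stage slope 0.8 are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
u = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * d + u + rng.normal(size=n)   # true beta = 2

X = np.column_stack([np.ones(n), d])   # all regressors, incl. the endogenous D
Z = np.column_stack([np.ones(n), z])   # instruments, incl. the constant

# X-hat = P_Z X, computed without forming the n x n projection matrix
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
A = X.T @ Xhat                          # X'P_Z X (symmetric)
beta = np.linalg.solve(A, Xhat.T @ y)   # (X'P_Z X)^{-1} X'P_Z Y

# Residuals use the ORIGINAL X, not X-hat -- the step manual two-stage OLS gets wrong
eps = y - X @ beta

# Robust sandwich: (X'PzX)^{-1} (sum_i eps_i^2 Xhat_i Xhat_i') (X'PzX)^{-1}
meat = (Xhat * (eps ** 2)[:, None]).T @ Xhat
V = np.linalg.solve(A, np.linalg.solve(A, meat).T)
se = np.sqrt(np.diag(V))
```

Swapping `eps` for the Stage-2 residuals `y - Xhat @ beta` is exactly the manual-two-step mistake: the point estimate is unchanged, but the variance is wrong.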

Bias of 2SLS in finite samples:

E[\hat{\beta}_{2SLS} - \beta] \approx \frac{1}{F+1} \cdot E[\hat{\beta}_{OLS} - \beta]

where F is the first-stage F-statistic. In the just-identified case (one instrument), the relative bias (as a fraction of the OLS bias) is approximately 1/(F+1) (Angrist & Pischke, 2009). For large F, this is well approximated by 1/F. When F is large, bias vanishes; when F is small, 2SLS inherits much of the OLS bias. Note that with many instruments, the standard F-statistic can be misleadingly large; the effective F-statistic of Montiel Olea and Pflueger (2013) is the appropriate measure in that case.

LIML (Limited Information Maximum Likelihood): An alternative to 2SLS that is less biased with weak instruments. In just-identified models (one instrument per endogenous variable), LIML and 2SLS are numerically identical.


E. Implementation

# Requires: fixest, ivreg, ivmodel
library(fixest)   # fixest: fast FE estimation with built-in IV support
library(ivreg)    # ivreg: classic 2SLS with comprehensive diagnostics

# --- Step 1: 2SLS estimation with fixest ---
# feols IV syntax: outcome ~ exogenous | FE | endogenous ~ instrument
# institutions is endogenous; settler_mortality is the instrument
# vcov = "HC1": heteroskedasticity-robust standard errors
iv_fit <- feols(gdp_pc ~ controls | 0 | institutions ~ settler_mortality,
              data = df, vcov = "HC1")
summary(iv_fit)
# Coefficient on institutions: causal effect (LATE for compliers)

# --- Step 2: First-stage diagnostics ---
# First-stage F: instrument strength (rule of thumb: F > 10, ideally > 104.7)
fitstat(iv_fit, type = "ivf")
# Sargan over-ID test: valid only with multiple instruments (H0: all instruments valid)
fitstat(iv_fit, type = "sargan")

# --- Step 3: 2SLS with ivreg (richer diagnostics) ---
# ivreg formula: outcome ~ endogenous + exogenous | instruments + exogenous
iv_fit2 <- ivreg(gdp_pc ~ institutions + controls | settler_mortality + controls,
               data = df)
# diagnostics = TRUE: reports weak instrument test, Wu-Hausman, and Sargan tests
summary(iv_fit2, diagnostics = TRUE)

# --- Step 4: LIML (less biased with weak instruments) ---
# LIML is approximately median-unbiased even when instruments are weak
# In the just-identified case (1 instrument, 1 endogenous), LIML = 2SLS
library(ivmodel)   # ivmodel: weak-instrument-robust IV inference
iv_mod <- ivmodel(Y = df$gdp_pc, D = df$institutions, Z = df$settler_mortality,
                X = as.matrix(df[, "controls", drop = FALSE]))
LIML(iv_mod)
# If LIML and 2SLS differ substantially, weak instruments are a concern

F. Diagnostics

First-Stage F-Statistic

A central diagnostic for IV. The rule of thumb F > 10 was proposed by Staiger and Stock (1997); Stock and Yogo (2005) formalized it with critical values for size distortion.

More recent guidance from Lee et al. (2022) shows that the standard first-stage F must exceed 104.7 for the conventional t-ratio critical value of 1.96 to control size at 5% in the just-identified case. Below that threshold, the authors provide an adjusted critical-value function (the tF procedure) that remains valid. In practice, F-statistics between 10 and 104.7 warrant the use of weak-instrument-robust inference methods such as the Anderson-Rubin test or the tF procedure.

Reduced Form

Always report the reduced form: regress Y directly on Z (and controls). If the reduced form is insignificant, IV will be imprecise even if the first stage is strong. The reduced form is the intent-to-treat (ITT) analog in the IV framework.
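The identity "IV = reduced form / first stage" can be verified numerically. The data are simulated, with the first-stage slope (0.8) and true \beta (2.0) chosen for this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
u = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)   # first stage: slope 0.8
y = 2.0 * d + u + rng.normal(size=n)   # true beta = 2

first_stage = np.cov(z, d)[0, 1] / np.var(z, ddof=1)   # slope of D on Z
reduced_form = np.cov(z, y)[0, 1] / np.var(z, ddof=1)  # slope of Y on Z (ITT analog)
beta_iv = reduced_form / first_stage                   # the IV estimate
```

Here `first_stage` is near 0.8, `reduced_form` near 2.0 × 0.8 = 1.6, and their ratio recovers the structural coefficient. A near-zero reduced form immediately signals that the IV estimate will be imprecise.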

Over-Identification Test (Sargan-Hansen J-Test)

When you have more instruments than endogenous variables, the J-test checks whether the instruments are consistent with each other. A rejection suggests at least one instrument violates the exclusion restriction. But beware: the test has low power when all instruments are invalid in the same direction. Rejection may also indicate treatment effect heterogeneity rather than instrument invalidity (Angrist & Pischke, 2009).

Weak Instrument Robust Inference

When the first-stage F is low, use the Anderson-Rubin (AR) test, which is valid regardless of instrument strength. The AR confidence set inverts a test of the reduced-form null and has correct size even with arbitrarily weak instruments.
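A minimal sketch of the AR idea on simulated data (the helper `ar_t_stat` is a name invented here, and the homoskedastic SE is a simplification — use robust SEs in practice): under H0: \beta = \beta_0, the residual Y − \beta_0 D should be unrelated to Z, so one regresses it on Z and tests the Z coefficient. The confidence set collects the \beta_0 values that are not rejected.

```python
import numpy as np

def ar_t_stat(y, d, z, beta0):
    """t-statistic on Z in a regression of y - beta0*d on [1, z].
    Under H0: beta = beta0 this coefficient is zero, however weak the
    instrument. (Homoskedastic SEs for brevity; use robust SEs in practice.)"""
    resid = y - beta0 * d
    Z = np.column_stack([np.ones(len(z)), z])
    coef = np.linalg.lstsq(Z, resid, rcond=None)[0]
    e = resid - Z @ coef
    s2 = e @ e / (len(z) - 2)
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return coef[1] / se

rng = np.random.default_rng(5)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * d + u + rng.normal(size=n)   # true beta = 2

t_true = ar_t_stat(y, d, z, 2.0)    # small |t|: beta0 = 2 is not rejected
t_wrong = ar_t_stat(y, d, z, 0.0)   # large |t|: beta0 = 0 is firmly rejected
```

Scanning `beta0` over a grid and keeping the values with |t| below the critical value traces out the AR confidence set, which remains valid however weak the first stage is.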

Hausman/Durbin-Wu-Hausman Test (OLS vs. IV)

Compare OLS and IV estimates formally. Under the null that OLS is consistent, OLS and IV should give similar estimates. If they differ significantly, OLS is likely biased. The regression-based Durbin-Wu-Hausman test is the practical implementation: estimate the first stage, obtain residuals, include them in the structural equation, and test their significance. Under the null of exogeneity, the coefficient on the residuals is zero (Wooldridge, 2010).
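The regression-based version can be sketched as follows on simulated endogenous data (the DGP, with a shared confounder and true \beta = 2, is invented for illustration). The first-stage residual carries the endogenous part of D, so its coefficient in the augmented structural equation tests exogeneity:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + u + rng.normal(size=n)   # endogenous: shares u with the outcome
y = 2.0 * d + u + rng.normal(size=n)   # true beta = 2

# Step 1: first stage, keep the residuals v-hat
Z = np.column_stack([np.ones(n), z])
v_hat = d - Z @ np.linalg.lstsq(Z, d, rcond=None)[0]

# Step 2: structural equation augmented with v-hat; test its coefficient
X = np.column_stack([np.ones(n), d, v_hat])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ coef
s2 = e @ e / (n - 3)
se_v = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
t_v = coef[2] / se_v   # large |t| -> reject that D is exogenous
```

A side benefit of this control-function regression: the coefficient on D itself (`coef[1]`) is a consistent estimate of \beta, matching 2SLS in this linear setting.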


Interpreting Your Results

  • Sign and magnitude: IV estimates are often larger than OLS (in absolute value). Two common explanations: (1) measurement error in D attenuates OLS estimates (attenuation bias), and IV corrects this attenuation; (2) IV identifies the LATE for compliers, who may respond more strongly to treatment than the average person. These explanations are not mutually exclusive, and distinguishing between them requires additional analysis.
  • Precision: IV estimates are typically much less precise than OLS. Wide confidence intervals are the norm, not the exception.
  • Compliers: Think carefully about who the compliers are. In the Acemoglu et al. example, compliers are countries whose institutional quality was determined by settler mortality. In the Angrist and Krueger (1991) quarter-of-birth example, compliers are people who got more education because of compulsory schooling laws — they would have dropped out sooner without the laws.

G. What Can Go Wrong

| Problem | What It Does | How to Fix It |
| --- | --- | --- |
| Weak instruments (F < 10) | IV is biased toward OLS; confidence intervals have wrong coverage | Use the Anderson-Rubin test; find a stronger instrument; use LIML |
| Exclusion restriction violated | IV is biased in an unknown direction; bias may be worse than OLS | Argue substantively; sensitivity analysis (Conley et al., 2012) |
| Manual two-step OLS | Running 2SLS manually with two separate OLS regressions gives wrong SEs | Use dedicated IV commands |
| Forbidden regression | Using nonlinear first-stage predictions (e.g., from probit) in a linear second stage — identification comes from functional form, not the exclusion restriction | Use a linear first stage, or the control function approach |
| Many weak instruments | Bias toward OLS increases with the number of instruments | Use LIML or JIVE; reduce the instrument count |
| Heterogeneous effects ignored | Interpreting LATE as ATE when they differ | Discuss complier characteristics; present OLS alongside |
| Instrument not excludable | A direct effect of Z on Y biases the IV estimate | Argue exclusion carefully; sensitivity analysis |

Weak Instrument Bias (F < 10)

Strong instrument: settler mortality has a first-stage F-statistic of 22.9

IV estimate of effect of institutions on log GDP: 0.94 (SE = 0.16). Bias relative to OLS is approximately 1/(22.9+1) ≈ 4%. The IV estimate is reliable.


Exclusion Restriction Violation

Rainfall affects civil conflict only through its effect on economic growth (the endogenous variable)

IV estimate of growth on conflict: -0.12 (SE = 0.04). If rainfall has no direct effect on conflict except through the economy, the exclusion restriction holds and the estimate is consistent.


Manual Two-Step OLS (Wrong Standard Errors)

Use a dedicated 2SLS command that computes standard errors correctly

ivregress 2sls y (D = Z), vce(robust). SE = 0.16. Correct inference because the command uses the original D (not fitted D-hat) in the variance formula.

For relaxing the exclusion restriction, see the plausibly exogenous instruments framework (Conley et al., 2012).


H. Practice

Guided Exercise

IV Validity: Instrumenting Military Service with Draft Lottery Numbers

Angrist (1990) estimates the effect of Vietnam-era military service on long-run earnings. The problem is that who serves is not random — men from disadvantaged backgrounds are more likely to enlist. The famous solution is to use draft lottery numbers (randomly assigned by birth date) as an instrument for military service.

What is the relevance condition for this instrument?

What is the exclusion restriction for this instrument?

If men with low lottery numbers were more likely to drop out of college (to avoid the draft), would this violate the exclusion restriction?

The 2SLS estimate represents the effect of military service for which group of men?

Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the effect of immigration on native wages. They instrument local immigrant share with historical immigrant settlement patterns (a shift-share instrument). Using data from 200 metropolitan areas, they report:

ivregress 2sls native_wage controls (immigrant_share = historical_share), vce(robust)

Coefficient on immigrant_share: -0.35 (SE = 0.12, p = 0.004). First-stage F = 45. "We find that a 1 percentage point increase in immigrant share causes a 0.35% decrease in native wages. The strong first stage (F = 45) confirms instrument validity."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

A finance researcher instruments CEO overconfidence (measured by option exercise behavior) with the CEO's birth order (first-born vs. later-born), citing psychology literature that first-borns are more confident. Using a cross-section of 800 firms:

First-stage F-statistic: 6.8
IV estimate of overconfidence on firm investment: 0.42 (SE = 0.25, p = 0.09)

"Although the first-stage F is below 10, the IV estimate is marginally significant, suggesting overconfident CEOs invest more aggressively. Birth order is clearly exogenous because it is determined at birth."

Select all errors you can find:

Concept Check

A researcher instruments 'years of schooling' with 'quarter of birth' to estimate the return to education. The first-stage F-statistic is 4.2. What is the main concern?

Concept Check

You estimate the effect of institutions on GDP using IV (settler mortality instrument) and OLS. The OLS estimate is 0.52 (SE = 0.06) and the IV estimate is 0.94 (SE = 0.16). Both are statistically significant. Why might the IV estimate be nearly twice as large?

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

A study examines whether R&D spending affects firm revenue. The authors instrument R&D spending with 'industry-average R&D spending' (the average R&D of all OTHER firms in the same industry). Using a panel of 5,000 firms over 10 years, they find a first-stage F-statistic of 28 and estimate that a $1M increase in R&D raises revenue by $4.2M (p < 0.01). They include firm and year fixed effects.

Key Table

| Variable | Coefficient | Robust SE | p-value |
| --- | --- | --- | --- |
| R&D (instrumented) | 4.200 | 1.100 | 0.000 |
| Firm Size (log) | 0.015 | 0.006 | 0.012 |
| Firm FE | Yes | | |
| Year FE | Yes | | |
| First-stage F | 28 | | |
| N | 50,000 | | |

Authors' Identification Claim

Industry-average R&D (excluding the focal firm) is correlated with the focal firm's R&D through technology spillovers but is uncorrelated with firm-specific revenue shocks.


I. Swap-In: When to Use Something Else

  • OLS with controls: When conditional exogeneity (selection on observables) is more credible than the exclusion restriction, and a rich set of covariates is available.
  • Regression discontinuity: When treatment is assigned by a threshold on a running variable — RDD provides a more transparent and locally randomized design.
  • Difference-in-differences: When a policy change provides before/after and treated/untreated variation without requiring an instrument.
  • Matching: When selection into treatment is primarily on observables and the overlap condition is satisfied.
  • Reduced form only: When the instrument is valid but weak (F < 10), reporting the reduced-form effect of the instrument on the outcome avoids the bias amplification of 2SLS.

J. Reviewer Checklist

Critical Reading Checklist


Paper Library

Foundational (12)

Angrist, J. D. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.

American Economic Review

Angrist uses the Vietnam-era draft lottery as a natural experiment in this landmark application of instrumental variables. He shows that randomly assigned lottery numbers provide an instrument for military service, allowing causal estimation of the earnings effect of military service.

Angrist, J. D., & Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings?.

Quarterly Journal of Economics · DOI: 10.2307/2937954

Angrist and Krueger use quarter of birth as an instrument for years of schooling, exploiting the fact that compulsory schooling laws interact with birth timing. This paper is one of the most-taught examples of instrumental variables in economics and also sparks important debates about weak instruments.

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.

Journal of the American Statistical Association · DOI: 10.1080/01621459.1996.10476902

Angrist, Imbens, and Rubin formalize the LATE framework — originally introduced in Imbens and Angrist (1994) — within the Rubin Causal Model, providing a detailed treatment of the assumptions required for causal interpretation of IV estimates. This paper introduces the complier taxonomy (always-takers, never-takers, compliers, defiers) that is now standard in the IV literature. The practical implication is that IV estimates should be interpreted as local to the complier subpopulation, not as average effects for the entire population.

Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak.

Journal of the American Statistical Association · DOI: 10.1080/01621459.1995.10476536

Bound, Jaeger, and Baker demonstrate that instrumental variables estimates can be severely biased when instruments are weakly correlated with the endogenous regressor. They show that with weak instruments, the finite-sample bias of IV approaches that of OLS, and that the standard IV confidence intervals can have coverage far below their nominal levels. The paper motivates the widespread practice of reporting first-stage F-statistics as a diagnostic for instrument strength.

Conley, T. G., Hansen, C. B., & Rossi, P. E. (2012). Plausibly Exogenous.

Review of Economics and Statistics · DOI: 10.1162/REST_a_00139

Conley, Hansen, and Rossi develop methods for inference when the exclusion restriction is 'plausibly' rather than exactly satisfied, parameterizing the degree of violation and constructing valid confidence intervals. This approach provides a formal sensitivity analysis for IV estimates, answering the question: how large would the violation of the exclusion restriction need to be to overturn the result? Applied researchers can use these methods to transparently assess the robustness of IV findings to a common critique.

Frake, J., Gibbs, A., Goldfarb, B., Hiraiwa, T., Starr, E., & Yamaguchi, S. (2025). From Perfect to Practical: Partial Identification Methods for Causal Inference in Strategic Management Research.

Strategic Management Journal · DOI: 10.1002/smj.3714

Frake and colleagues introduce partial identification methods to strategic management, providing a practical framework for assessing the sensitivity of difference-in-differences and instrumental variables estimates to violations of identifying assumptions. The paper demonstrates how researchers can construct informative bounds on treatment effects when parallel trends or exclusion restriction assumptions are relaxed. It bridges the gap between the theoretical ideal of point identification and the practical reality that identifying assumptions are rarely perfectly satisfied.

Imbens, G. W., & Angrist, J. D. (1994). Identification and Estimation of Local Average Treatment Effects.

Econometrica · DOI: 10.2307/2951620

Imbens and Angrist show that IV identifies the average causal effect for compliers -- the subpopulation whose treatment status is changed by the instrument -- under the monotonicity assumption, in this foundational paper on LATE. This reinterpretation fundamentally changes how researchers understand what IV estimates.

Lee, D. S., McCrary, J., Moreira, M. J., & Porter, J. (2022). Valid t-Ratio Inference for IV.

American Economic Review · DOI: 10.1257/aer.20211063

Lee, McCrary, Moreira, and Porter address the potentially severe large-sample distortions of t-ratio-based inference in the single-IV model. They introduce the tF critical value function, a standard error adjustment that is a smooth function of the first-stage F-statistic, which corrects for weak instrument bias. They find that for one-quarter of specifications in 61 AER papers, corrected standard errors are at least 49% larger than conventional 2SLS standard errors at the 5% significance level. The practical implication is that researchers using IV should apply their tF correction rather than relying on conventional standard errors.

Manski, C. F. (1993). Identification of Endogenous Social Effects: The Reflection Problem.

Review of Economic Studies · DOI: 10.2307/2298123

Manski formalizes the reflection problem in the analysis of social interactions: when individual outcomes depend on group averages, the group average is simultaneously determined by its members. This simultaneity makes it impossible to distinguish true social (endogenous) effects from correlated effects without additional structure or exclusion restrictions. The paper is essential reading for any researcher attempting to estimate peer effects or social spillovers.

Montiel Olea, J. L., & Pflueger, C. (2013). A Robust Test for Weak Instruments.

Journal of Business & Economic Statistics · DOI: 10.1080/00401706.2013.806694

Montiel Olea and Pflueger propose an effective F-statistic for testing weak instruments that is robust to heteroscedasticity, serial correlation, and clustering — unlike the conventional first-stage F. The effective F is now the standard diagnostic for instrument strength in applied IV research.

Staiger, D., & Stock, J. H. (1997). Instrumental Variables Regression with Weak Instruments.

Econometrica · DOI: 10.2307/2171753

Staiger and Stock show formally that when instruments are weak, 2SLS estimates are biased toward OLS and standard inference breaks down. This paper establishes the theoretical foundations for the weak instruments problem that Stock and Yogo (2005) later provided practical tests for.

Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression.

Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg · DOI: 10.1017/CBO9780511614491.006

Stock and Yogo develop formal critical value tables for testing whether instruments are 'weak'—that is, only weakly correlated with the endogenous variable. Their tables formalize the Staiger and Stock (1997) rule of thumb that the first-stage F-statistic should exceed 10, and are probably the most widely used diagnostic in applied IV research.

Application (7)

Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The Colonial Origins of Comparative Development: An Empirical Investigation.

American Economic Review · DOI: 10.1257/aer.91.5.1369

Acemoglu, Johnson, and Robinson use historical settler mortality as an instrument for institutional quality to estimate the causal effect of institutions on economic development in this celebrated paper. It is one of the most influential IV applications in economics and demonstrates the creativity required to find a plausible instrument.

Albouy, D. Y. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Comment.

American Economic Review · DOI: 10.1257/aer.102.6.3059

Albouy critically re-examines the settler mortality instrument used in Acemoglu et al. (2001), showing that the original results are sensitive to data coding decisions and the sample of countries included. This comment is a cautionary tale about instrument validity and the fragility of influential IV estimates.

Bennedsen, M., Nielsen, K. M., Pérez-González, F., & Wolfenzon, D. (2007). Inside the Family Firm: The Role of Families in Succession Decisions and Performance.

Quarterly Journal of Economics · DOI: 10.1162/qjec.122.2.647

Bennedsen et al. use the gender of the controlling family's firstborn child as an instrument for whether the successor CEO is a family member or a professional outsider. They find that family successions cause a large negative impact on firm performance, with operating profitability falling by at least four percentage points. The paper demonstrates how a creative natural experiment can address endogeneity in corporate governance research.

Bloom, N., & Van Reenen, J. (2007). Measuring and Explaining Management Practices Across Firms and Countries.

Quarterly Journal of Economics · DOI: 10.1162/qjec.2007.122.4.1351

Bloom and Van Reenen develop a survey-based measure of management practices and document that better management is strongly associated with higher productivity, profitability, and growth. They use IV strategies (including product market competition and primogeniture rules for family management succession) to investigate why management quality varies, finding that poor management is more prevalent when competition is weak and when family firms follow primogeniture. The paper is foundational for the measurement of management practices; the IV analysis is one component of a broader measurement and descriptive study.

Levitt, S. D. (1997). Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime.

American Economic Review

Levitt uses the timing of mayoral and gubernatorial elections as an instrument for police hiring to estimate the causal effect of police on crime. The paper illustrates the IV approach in a policy-relevant setting where the key concern is reverse causality (more crime leads to more police).

Miguel, E., Satyanath, S., & Sergenti, E. (2004). Economic Shocks and Civil Conflict: An Instrumental Variables Approach.

Journal of Political Economy. DOI: 10.1086/421174

Miguel, Satyanath, and Sergenti instrument for economic growth using rainfall variation to estimate the causal effect of economic shocks on civil conflict in Sub-Saharan Africa. Their paper is a clean and widely cited example of using weather as an instrumental variable, illustrating both the power and the exclusion restriction challenges of weather-based instruments.

Young, A. (2022). Consistency Without Inference: Instrumental Variables in Practical Application.

European Economic Review. DOI: 10.1016/j.euroecorev.2022.104112

Young reexamines published IV applications and argues that standard first-stage F-statistic diagnostics are largely uninformative of both size and bias under non-iid errors and high leverage. The paper finds that IV estimates in practice rarely demonstrate that OLS is biased, raising broader questions about the reliability of IV as commonly implemented.

Survey (8)

Andrews, I., Stock, J. H., & Sun, L. (2019). Weak Instruments in Instrumental Variables Regression: Theory and Practice.

Annual Review of Economics. DOI: 10.1146/annurev-economics-080218-025643

Andrews, Stock, and Sun provide an up-to-date review of the weak instruments problem, covering modern diagnostic tests, robust inference procedures, and practical recommendations. It is an excellent starting point for understanding the current best practices in IV estimation.

Angrist, J. D., & Krueger, A. B. (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments.

Journal of Economic Perspectives. DOI: 10.1257/jep.15.4.69

In this historical survey, Angrist and Krueger trace the evolution of IV from its origins in supply-and-demand estimation to modern natural experiments. They provide valuable context for understanding how IV methodology developed and why it became central to applied economics.

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.

Princeton University Press. DOI: 10.1515/9781400829828

Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.

Cunningham, S. (2021). Causal Inference: The Mixtape.

Yale University Press. DOI: 10.12987/9780300255881

Cunningham provides an accessible textbook with an excellent instrumental variables chapter that walks through the intuition, the math, and the code (in Stata and R). Freely available online at mixtape.scunning.com, it is a valuable companion for students who want worked examples alongside formal treatment.

Murray, M. P. (2006). Avoiding Invalid Instruments and Coping with Weak Instruments.

Journal of Economic Perspectives. DOI: 10.1257/jep.20.4.111

Murray provides practical guidance on evaluating instrument validity and dealing with weak instruments in applied work. Written in an accessible style, it helps applied researchers think critically about their instrument choices and provides concrete strategies for addressing common IV pitfalls.

Semadeni, M., Withers, M. C., & Certo, S. T. (2014). The Perils of Endogeneity and Instrumental Variables in Strategy Research: Understanding through Simulations.

Strategic Management Journal. DOI: 10.1002/smj.2136

Semadeni, Withers, and Certo use Monte Carlo simulations to demonstrate the dangers of using weak or invalid instruments in strategy research. They provide practical guidance for management scholars on when and how to use IV, and when it may do more harm than good.

Stock, J. H., Wright, J. H., & Yogo, M. (2002). A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments.

Journal of Business & Economic Statistics. DOI: 10.1198/073500102288618658

Stock, Wright, and Yogo survey the weak instruments and weak identification literature in IV and GMM settings, covering finite-sample bias toward OLS, size distortions in Wald tests, and practical diagnostic tools. The paper provides a comprehensive review of the theoretical landscape; the formal critical value tables now standard in applied work appear in the separate Stock and Yogo (2005) chapter.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.

MIT Press

Wooldridge's graduate textbook is the standard reference for cross-section and panel data econometrics. Chapter 5 provides a thorough treatment of instrumental variables estimation of single-equation linear models, including 2SLS, while later chapters cover general estimation methodology (MLE, GMM, M-estimation) with applications throughout. The book covers both linear and nonlinear models with careful attention to assumptions.

Tags

design-based · endogeneity · LATE