Instrumental Variables / 2SLS
Uses an external source of variation (instrument) that affects treatment but not the outcome directly.
Quick Reference
- When to Use
- When your key regressor is endogenous (correlated with the error term) and you have an instrument — a variable that affects the treatment but has no direct effect on the outcome.
- Key Assumption
- Relevance (instrument predicts the endogenous regressor, first-stage F > 10), exogeneity (instrument uncorrelated with the error), and the exclusion restriction (instrument affects the outcome only through the endogenous regressor). The exclusion restriction is untestable with a single instrument.
- Common Mistake
- Using a weak instrument (first-stage F < 10) without acknowledging the resulting bias toward OLS, or not reporting the first-stage F-statistic. Also, not recognizing that IV estimates LATE (the effect for compliers), not the ATE.
- Estimated Time
- 3.5 hours
One-Line Implementation
- Stata: `ivregress 2sls y x1 (treatment = instrument), first vce(robust)`
- R (fixest): `feols(y ~ x1 | 0 | treatment ~ instrument, data = df, vcov = "HC1")`
- Python (linearmodels): `IV2SLS(dependent=df['y'], exog=df[['const','x1']], endog=df['treatment'], instruments=df['instrument']).fit(cov_type='robust')`

Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: Colonial Origins of Comparative Development
Why are some countries rich and others poor? Acemoglu et al. (2001) proposed that institutions — the rules governing economic activity — are the key driver. But institutions are endogenous: rich countries invest in better institutions, creating a classic chicken-and-egg problem.
Their solution was an instrument: settler mortality in the colonial era. The argument runs as follows:
- In places where European settlers faced high mortality (tropical diseases), colonizers set up extractive institutions designed to transfer wealth to the metropole.
- In places where settlers could survive (temperate climates), they created inclusive institutions with property rights and rule of law.
- These institutional differences persisted and shaped modern economic outcomes.
- Critically, settler mortality from centuries ago affects current GDP only through its effect on institutions — it has no direct effect on economic performance today.
This claim is the exclusion restriction: the instrument (settler mortality) affects the outcome (GDP) only through the endogenous variable (institutions). If this holds, 2SLS can recover the causal effect of institutions on development. A related strategy that builds on this IV logic is the shift-share (Bartik) instrument, which interacts local exposure shares with national-level shocks to generate cross-sectional variation.
Whether the exclusion restriction actually holds in this case has been debated for two decades (see Albouy, 2012). That debate is itself a masterclass in IV methodology.
This strategy is the fundamental logic of instrumental variables: find an external source of variation that shifts the endogenous regressor without directly affecting the outcome, and use that variation to recover causal effects.
A. Overview
The Endogeneity Problem
Consider the regression:

$$Y_i = \alpha + \beta D_i + \varepsilon_i$$

If $\mathrm{Cov}(D_i, \varepsilon_i) \neq 0$ — the treatment or key regressor is correlated with the error — then OLS is biased and inconsistent. This endogeneity arises from omitted variables (confounding), reverse causality, or measurement error. Standard sensitivity analysis techniques can quantify how severe confounding must be to explain the estimated relationship, but when confounding is clearly present, OLS adjustment alone is insufficient.
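The bias is easy to see in simulation. A minimal numpy sketch (the parameter values are hypothetical, chosen to mirror the interactive demos later on this page):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

U = rng.normal(size=n)                      # unobserved confounder
D = 0.5 * U + rng.normal(size=n)            # treatment picks up U: endogenous
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)  # true causal effect of D is 2.0

# OLS slope of Y on D is Cov(D, Y) / Var(D); the Cov(D, U) channel inflates it
beta_ols = np.cov(D, Y, ddof=1)[0, 1] / np.var(D, ddof=1)
print(round(beta_ols, 2))  # well above the true value of 2.0
```

With these values the OLS probability limit is $2 + 2 \cdot 0.5 / 1.25 = 2.8$: no amount of data fixes the bias.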
The IV Solution
An instrumental variable $Z$ solves this endogeneity problem by isolating the part of the variation in $D$ that is uncorrelated with the error $\varepsilon$. The instrument must satisfy three conditions for consistent estimation, plus a fourth for the LATE interpretation:
- Relevance: $\mathrm{Cov}(Z, D) \neq 0$ — the instrument must actually affect the endogenous variable. This condition is testable.
- Independence (Exogeneity): $\mathrm{Cov}(Z, \varepsilon) = 0$ — the instrument must be uncorrelated with the error term. This condition is not directly testable with a single instrument.
- Exclusion Restriction: $Z$ affects $Y$ only through $D$ — there is no direct effect. This restriction is a maintained assumption that must be argued substantively.
- Monotonicity (for LATE interpretation): The instrument affects treatment status in only one direction for all units — there are no "defiers." This assumption is required for the IV estimate to be interpretable as the average effect for compliers.
Two-Stage Least Squares (2SLS)
The estimation proceeds in two stages:
Stage 1: Regress the endogenous variable on the instrument(s) and controls:

$$D_i = \pi_0 + \pi_1 Z_i + \gamma' X_i + \nu_i$$

Stage 2: Regress the outcome on the predicted values from Stage 1 and the controls:

$$Y_i = \alpha + \beta \hat{D}_i + \delta' X_i + u_i$$

The coefficient $\hat{\beta}$ uses only the variation in $D$ that is driven by $Z$ — purging the endogenous component.
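The two stages can be sketched directly in numpy (hypothetical DGP values; note that the standard errors from a manual second stage are wrong, as Section D explains):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

Z = rng.normal(size=n)                       # instrument
U = rng.normal(size=n)                       # unobserved confounder
D = 0.8 * Z + 0.5 * U + rng.normal(size=n)   # first stage: Z shifts D
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true causal effect is 2.0

ones = np.ones(n)

# Stage 1: regress D on Z (and a constant), keep the fitted values D_hat
X1 = np.column_stack([ones, Z])
pi_hat = np.linalg.lstsq(X1, D, rcond=None)[0]
D_hat = X1 @ pi_hat

# Stage 2: regress Y on D_hat (and a constant); the slope is the 2SLS estimate
X2 = np.column_stack([ones, D_hat])
beta_2sls = np.linalg.lstsq(X2, Y, rcond=None)[0][1]
print(round(beta_2sls, 2))  # close to the true effect of 2.0
```

OLS on the same data would land near 2.8; projecting $D$ onto $Z$ first strips out the confounded variation.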
IV Estimates as LATE
When treatment effects are heterogeneous, the IV estimator does not recover the average treatment effect (ATE) for the entire population. Instead, it recovers the Local Average Treatment Effect (LATE) — the average effect for compliers, i.e., units whose treatment status is affected by the instrument.
This locality means the IV estimate may differ from the OLS estimate even if OLS were unbiased, because they estimate effects for different subpopulations (Imbens & Angrist, 1994). For settings where treatment effects are heterogeneous and researchers want to understand the full distribution of effects, methods like causal forests provide complementary tools.
B. Identification
Formal Conditions
For the model $Y_i = \beta D_i + \varepsilon_i$ (suppressing controls for clarity), the IV estimator is consistent if:
- $\mathrm{Cov}(Z_i, \varepsilon_i) = 0$ (exogeneity / exclusion)
- $\mathrm{Cov}(Z_i, D_i) \neq 0$ (relevance)
Under these conditions:

$$\beta_{IV} = \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, D)}$$
The LATE Theorem
With a binary instrument $Z_i$ and binary treatment $D_i$, there are four types of units (Angrist et al., 1996):
- Compliers: $D_i(1) = 1,\; D_i(0) = 0$ — take treatment when encouraged, do not when not
- Always-takers: $D_i(1) = D_i(0) = 1$ — always take treatment
- Never-takers: $D_i(1) = D_i(0) = 0$ — never take treatment
- Defiers: $D_i(1) = 0,\; D_i(0) = 1$ — do the opposite of encouragement
Under the monotonicity assumption (no defiers), the Wald estimator

$$\hat{\beta}_{Wald} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}$$

identifies the LATE: the average treatment effect for compliers.
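A sketch of the Wald estimator with a randomized binary encouragement and one-sided non-compliance (hypothetical DGP; the treatment effect is constant here, so the LATE coincides with the ATE):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

Z = rng.integers(0, 2, size=n)              # randomized binary encouragement
U = rng.normal(size=n)                      # unobserved factor driving selection

always_taker = U > 1.0                      # ~16% take treatment regardless of Z
D = ((Z == 1) | always_taker).astype(float)
Y = 2.0 * D + 1.5 * U + rng.normal(size=n)  # constant effect of 2.0

num = Y[Z == 1].mean() - Y[Z == 0].mean()   # ITT: reduced-form effect of Z on Y
den = D[Z == 1].mean() - D[Z == 0].mean()   # first stage: the complier share
wald = num / den
print(round(wald, 2))  # recovers the causal effect despite U driving selection
```

A naive comparison of treated vs. untreated units would be contaminated by the always-takers' high $U$; scaling the ITT by the complier share removes that selection.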
C. Visual Intuition
Picture a scatter plot of $Y$ vs. $D$ where $D$ is endogenous. The OLS line through this cloud is biased because unobserved factors push both $D$ and $Y$ in the same direction.
Now imagine the instrument $Z$ as a lever that shifts $D$ left or right. Some units get a "push" (high $Z$) and others do not (low $Z$). The IV estimator looks at how much $Y$ changes per unit of $Z$-induced change in $D$. It ignores the endogenous part of $D$ entirely.
Think of it as plumbing. The ordinary relationship between $D$ and $Y$ is contaminated — dirty water (endogeneity) flows in. The instrument provides a clean source of variation in $D$. By isolating the variation in $D$ that comes from $Z$ (the clean water), you can estimate the effect of $D$ on $Y$ without contamination.
When confounding is zero, IV and OLS agree. As confounding increases, OLS drifts away from the truth while IV stays on target — but with wider confidence intervals. When instrument strength approaches zero, IV becomes erratic (the weak instrument problem).
Instrument Strength and Bias
IV recovers the true causal effect when the instrument is strong and the exclusion restriction holds. A weak instrument (low first-stage F) amplifies any small exclusion restriction violation into large bias.
Computed Results
- IV Estimate: 2.00
- OLS Estimate (biased): 3.50
- IV Bias from Violation: 0.000
Instrumental Variables (IV / 2SLS)
Explore how IV/2SLS corrects for endogeneity when an unobserved confounder U biases OLS. The DGP is D = 0.8·Z + 0.50·U + ν and Y = 2.0·D + 2·U + ε.
Regression Results
| Estimator | β̂ | Bias | 1st-stage F |
|---|---|---|---|
| OLS (biased) | 2.563 | +0.563 | — |
| IV / 2SLS | 2.180 | +0.180 | 137.7 |
| True β | 2.000 | — | — |
IV corrects the bias. OLS is biased by +0.56 due to the unobserved confounder, while the IV estimate (2.18) is much closer to the true β = 2.0.
Why Instruments? Isolating Exogenous Variation
IV DGP: Y = 2.0·D + 2·U + ε, where D = 0.7·Z + 1.2·U + ν. Confounding strength = 0.6. Exclusion violation = 0.0.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| OLS | 2.866 | 0.056 | [2.76, 2.98] | +0.866 |
| IV / 2SLS | 2.179 | 0.330 | [1.53, 2.83] | +0.179 |
| True β | 2.000 | — | — | — |
Why the difference?
OLS is biased (+0.87) because D is endogenous: the confounder U pushes both D and Y in the same direction, inflating the estimated relationship. With confounding strength = 0.6, OLS attributes to D the effect that actually comes from U. IV isolates the exogenous variation in D driven by instrument Z (π̂ = 0.77, F = 97.4). The Wald ratio Cov(Z,Y)/Cov(Z,D) = 2.179 removes the confounding bias, yielding an estimate much closer to the truth.
D. Mathematical Derivation
Don't worry about the notation yet; in words, 2SLS projects the endogenous regressors onto the instrument space, then runs OLS on the projected values:

$$\hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z Y$$
Let $Z$ ($n \times \ell$) be the matrix of instruments (and exogenous regressors), $X$ ($n \times k$) the matrix of all regressors (including endogenous ones), and $Y$ the outcome vector.

Step 1: Project X onto instrument space.

$$\hat{X} = P_Z X, \qquad P_Z = Z(Z'Z)^{-1}Z'$$

Step 2: OLS of Y on projected X.

$$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y = (X'P_Z X)^{-1} X'P_Z Y$$

Consistency: Substitute $Y = X\beta + \varepsilon$:

$$\hat{\beta}_{IV} = \beta + (X'P_Z X)^{-1} X'P_Z \varepsilon$$

Under $E[Z'\varepsilon] = 0$ and relevance ($\mathrm{rank}\, E[Z'X] = k$):

$$\hat{\beta}_{IV} \xrightarrow{p} \beta$$

Variance (robust):

$$\widehat{\mathrm{Var}}(\hat{\beta}_{IV}) = (X'P_Z X)^{-1} X'P_Z \hat{\Omega} P_Z X (X'P_Z X)^{-1}, \qquad \hat{\Omega} = \mathrm{diag}(\hat{\varepsilon}_i^2)$$
Important: Standard errors in Stage 2 must be computed using the original $X$, not $\hat{X}$. Running two separate OLS regressions manually gives incorrect SEs. Always use a dedicated 2SLS command.
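To see why the residual choice matters, here is a minimal numpy sketch (hypothetical DGP; homoskedastic SEs only) that computes 2SLS in matrix form and contrasts the correct SE with the forbidden-regression SE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

z = rng.normal(size=n)
U = rng.normal(size=n)
D = 0.8 * z + 0.5 * U + rng.normal(size=n)
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true beta = 2.0

ones = np.ones(n)
Z = np.column_stack([ones, z])    # instrument matrix (constant instruments itself)
X = np.column_stack([ones, D])

# 2SLS: beta = (X' Pz X)^{-1} X' Pz Y, computing Pz X without forming Pz
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
XtPzX = PzX.T @ X                 # equals X' Pz X (Pz is symmetric idempotent)
beta = np.linalg.solve(XtPzX, PzX.T @ Y)

# homoskedastic SEs: sigma^2 (X' Pz X)^{-1}; sigma^2 must use the ORIGINAL X
e_correct = Y - X @ beta          # right: residuals from the original regressors
e_wrong = Y - PzX @ beta          # "forbidden regression": residuals from X-hat

cov_inv = np.linalg.inv(XtPzX)
se_correct = np.sqrt(e_correct @ e_correct / (n - 2) * np.diag(cov_inv))[1]
se_wrong = np.sqrt(e_wrong @ e_wrong / (n - 2) * np.diag(cov_inv))[1]
print(round(beta[1], 2), round(se_correct, 3), round(se_wrong, 3))
```

The point estimate is identical either way; only the residual variance differs. In this DGP the manual SE is too large, but in other DGPs it can be too small — it is simply wrong.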
Bias of 2SLS in finite samples:

$$E[\hat{\beta}_{2SLS} - \beta] \approx \frac{E[\hat{\beta}_{OLS} - \beta]}{F}$$

where $F$ is the first-stage F-statistic. In the just-identified case (one instrument), the relative bias (as a fraction of the OLS bias) is approximately $1/F$. This follows because the concentration parameter satisfies $\mu^2 \approx \ell(F - 1)$, and the Nagar (1959) approximation gives relative bias of order $\ell/\mu^2$. When $F$ is large, bias vanishes. When $F$ is small, 2SLS inherits much of the OLS bias. Note that with many instruments, the standard F-statistic can be misleadingly large; the effective F-statistic of Montiel Olea and Pflueger (2013) is the appropriate measure in that case.
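A small Monte Carlo sketch (hypothetical parameter values; three deliberately weak instruments) illustrates the pattern: when the first-stage F is far below 10, the 2SLS median lands between the truth and OLS:

```python
import numpy as np

rng = np.random.default_rng(4)
n, L, reps = 100, 3, 2000
beta_true, pi = 2.0, 0.1        # tiny first-stage coefficients: weak instruments

b_ols, b_iv, Fs = [], [], []
for _ in range(reps):
    Z = rng.normal(size=(n, L))
    U = rng.normal(size=n)
    D = Z @ np.full(L, pi) + 0.5 * U + rng.normal(size=n)
    Y = beta_true * D + 2.0 * U + rng.normal(size=n)

    Zc = Z - Z.mean(axis=0)
    Dc, Yc = D - D.mean(), Y - Y.mean()

    g = np.linalg.lstsq(Zc, Dc, rcond=None)[0]      # first-stage coefficients
    rss1 = np.sum((Dc - Zc @ g) ** 2)
    Fs.append(((np.sum(Dc ** 2) - rss1) / L) / (rss1 / (n - L - 1)))

    Dhat = Zc @ g
    b_iv.append((Dhat @ Yc) / (Dhat @ Dc))          # 2SLS slope
    b_ols.append((Dc @ Yc) / (Dc @ Dc))             # OLS slope

print(f"mean F = {np.mean(Fs):.1f}, "
      f"median OLS = {np.median(b_ols):.2f}, median 2SLS = {np.median(b_iv):.2f}")
```

With a mean first-stage F around 2, the 2SLS distribution is pulled noticeably toward the OLS value instead of centering on the truth.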
LIML (Limited Information Maximum Likelihood): An alternative to 2SLS that is less biased with weak instruments. In just-identified models (one instrument per endogenous variable), LIML and 2SLS are numerically identical.
E. Implementation
library(fixest)
library(ivreg)
# 2SLS with fixest
iv_fit <- feols(gdp_pc ~ controls | 0 | institutions ~ settler_mortality,
data = df, vcov = "HC1")
summary(iv_fit)
# First-stage diagnostics
fitstat(iv_fit, type = "ivf") # First-stage F
fitstat(iv_fit, type = "sargan") # Over-ID test (if applicable)
# 2SLS with ivreg (more diagnostics)
iv_fit2 <- ivreg(gdp_pc ~ institutions + controls | settler_mortality + controls,
data = df)
summary(iv_fit2, diagnostics = TRUE)
# LIML (via the ivmodel package)
library(ivmodel)
iv_mod <- ivmodel(Y = df$gdp_pc, D = df$institutions, Z = df$settler_mortality,
X = as.matrix(df[, "controls", drop = FALSE]))
LIML(iv_mod)

F. Diagnostics
First-Stage F-Statistic
A central diagnostic for IV. The rule of thumb from Staiger and Stock (1997) is $F > 10$.
More recent guidance from Lee et al. (2022) shows that the standard first-stage F must exceed 104.7 for the conventional $t$-ratio critical value of 1.96 to control size at 5% in the just-identified case. Below that threshold, the authors provide an adjusted critical-value function (the tF procedure) that remains valid. In practice, F-statistics between 10 and 104.7 warrant the use of weak-instrument-robust inference methods such as the Anderson-Rubin test or the tF procedure.
Reduced Form
Always report the reduced-form regression: regress $Y$ directly on $Z$ (and controls). If the reduced form is insignificant, IV will be imprecise (even if the first stage is strong). The reduced form is the ITT analog in the IV framework.
Over-Identification Test (Sargan-Hansen J-Test)
When you have more instruments than endogenous variables, the J-test checks whether the instruments are consistent with each other. A rejection suggests at least one instrument violates the exclusion restriction. But beware: the test has low power when all instruments are invalid in the same direction.
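As a sketch of the mechanics (hypothetical two-instrument DGP; both instruments are valid here, so the test should not reject), the Sargan statistic is $n R^2$ from regressing the 2SLS residuals on the instruments:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

Z = rng.normal(size=(n, 2))               # two instruments, both valid
U = rng.normal(size=n)
D = Z @ np.array([0.6, 0.4]) + 0.5 * U + rng.normal(size=n)
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)

Zc = Z - Z.mean(axis=0)
Dc, Yc = D - D.mean(), Y - Y.mean()

# 2SLS slope, then Sargan: n * R^2 of 2SLS residuals regressed on Z
g = np.linalg.lstsq(Zc, Dc, rcond=None)[0]
Dhat = Zc @ g
beta = (Dhat @ Yc) / (Dhat @ Dc)
e = Yc - beta * Dc                        # residuals use the ORIGINAL D

h = np.linalg.lstsq(Zc, e, rcond=None)[0]
J = n * (1 - np.sum((e - Zc @ h) ** 2) / np.sum(e ** 2))
print(round(J, 2))  # compare to chi-square with (2 - 1) = 1 degree of freedom
```

With valid instruments, $J$ is distributed $\chi^2_{\ell - k}$; a large value flags that at least one instrument fails exclusion (or that effects are heterogeneous across instruments).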
Weak Instrument Robust Inference
When the first-stage F is low, use the Anderson-Rubin (AR) test, which is valid regardless of instrument strength. The AR confidence set inverts a test of the reduced-form null and has correct size even with arbitrarily weak instruments.
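The AR test is simple to compute by hand. A sketch under a hypothetical DGP: for a candidate value $\beta_0$, regress $Y - \beta_0 D$ on $Z$ and F-test the instrument coefficients:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

Z = rng.normal(size=n)
U = rng.normal(size=n)
D = 0.2 * Z + 0.5 * U + rng.normal(size=n)   # deliberately weak-ish first stage
Y = 2.0 * D + 2.0 * U + rng.normal(size=n)   # true beta = 2.0

def ar_stat(b0):
    # F-test of Z in the regression of (Y - b0*D) on Z; valid at any strength
    r = Y - b0 * D
    rc, zc = r - r.mean(), Z - Z.mean()
    slope = (zc @ rc) / (zc @ zc)
    rss1 = np.sum((rc - slope * zc) ** 2)
    return (np.sum(rc ** 2) - rss1) / (rss1 / (n - 2))

print(round(ar_stat(2.0), 1), round(ar_stat(3.5), 1))
# the statistic at the true value is far smaller than at a distant value
```

Collecting all $\beta_0$ values whose AR statistic falls below the relevant F critical value gives the AR confidence set; with a very weak instrument that set can be wide or even unbounded, which is the honest answer.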
Hausman/Durbin-Wu-Hausman Test (OLS vs. IV)
Compare OLS and IV estimates formally. Under the null that OLS is consistent, OLS and IV should give similar estimates. If they differ significantly, OLS is likely biased.
Interpreting Results
- Sign and magnitude: IV estimates are often larger than OLS (in absolute value). Two common explanations: (1) measurement error in $D$ attenuates OLS estimates, and IV corrects this attenuation; (2) IV identifies the LATE for compliers, who may respond more strongly to treatment than the average person. These explanations are not mutually exclusive, and distinguishing between them requires additional analysis.
- Precision: IV estimates are typically much less precise than OLS. Wide confidence intervals are the norm, not the exception.
- Compliers: Think carefully about who the compliers are. In the Acemoglu et al. example, compliers are countries whose institutional quality was determined by settler mortality. In the Angrist and Krueger (1991) quarter-of-birth example, compliers are people who would have gotten more education if not for compulsory schooling laws.
G. What Can Go Wrong
| Problem | What It Does | How to Fix It |
|---|---|---|
| Weak instruments (first-stage $F < 10$) | IV is biased toward OLS; confidence intervals have wrong coverage | Use Anderson-Rubin test; find a stronger instrument; use LIML |
| Exclusion restriction violated | IV is biased in an unknown direction; bias may be worse than OLS | Argue substantively; sensitivity analysis (Conley et al., 2012) |
| Forbidden regression | Running 2SLS manually with two OLS steps gives wrong SEs | Use dedicated IV commands |
| Many weak instruments | Bias toward OLS increases with number of instruments | Use LIML estimator or JIVE; reduce instrument count |
| Heterogeneous effects ignored | Interpreting LATE as ATE when they differ | Discuss complier characteristics; present OLS alongside |
| Instrument not excludable | Direct effect of Z on Y biases the IV estimate | Argue exclusion carefully; sensitivity analysis |
Weak Instrument Bias (F < 10)
Strong instrument: settler mortality has a first-stage F-statistic of 22.9
IV estimate of effect of institutions on log GDP: 0.94 (SE = 0.16). Bias relative to OLS is approximately 1/22.9 = 4%. The IV estimate is reliable.
Exclusion Restriction Violation
Rainfall affects civil conflict only through its effect on economic growth (the endogenous variable)
IV estimate of growth on conflict: -0.12 (SE = 0.04). If rainfall has no direct effect on conflict except through the economy, the exclusion restriction holds and the estimate is consistent.
The Forbidden Regression (Manual Two-Step OLS)
Use a dedicated 2SLS command that computes standard errors correctly
ivregress 2sls y (D = Z), vce(robust). SE = 0.16. Correct inference because the command uses the original D (not fitted D-hat) in the variance formula.
H. Practice
IV Validity: Instrumenting Military Service with Draft Lottery Numbers
Angrist (1990) estimates the effect of Vietnam-era military service on long-run earnings. The problem is that who serves is not random — men from disadvantaged backgrounds are more likely to enlist. The famous solution is to use draft lottery numbers (randomly assigned by birth date) as an instrument for military service.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
A researcher instruments 'years of schooling' with 'quarter of birth' to estimate the return to education. The first-stage F-statistic is 4.2. What is the main concern?
You estimate the effect of institutions on GDP using IV (settler mortality instrument) and OLS. The OLS estimate is 0.52 (SE = 0.06) and the IV estimate is 0.94 (SE = 0.16). Both are statistically significant. Why might the IV estimate be nearly twice as large?
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
A study examines whether R&D spending affects firm revenue. The authors instrument R&D spending with 'industry-average R&D spending' (the average R&D of all OTHER firms in the same industry). Using a panel of 5,000 firms over 10 years, they find a first-stage F-statistic of 28 and estimate that a \$1M increase in R&D raises revenue by \$4.2M (p < 0.01). They include firm and year fixed effects.
Key Table
| Variable | Coefficient | Robust SE | p-value |
|---|---|---|---|
| R&D (instrumented) | 4.200 | 1.100 | 0.000 |
| Firm Size (log) | 0.015 | 0.006 | 0.012 |
| Firm FE | Yes | ||
| Year FE | Yes | ||
| First-stage F | 28 | ||
| N | 50,000 |
Authors' Identification Claim
Industry-average R&D (excluding the focal firm) is correlated with the focal firm's R&D through technology spillovers but is uncorrelated with firm-specific revenue shocks.
I. Swap-In: When to Use Something Else
- OLS with controls: When conditional exogeneity (selection on observables) is more credible than the exclusion restriction, and a rich set of covariates is available.
- Regression discontinuity: When treatment is assigned by a threshold on a running variable — RDD provides a more transparent and locally randomized design.
- Difference-in-differences: When a policy change provides before/after and treated/untreated variation without requiring an instrument.
- Matching: When selection into treatment is primarily on observables and the overlap condition is satisfied.
- Reduced form only: When the instrument is valid but weak (F < 10), reporting the reduced-form effect of the instrument on the outcome avoids the bias amplification of 2SLS.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (9)
Angrist, J. D., & Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings?.
Angrist and Krueger used quarter of birth as an instrument for years of schooling, exploiting the fact that compulsory schooling laws interact with birth timing. This paper is one of the most-taught examples of instrumental variables in economics and also sparked important debates about weak instruments.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.
This paper clarified what IV actually estimates: the Local Average Treatment Effect (LATE), which is the causal effect for 'compliers'—people whose treatment status is changed by the instrument. This reinterpretation fundamentally changed how researchers think about IV estimates and their external validity.
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression.
Stock and Yogo developed critical values for testing whether instruments are 'weak'—that is, only weakly correlated with the endogenous variable. Their rule of thumb that the first-stage F-statistic should exceed 10 is probably the most widely used diagnostic in applied IV research.
Staiger, D., & Stock, J. H. (1997). Instrumental Variables Regression with Weak Instruments.
Staiger and Stock showed formally that when instruments are weak, 2SLS estimates are biased toward OLS and standard inference breaks down. This paper established the theoretical foundations for the weak instruments problem that Stock and Yogo (2005) later provided practical tests for.
Lee, D. S., McCrary, J., Moreira, M. J., & Porter, J. (2022). Valid t-Ratio Inference for IV.
Lee, McCrary, Moreira, and Porter showed that the conventional t-ratio in IV regression has correct size when the first-stage F-statistic exceeds 104.7, far above the traditional Stock-Yogo threshold of 10. This paper fundamentally raised the bar for what constitutes a sufficiently strong instrument and has prompted researchers to reconsider previously accepted IV results.
Imbens, G. W., & Angrist, J. D. (1994). Identification and Estimation of Local Average Treatment Effects.
The foundational paper on LATE. Showed that IV identifies the average causal effect for compliers -- the subpopulation whose treatment status is changed by the instrument -- under the monotonicity assumption. This reinterpretation fundamentally changed how researchers understand what IV estimates.
Montiel Olea, J. L., & Pflueger, C. (2013). A Robust Test for Weak Instruments.
Proposes an effective F-statistic for testing weak instruments that is robust to heteroscedasticity, serial correlation, and clustering — unlike the conventional first-stage F. The effective F is now the standard diagnostic for instrument strength in applied IV research.
Angrist, J. D. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.
A landmark application of instrumental variables using the Vietnam-era draft lottery as a natural experiment. Angrist showed that randomly assigned lottery numbers provide an instrument for military service, allowing causal estimation of the earnings effect of military service.
Manski, C. F. (1993). Identification of Endogenous Social Effects: The Reflection Problem.
Formalized the reflection problem: when individual outcomes depend on group averages, the group average is simultaneously determined by its members, making it impossible to distinguish true social (endogenous) effects from correlated effects without additional structure.
Application (7)
Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The Colonial Origins of Comparative Development: An Empirical Investigation.
This celebrated paper used historical settler mortality as an instrument for institutional quality to estimate the causal effect of institutions on economic development. It is one of the most influential IV applications in economics and demonstrates the creativity required to find a plausible instrument.
Levitt, S. D. (1997). Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime.
Levitt used the timing of mayoral and gubernatorial elections as an instrument for police hiring to estimate the causal effect of police on crime. The paper illustrates the IV approach in a policy-relevant setting where the key concern is reverse causality (more crime leads to more police).
Bloom, N., & Van Reenen, J. (2007). Measuring and Explaining Management Practices Across Firms and Countries.
Bloom and Van Reenen developed a survey-based measure of management practices and used IV strategies (including firm age and governance rules) to study the causal relationship between management quality and firm productivity. This paper is a prominent IV application in management and organizational economics.
Semadeni, M., Withers, M. C., & Certo, S. T. (2014). The Perils of Endogeneity and Instrumental Variables in Strategy Research: Understanding through Simulations.
This paper used Monte Carlo simulations to demonstrate the dangers of using weak or invalid instruments in strategy research. It provides practical guidance for management scholars on when and how to use IV, and when it may do more harm than good.
Albouy, D. Y. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Comment.
Albouy critically re-examined the settler mortality instrument used in Acemoglu et al. (2001), showing that the original results are sensitive to data coding decisions and the sample of countries included. This comment is a cautionary tale about instrument validity and the fragility of influential IV estimates.
Miguel, E., Satyanath, S., & Sergenti, E. (2004). Economic Shocks and Civil Conflict: An Instrumental Variables Approach.
Instruments for economic growth using rainfall variation to estimate the causal effect of economic shocks on civil conflict in Sub-Saharan Africa. A clean and widely cited example of using weather as an instrumental variable, illustrating both the power and the exclusion restriction challenges of weather-based instruments.
Young, A. (2022). Consistency Without Inference: Instrumental Variables in Practical Application.
A provocative assessment showing that many published IV applications have first-stage F-statistics too weak for reliable inference when examined under modern standards. Highlights the gap between theoretical requirements for valid IV and actual practice in published research.
Survey (6)
Andrews, I., Stock, J. H., & Sun, L. (2019). Weak Instruments in Instrumental Variables Regression: Theory and Practice.
This survey provides an up-to-date review of the weak instruments problem, covering modern diagnostic tests, robust inference procedures, and practical recommendations. It is an excellent starting point for understanding the current best practices in IV estimation.
Stock, J. H., Wright, J. H., & Yogo, M. (2002). A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments.
A comprehensive treatment of weak instruments and their consequences for inference in IV and GMM settings. Covers the theoretical foundations of the weak instrument problem and practical diagnostic tools.
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.
Chapter 4 provides an accessible yet rigorous treatment of instrumental variables, two-stage least squares, and the LATE framework. The go-to textbook reference for understanding IV estimation in the context of modern applied econometrics.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Chapter 5 offers a comprehensive graduate-level treatment of IV estimation, including GMM, tests for overidentification, and the relationship between IV and control function approaches. The standard graduate econometrics textbook reference for IV methods.
Angrist, J. D., & Krueger, A. B. (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments.
A historical survey tracing the evolution of IV from its origins in supply-and-demand estimation to modern natural experiments. Provides valuable context for understanding how IV methodology developed and why it became central to applied economics.
Murray, M. P. (2006). Avoiding Invalid Instruments and Coping with Weak Instruments.
Practical guidance on evaluating instrument validity and dealing with weak instruments in applied work. Written in an accessible style, it helps applied researchers think critically about their instrument choices and provides concrete strategies for addressing common IV pitfalls.