MethodAtlas
Guide

How to Read an Empirical Paper

A systematic framework for reading causal inference papers critically. Learn to identify the estimand, evaluate identification strategies, assess evidence quality, and judge credibility.

Why You Need a Framework

Reading an empirical paper is not like reading a novel. You cannot start at the beginning, proceed linearly, and expect understanding to emerge at the end. Empirical papers are structured around an argument — a claim that a particular research design credibly identifies a causal effect — and your job as a reader is to evaluate that argument systematically.

Without a framework, you will fall into one of two traps. Either you will accept the paper's claims uncritically because the methods look sophisticated, or you will reject them reflexively because you can always imagine some confounder. Neither is productive. What you need is a structured approach that helps you assess how credible the evidence is and what would need to be true for the conclusions to hold.

Step 1: Identify the Research Question

Before you look at a single equation or table, identify the core research question. It should be expressible as: What is the causal effect of X on Y?

Ask yourself three things:

  • Is this a causal question? Many papers use causal language ("the effect of," "the impact of") without actually pursuing a causal estimand. Descriptive and predictive questions are valuable, but they require different evaluation criteria.
  • What is the estimand? Is the paper targeting the ATE (average treatment effect for the full population), the ATT (for the treated), or the LATE (for compliers)? If the paper does not state this explicitly, that omission is a warning sign.
  • Is the question answerable with the available data and design? Some questions are important but cannot be credibly answered with the data at hand. Recognizing this distinction is a key skill.
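The estimand distinction is concrete, not pedantic. A toy potential-outcomes sketch in Python (all numbers invented) in which high-gain units select into treatment, so the ATT exceeds the ATE:

```python
# Toy potential-outcomes example (all numbers invented).
# High-gain units select into treatment, so ATT > ATE.
y0 = [10, 10, 20, 20]      # outcome if untreated
y1 = [12, 12, 28, 28]      # outcome if treated
treated = [0, 0, 1, 1]     # the high-gain units take the treatment

effects = [b - a for a, b in zip(y0, y1)]            # unit-level effects
ate = sum(effects) / len(effects)                    # averaged over everyone
att = sum(e for e, t in zip(effects, treated) if t) / sum(treated)

print(ate)  # 5.0 -- average treatment effect for the full population
print(att)  # 8.0 -- average effect for the treated only
```

Two papers reporting 5.0 and 8.0 for "the effect" are not contradicting each other; they are answering different questions.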

Step 2: Map the Identification Strategy

This section is the heart of the paper. The identification strategy is the argument for why the empirical design recovers a causal effect rather than a mere correlation. You need to answer:

  • What is the source of exogenous variation? Randomization? A natural experiment (policy change, cutoff, instrument)? Selection on observables? The taxonomy of identification strategies provides a useful map.
  • What are the identifying assumptions? Every method has assumptions. Difference-in-differences requires parallel trends. Instrumental variables requires instrument relevance and the exclusion restriction. RDD requires no manipulation of the running variable. What does this paper assume?
  • Are these assumptions testable? Some assumptions can be partially tested (e.g., pre-trends for DiD, McCrary test for RDD). Others are fundamentally untestable (e.g., the exclusion restriction for IV). Does the paper test what it can?
  • Draw the DAG. If the paper does not provide one, draw it yourself. Place the treatment, outcome, and all mentioned confounders on the graph. Draw arrows. This exercise will immediately reveal what the identification strategy controls for, what it leaves open, and where the key threats lie.
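Drawing the DAG can even be mechanized. A minimal Python sketch with an invented education-and-earnings graph (the variable names are illustrative, not from any particular paper): any variable with arrows into both treatment and outcome is a confounder the design must block or bypass.

```python
# Represent the DAG as a set of directed edges and list every variable
# with arrows into BOTH treatment and outcome -- the confounders.
# This graph is illustrative, not taken from any particular paper.
edges = {
    ("Ability", "Education"), ("Ability", "Earnings"),
    ("FamilyIncome", "Education"), ("FamilyIncome", "Earnings"),
    ("Education", "Earnings"),
    ("Distance", "Education"),   # candidate instrument: no arrow to Earnings
}

def confounders(edges, treatment, outcome):
    parents = lambda v: {a for a, b in edges if b == v}
    return sorted(parents(treatment) & parents(outcome))

print(confounders(edges, "Education", "Earnings"))
# ['Ability', 'FamilyIncome']
```

Note what the edge list makes explicit: "Distance" points into Education but not Earnings, which is exactly the (untestable) exclusion restriction an IV design would assert.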
Concept Check

A paper uses an instrumental variable (distance to the nearest college) to estimate the effect of education on earnings. What is the most important untestable assumption?

Step 3: Evaluate the Evidence

Once you understand the identification strategy, evaluate how convincingly the paper implements it. Work through this checklist:

Data quality. Are the data appropriate for the question? Is the sample large enough? Are key variables measured well, or are they noisy proxies?

Balance and pre-treatment checks. For experiments and matching designs, are treated and control groups balanced on observables? For DiD, does the event study show flat pre-trends? For RDD, is there bunching at the cutoff (McCrary test)?

Main results. Look beyond the headline coefficient. Check the standard errors — are they clustered at the right level? Look at the confidence interval — is it informatively narrow or embarrassingly wide? Check the sample size — does it change across specifications?

Robustness. Does the paper test sensitivity to alternative specifications, different samples, placebo outcomes, and violations of key assumptions? Papers that use Oster (2019) bounds or Cinelli and Hazlett (2020) sensitivity analysis earn extra credibility.

Multiple testing. If the paper examines many outcomes, are p-values adjusted for multiple comparisons? Cherry-picking the significant results from a battery of tests inflates false positive rates dramatically.
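When a paper reports many unadjusted p-values, you can apply the correction yourself. A stdlib-only sketch of the Benjamini-Hochberg false-discovery-rate adjustment (the p-values are invented):

```python
# Benjamini-Hochberg FDR adjustment in pure Python (p-values invented).
def bh_adjust(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(reversed(order)):            # largest p first
        rank = m - k                                   # 1-based rank of p_i
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.02, 0.03, 0.04, 0.50]
print([round(p, 3) for p in bh_adjust(raw)])
# [0.005, 0.05, 0.05, 0.05, 0.5]
```

Of five "significant-looking" raw p-values, only one survives at the 1 percent level after adjustment; a paper cherry-picking from such a battery would look far stronger than it is.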

Step 4: Assess External Validity

Internal validity — whether the paper correctly estimates the effect for its sample — is necessary but not sufficient. You also need to ask:

  • Who is the estimate about? A LATE from an IV applies only to compliers. An RDD estimate is local to units near the cutoff. A DiD estimate is an ATT for the treated group. None of these estimands is necessarily the ATE for the full population.
  • Would the effect generalize? Would the same treatment have the same effect in a different country, time period, or population? Is there a theoretical reason to expect the effect to be stable or to vary?
  • How was the sample selected? Convenience samples (one firm, one university, one platform) may produce perfectly valid internal estimates that tell you little about the broader population.

Step 5: Read the Tables Like a Reviewer

Do not just scan for asterisks. For each table:

  1. Read the column headers. What changes across columns? Usually controls, fixed effects, or sample definitions. Understand the progression.
  2. Track the coefficient on the treatment variable across columns. Does it move substantially when controls are added? Large movements suggest omitted variable bias in the sparser specifications.
  3. Check the standard errors. Are they robust? Clustered? At what level? The wrong clustering level can dramatically overstate precision.
  4. Compare the effect size to the dependent variable mean. A coefficient of 0.5 means very different things if the mean is 2 versus 200. Whenever the scale is not self-evident, convert the coefficient into a percentage of the mean.
  5. Watch for sample size changes. If N drops dramatically between columns, ask why. Dropping observations is often a sign of missing data that may not be random.
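Point 4 above is simple arithmetic, and worth making a reflex. A minimal sketch (numbers invented):

```python
# The same coefficient can be huge or trivial depending on the
# outcome's scale (numbers invented).
coef = 0.5
for mean in (2, 200):
    pct = 100 * coef / mean
    print(f"mean={mean}: coefficient is {pct:.2f}% of the mean")
# mean=2: coefficient is 25.00% of the mean
# mean=200: coefficient is 0.25% of the mean
```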

Step 6: Write Your One-Page Summary

After reading, fill in this template. If you cannot complete every line, you have not read carefully enough:

  • Question: What is the causal effect of ___ on ___?
  • Estimand: ATE / ATT / LATE / other
  • Identification: [Method] exploiting [source of variation]
  • Key assumption: [State it precisely]
  • Main result: [Coefficient, CI, sample size]
  • Robustness: [Which checks were run? Which passed?]
  • Biggest threat: [What could most plausibly invalidate the result?]
  • External validity: [Who does this apply to? Who does it not apply to?]
Interactive Exercise

Seminar Cold Call

You are in a research seminar and the professor calls on you. Give yourself 60 seconds to extract five things from the regression table below: the point estimate on the treatment, its statistical significance, which controls are included, the sample size, and the most plausible threat to identification.

Effect of Training Treatment on Log(Wages)

Dependent variable: Log(Wages)

Variable                 Coefficient
Training Treatment       −0.063*   (0.036)
Age                      −0.084*** (0.023)
Race dummies             0.044     (0.037)
Years of education       −0.019*   (0.010)

Controls: Age, Race dummies, Years of education, Experience, Industry dummies, Experience squared
Clustering: county
N: 34,000
R²: 0.508
*** p<0.01, ** p<0.05, * p<0.1. Standard errors in parentheses.


Why this matters: Reading regression tables quickly and accurately is a core skill for any empirical researcher. In seminars, job talks, and referee reports, you need to rapidly extract key information: the point estimate, its significance, what controls are included, the sample size, and potential threats to identification. Practice builds fluency.

Interactive Table Reading Practice

Reading regression tables is a skill that improves with practice. The three annotated tables below span different empirical methods. Work through each one cell by cell with the surrounding commentary, then return later and test yourself by interpreting it cold.

Table 1: OLS Earnings Regression (Mincer-Style)

This table illustrates a classic returns-to-education regression, progressively adding controls and adjusting standard errors. Watch how the coefficient on education changes as the specification becomes more credible.


                            (1) Baseline       (2) With Controls   (3) Robust SEs
Years of Education          0.107*** (0.004)   0.084*** (0.005)    0.084*** (0.008)
Experience                                     0.032*** (0.002)    0.032*** (0.003)
Experience Squared / 100                       −0.045*** (0.008)   −0.045*** (0.010)
Female                                         −0.241*** (0.012)   −0.241*** (0.015)
Black                                          −0.138*** (0.016)   −0.138*** (0.020)
Constant                    0.984*** (0.051)   0.587*** (0.072)    0.587*** (0.089)
Observations                49,531             49,531              49,531
R-squared                   0.091              0.274               0.274
Standard Errors             Classical          Classical           Robust (HC1)
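The drop in the education coefficient from column (1) to column (2) is classic omitted-variable bias. A self-contained Python sketch (invented numbers, not the table's data) that reproduces the pattern using the Frisch-Waugh-Lovell partialling-out logic:

```python
# Omitted-variable bias, worked with invented numbers in pure Python.
def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    var = sum((a - mx) ** 2 for a in x) / len(x)
    return cov / var

ability = [0, 1, 2, 3]
educ    = [10, 13, 14, 17]                       # correlated with ability
earn    = [0.08 * e + 0.10 * a for e, a in zip(educ, ability)]

naive = slope(educ, earn)                        # omits ability -> biased up

# Frisch-Waugh-Lovell: partial ability out of both sides, then regress.
def resid(x, z):
    b = slope(z, x)
    c = sum(x) / len(x) - b * sum(z) / len(z)
    return [xi - (c + b * zi) for xi, zi in zip(x, z)]

controlled = slope(resid(educ, ability), resid(earn, ability))

print(round(naive, 3))       # 0.124 -- biased upward
print(round(controlled, 3))  # 0.08  -- the true return built into the data
```

Because ability raises both education and earnings, leaving it out inflates the education coefficient, exactly the pattern between columns (1) and (2).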

Table 2: Difference-in-Differences with Event Study

This table presents a DiD analysis of a hypothetical minimum wage increase, showing both the standard two-period DiD estimate and event study coefficients. Pay attention to how the pre-treatment coefficients assess the parallel trends assumption.


                           (1) Simple DiD     (2) Event Study    (3) DiD + Controls
Treated × Post             −0.038** (0.015)                      −0.032** (0.014)
Pre-trend: t = −3                             0.005 (0.012)
Pre-trend: t = −2                             0.008 (0.011)
Pre-trend: t = −1                             −0.003 (0.010)
Post: t = 0                                   −0.025* (0.014)
Post: t = 1                                   −0.041** (0.016)
Post: t = 2                                   −0.048*** (0.017)
County Unemployment Rate                                         −0.012*** (0.003)
Log(Population)                                                  0.021* (0.011)
County FE                  Yes                Yes                Yes
Year FE                    Yes                Yes                Yes
Observations               3,850              3,850              3,784
R-squared                  0.842              0.844              0.851
Clusters (counties)        385                385                378
Dep. Var. Mean             2.14               2.14               2.15
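Column (1) is, at bottom, a difference of four group means. A minimal sketch of that 2x2 computation (the means are invented, chosen only to land near the table's estimate):

```python
# The 2x2 DiD behind a simple two-period design: difference of
# differences of group means (invented means, not the table's data).
means = {
    ("treated", "pre"):  2.10,
    ("treated", "post"): 2.09,
    ("control", "pre"):  2.05,
    ("control", "post"): 2.08,
}
treated_change = means[("treated", "post")] - means[("treated", "pre")]
control_change = means[("control", "post")] - means[("control", "pre")]
did = treated_change - control_change
print(round(did, 2))  # -0.04
```

The control group's change stands in for the treated group's counterfactual trend, which is why the parallel-trends coefficients in column (2) are the first thing to check.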

Table 3: Instrumental Variables (First Stage and Second Stage)

This table shows both stages of a two-stage least squares (2SLS) estimation, using distance to the nearest college as an instrument for years of education in an earnings regression. Understanding how to read first-stage and second-stage results together is essential for evaluating IV papers.


                                   (1) First Stage      (2) Reduced Form    (3) 2SLS Second Stage
Dependent Variable:                Years of Education   Log(Earnings)       Log(Earnings)
Distance to College (miles / 10)   −0.36*** (0.08)
Distance to College                                     −0.031** (0.014)
Years of Education                                                          0.086** (0.038)
Experience                         −0.03 (0.06)         0.029*** (0.003)    0.032*** (0.004)
Female                             0.18* (0.10)         −0.228*** (0.014)   −0.212*** (0.022)
Urban                              0.94*** (0.12)       0.085*** (0.018)    0.004 (0.042)
County FE                          Yes                  Yes                 Yes
Observations                       42,087               42,087              42,087
R-squared                          0.157                0.193
First-stage F-stat                 20.25                                    20.25
Kleibergen-Paap F                                                           20.25
Anderson-Rubin p-value                                                      0.028
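A quick consistency check you can run on any just-identified IV table: the 2SLS coefficient should equal the reduced form divided by the first stage (indirect least squares). Using the table's point estimates, and assuming both distance coefficients are on the same miles-per-10 scale:

```python
# Indirect least squares: 2SLS coefficient = reduced form / first stage
# (just-identified case; point estimates taken from the table above,
# assuming both distance coefficients use the same per-10-miles scaling).
first_stage  = -0.36    # effect of distance on years of education
reduced_form = -0.031   # effect of distance on log earnings
iv_estimate = reduced_form / first_stage
print(round(iv_estimate, 3))  # 0.086
```

The ratio reproduces the second-stage coefficient of 0.086, as it should; when this arithmetic does not line up in a paper, something is wrong with the reported specifications.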

Common Reading Mistakes

Confusing sophistication with credibility. A paper that uses a simple DiD with a clear natural experiment is often more credible than a paper that uses frontier ML methods on observational data with no source of exogenous variation. The credibility revolution was built on design, not complexity.

Ignoring the estimand. Two papers can study "the effect of X on Y" and produce different estimates simply because they identify different estimands (ATE vs. LATE vs. ATT). This difference is not a contradiction — they are answering different questions.

Accepting null results too readily. A non-significant coefficient does not mean the effect is zero. It means the data cannot distinguish the effect from zero at the chosen significance level. Check the confidence interval: if it includes both zero and economically large effects, the study is uninformative, not evidence of no effect.
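A quick way to operationalize this check (coefficient and standard error invented):

```python
# Read the confidence interval, not just the star (numbers invented).
coef, se = 0.02, 0.05
lo, hi = coef - 1.96 * se, coef + 1.96 * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
# 95% CI: [-0.078, 0.118]
# The interval spans zero AND economically large effects, so the study
# is uninformative; it is not evidence that the effect is zero.
```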

Forgetting that papers are advocacy. Authors want you to believe their results. They have chosen the specification, sample, and robustness checks that present their findings in the most favorable light. This selectivity is not dishonest — it is human. But it means a useful question is: What would the paper look like if the results had gone the other way?