Chapter 8 of 8
The Credibility Revolution
How empirical economics transformed itself — and what it means for your research.
The Mystery Resolved
We have been carrying a question since F1: Did the job training program actually help people earn more? Along the way, we discovered that naive comparison fails (F1), that selection bias is the fundamental enemy (F3), that we need precise language for what we are estimating (F4), that DAGs can reveal hidden threats (F5), and that we have a whole toolkit of identification strategies (F6). In F7, we learned how to work with data in practice.
Now it is time to close the loop. The story of how researchers actually answered the training question — and how the field transformed its standards for evidence along the way — is the story of the credibility revolution.
This narrative is not just intellectual history. Understanding where the field has been will help you understand where it is going, why reviewers care about the things they care about, and what makes a paper convincing today.
Act I: The Crisis of Confidence
"Let's Take the Con out of Econometrics"
In 1983, Edward Leamer published a paper with one of the most provocative titles in the history of economics. In "Let's Take the Con out of Econometrics" (American Economic Review, 1983), he laid bare a disturbing pattern: empirical researchers claimed objectivity while making dozens of subjective choices about specification, sample, and functional form, and those choices drove the results.
Leamer's central argument: if you changed the set of control variables, the functional form of the regression, or the sample slightly, you could get almost any result you wanted. And researchers, consciously or not, were choosing specifications that produced "interesting" (i.e., statistically significant) results.
"Hardly anyone takes data analysis seriously. Or perhaps more accurately, hardly anyone takes anyone else's data analysis seriously."
— Leamer (1983, p. 37)
The LaLonde Bombshell
Three years later, Robert LaLonde published a paper that turned the training question into a methodological battleground — and provided the most vivid demonstration of the problem Leamer had described.
LaLonde had access to data from the National Supported Work (NSW) Demonstration, a randomized job training program from the 1970s. Because the program was randomized, he knew the true experimental benchmark: the causal effect of training on earnings, estimated by simply comparing the randomly assigned treatment and control groups.
Then he did something clever and devastating. He threw away the experimental control group and replaced it with non-experimental comparison groups drawn from survey data (the Current Population Survey and the Panel Study of Income Dynamics). He applied the standard non-experimental methods of the day — regression, selection models, difference-in-differences — to these observational datasets.
The results were alarming. The non-experimental estimates varied wildly depending on the comparison group and the statistical method. Some estimates were close to the experimental benchmark. Many were not even in the right ballpark. Some had the wrong sign.
LaLonde's conclusion was blunt: standard econometric methods could not reliably recover causal effects from observational data. The non-experimental estimates were sensitive to specification choices in exactly the way Leamer had warned.
Act II: The Response
Dehejia and Wahba: Can We Fix This Problem?
LaLonde's paper could have been the end of the story — a pessimistic conclusion that observational methods are hopeless. Instead, it became a challenge. Rajeev Dehejia and Sadek Wahba took up that challenge in an influential 1999 paper.
Dehejia and Wahba revisited the LaLonde data using newer methods — specifically, propensity score matching and stratification. Their key insight: LaLonde's non-experimental estimates failed in part because the comparison groups were poorly chosen, with individuals who looked nothing like the trainees on observable characteristics. Propensity score methods, by explicitly matching on the probability of treatment, could select a comparison group that was more similar to the treatment group.
Their results were dramatically better than LaLonde's. Using propensity score methods with careful attention to overlap and balance, they could recover estimates close to the experimental benchmark. The message: non-experimental methods can work, but only when applied with care and with attention to the specific threats to identification (Dehejia & Wahba, 1999).
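To make the matching logic concrete, here is a minimal sketch on simulated data. The data-generating process, variable names, and numbers are illustrative assumptions, not the NSW sample or Dehejia and Wahba's exact procedure.

```python
# Propensity score matching sketch on simulated data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
educ = rng.normal(12, 2, n)        # confounder: years of schooling
prior_earn = rng.normal(15, 5, n)  # confounder: pre-program earnings ($000s)

# Selection on observables: lower prior earnings -> more likely to train
p_treat = 1 / (1 + np.exp(0.3 * prior_earn - 4))
d = rng.binomial(1, p_treat)
true_effect = 1.8
earn = 2 + 0.8 * prior_earn + 0.3 * educ + true_effect * d + rng.normal(0, 2, n)

# Naive comparison is biased (here downward: trainees start poorer)
naive = earn[d == 1].mean() - earn[d == 0].mean()

# Step 1: estimate the propensity score P(D = 1 | X)
X = np.column_stack([educ, prior_earn])
pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]

# Step 2: nearest-neighbor matching on the score (with replacement)
treated = np.where(d == 1)[0]
controls = np.where(d == 0)[0]
nearest = controls[np.abs(pscore[controls][None, :]
                          - pscore[treated][:, None]).argmin(axis=1)]
att = (earn[treated] - earn[nearest]).mean()

print(f"true {true_effect:.2f} | naive {naive:.2f} | matched ATT {att:.2f}")
```

The matched estimate recovers the truth here only because treatment depends solely on observables; with an unobserved confounder, no amount of matching would fix it.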
The Rise of Natural Experiments
While the matching debate was unfolding, another response to the credibility crisis was gaining momentum: the turn toward natural experiments.
The idea: instead of trying to statistically adjust observational data to look like an experiment (the model-based approach), find real-world situations where something like randomization actually happened (the design-based approach). This shift was driven by a generation of economists who demonstrated, through landmark studies, that credible causal evidence could come from clever exploitation of institutional features, policy changes, and historical quirks.
Joshua Angrist used Vietnam-era draft lottery numbers as an instrument to study the effect of military service on earnings. The lottery was genuinely random, so it provided exogenous variation in military service — a classic instrumental variables design (Angrist, 1990).
David Card and Alan Krueger compared employment in fast-food restaurants across the New Jersey–Pennsylvania border before and after New Jersey raised its minimum wage. This comparison is one of the most famous difference-in-differences designs in economics (Card & Krueger, 1994).
Angrist and Krueger used quarter of birth as an instrument for schooling (compulsory schooling laws interact with school-entry cutoffs, so students born in different quarters can drop out with different amounts of completed schooling), estimating the returns to education (Angrist & Krueger, 1991).
These studies, and dozens like them, demonstrated a new way of doing empirical economics: start with a source of exogenous variation, build a research design around it, and let the design — not the regression specification — carry the identification argument.
Act III: The Credibility Revolution Takes Hold
Angrist and Pischke Name the Movement
By 2010, the shift was unmistakable. Joshua Angrist and Jörn-Steffen Pischke gave it a name in their influential Journal of Economic Perspectives article: "The Credibility Revolution in Empirical Economics."
Their argument: empirical economics had undergone a transformation. The field had moved away from structural estimation with strong modeling assumptions and toward research designs that relied on transparent sources of identifying variation. The key methods of this revolution — instrumental variables, difference-in-differences, regression discontinuity — all shared a common logic: identify a credible source of exogenous variation and build the analysis around it (Angrist & Pischke, 2010).
The title was deliberate: it echoed Leamer's 1983 paper, framing the credibility revolution as the field's answer to Leamer's challenge. Where Leamer had shown that results were fragile to specification choices, the credibility revolution's core claim was that design-based methods are more robust because the source of variation — not the choice of controls or functional form — drives the results.
The 2021 Nobel Prize
The credibility revolution received its ultimate institutional validation in 2021, when David Card, Joshua Angrist, and Guido Imbens were awarded the Nobel Prize in Economics. Card was recognized for his empirical contributions to labor economics (including the minimum wage study), while Angrist and Imbens were recognized for their methodological contributions to the analysis of causal relationships — particularly the LATE framework and the formalization of how natural experiments can be used for causal inference.
Act IV: The Transparency Movement
The credibility revolution did not stop at better research designs. A second wave of reforms — the transparency movement — addressed a complementary problem: even with good designs, researchers still had enough "degrees of freedom" in their analysis to (consciously or unconsciously) find the results they wanted.
Pre-Registration
The idea is simple: before you look at the data, write down your hypotheses, your empirical specification, your sample, and your statistical tests. Post this plan in a public registry. Then analyze the data. This pre-commitment eliminates (or at least constrains) the temptation to fish for significant results.
The American Economic Association launched its RCT Registry in 2013, and pre-registration has become increasingly common — and increasingly expected — in top journals.
Replication
If a result is real, other researchers should be able to reproduce it. The replication movement has pushed journals to require data and code availability (replication packages) and has funded systematic replication efforts.
Specification Curves and Robustness
Rather than reporting one "preferred" specification, researchers increasingly report the distribution of estimates across all reasonable specifications. Tools like specification curve analysis (Simonsohn et al., 2020) and multiverse analysis formalize this practice.
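A minimal sketch of the idea, on simulated data with an illustrative set of candidate controls: estimate the treatment coefficient under every subset of controls and inspect the whole distribution rather than a single preferred estimate.

```python
# Specification curve sketch: one estimate per control set (illustrative).
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
controls = {name: rng.normal(size=n)
            for name in ["age", "educ", "tenure", "urban"]}
# Treatment is confounded by educ, so specifications omitting it are biased
d = rng.binomial(1, 1 / (1 + np.exp(-controls["educ"])))
y = 1.0 * d + 0.5 * controls["educ"] + rng.normal(size=n)

names = list(controls)
results = []
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        X = sm.add_constant(np.column_stack(
            [d] + [controls[c] for c in subset]))
        beta = sm.OLS(y, X).fit().params[1]  # coefficient on treatment
        results.append((beta, subset))

for beta, subset in sorted(results):         # the "specification curve"
    print(f"{beta:6.3f}  controls: {', '.join(subset) or 'none'}")
```

Sorting the 16 estimates makes the pattern visible at a glance: every specification that includes educ clusters near the true effect of 1.0, and every one that omits it is inflated.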
Sensitivity Analysis
How much unobserved confounding would it take to overturn your result? Methods like the Oster (2019) test and the Rosenbaum bounds framework give formal answers to this question. These sensitivity checks are now expected in many top journals.
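For intuition, here is a sketch of the coefficient-movement calculation behind Oster's approach, using the widely cited approximation beta* = beta_c − delta · (beta_u − beta_c) · (Rmax − R2_c) / (R2_c − R2_u). The numbers below are hypothetical; real applications should use a vetted implementation (for example, Stata's psacalc).

```python
# Sketch of the bias-adjusted coefficient from Oster (2019), using the
# widely cited approximation. Inputs: beta_u, r2_u from the regression
# without controls; beta_c, r2_c with controls; r2_max bounds the
# R-squared a regression including unobservables could reach.
def oster_beta_star(beta_u, r2_u, beta_c, r2_c, r2_max, delta=1.0):
    """Adjusted effect if selection on unobservables is `delta` times as
    strong as selection on observables."""
    return beta_c - delta * (beta_u - beta_c) * (r2_max - r2_c) / (r2_c - r2_u)

# Hypothetical numbers: adding controls moves the coefficient from 2.0 to
# 1.6 and R-squared from 0.10 to 0.30; take r2_max = 1.3 * 0.30 = 0.39
# (Oster's suggested heuristic).
print(oster_beta_star(beta_u=2.0, r2_u=0.10, beta_c=1.6, r2_c=0.30,
                      r2_max=0.39))  # -> 1.42: still positive at delta = 1
```

The logic: if adding observable controls barely moves the coefficient while substantially raising R-squared, it would take implausibly strong selection on unobservables to drive the effect to zero.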
Pre-Revolution vs. Post-Revolution: A Comparison
What does the credibility revolution look like in practice? Here is a stylized comparison of how the same research question might be approached before and after.
Question: Does job training increase earnings?
Pre-Revolution Approach (circa 1980s)
- Collect survey data on trainees and non-trainees.
- Run an OLS regression of earnings on a training dummy plus a set of controls (education, age, etc.).
- Report the coefficient on training as "the effect of training."
- If the coefficient is not significant, try different control variables, different samples, or different functional forms until something works.
- No discussion of the source of identifying variation. No sensitivity analysis. No pre-registered hypothesis.
Post-Revolution Approach (circa 2020s)
- Identify a source of exogenous variation. Perhaps the training program was randomly assigned (experiment), or it was rolled out in stages across regions (DiD), or eligibility depended on a test score cutoff (RDD).
- State the identifying assumptions explicitly. For DiD: parallel trends. For RDD: continuity at the cutoff. Draw the DAG.
- Pre-register the analysis plan (if prospective).
- Implement the design-appropriate estimator with proper standard errors.
- Show diagnostics: pre-trends plot (DiD), density test (RDD), first-stage F-statistic (IV). A code sketch of these two steps follows this list.
- Report sensitivity analysis: How robust is the result to violations of the key assumption?
- Show specification curves: How stable is the estimate across reasonable alternative specifications?
- Provide a replication package with data and code.
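To make the estimation and diagnostic steps concrete, here is a sketch of a simple (non-staggered) DiD on simulated panel data. Every variable name, date, and magnitude below is an illustrative assumption, not a prescription.

```python
# DiD estimation + diagnostics sketch on simulated panel data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame([(r, t) for r in range(20) for t in range(2010, 2018)],
                  columns=["region", "year"])
df["treated_region"] = (df["region"] < 10).astype(int)  # 10 regions adopt
df["post"] = (df["year"] >= 2014).astype(int)           # rollout in 2014
df["earnings"] = (20 + 0.5 * (df["year"] - 2010)        # common trend
                  + 2.0 * df["treated_region"]          # level difference
                  + 1.5 * df["treated_region"] * df["post"]  # true effect 1.5
                  + rng.normal(0, 1, len(df)))

# Design-appropriate estimator: two-way fixed effects with the DiD
# interaction, standard errors clustered at the level of treatment.
m = smf.ols("earnings ~ treated_region:post + C(region) + C(year)",
            data=df).fit(cov_type="cluster",
                         cov_kwds={"groups": df["region"]})
print(m.params["treated_region:post"], m.bse["treated_region:post"])

# Diagnostic: inspect pre-period group means; roughly parallel movement
# supports (but can never prove) the parallel-trends assumption.
pre = df[df["post"] == 0]
print(pre.groupby(["year", "treated_region"])["earnings"].mean().unstack())
```

Note the design choice: the identification argument lives in the interaction term and the clustering level, not in which controls happen to be included.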
The LaLonde Challenge
The credibility revolution in one simulation. See how the same treatment effect can be recovered or completely missed depending on your identification strategy. The experiment gives you the truth; the naive comparison adds selection bias; a well-chosen natural experiment (DiD) can recover the truth even from observational data.
Try the following experiments (a code sketch that reproduces them follows this list):
- Set selection bias to zero. All three methods recover the true effect — when there is no selection, even naive comparison works.
- Increase selection bias to $3,500. The naive estimate is dramatically inflated, but the experiment still gives you the truth. The DiD estimate depends on whether parallel trends hold.
- Add a parallel trends violation. Now even DiD is biased — the "natural experiment" is not as clean as it seemed. The lesson: every method has assumptions, and violating those assumptions undermines the estimate.
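If you do not have the interactive tool in front of you, the following sketch reproduces the three experiments in code. The dollar figures and the data-generating process are illustrative assumptions, not the tool's actual implementation.

```python
# The LaLonde Challenge in miniature: one true effect, three estimators.
import numpy as np

def run(selection_bias=0.0, trend_violation=0.0, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    true_effect = 1800.0                 # assumed true training effect ($)

    # Observational world: people who train would have earned more anyway,
    # by `selection_bias` dollars (the part the naive comparison misses).
    d = rng.binomial(1, 0.5, n)
    base = 15000 + selection_bias * d
    y_pre = base + rng.normal(0, 1000, n)
    y_post = (base + 1000                     # common time trend
              + trend_violation * d           # parallel-trends violation
              + true_effect * d
              + rng.normal(0, 1000, n))

    naive = y_post[d == 1].mean() - y_post[d == 0].mean()
    did = ((y_post[d == 1].mean() - y_pre[d == 1].mean())
           - (y_post[d == 0].mean() - y_pre[d == 0].mean()))

    # Experimental world: training assigned by coin flip, baselines match.
    d_exp = rng.binomial(1, 0.5, n)
    y_exp = 15000 + true_effect * d_exp + rng.normal(0, 1000, n)
    experiment = y_exp[d_exp == 1].mean() - y_exp[d_exp == 0].mean()

    print(f"experiment {experiment:7.0f} | naive {naive:7.0f} | "
          f"DiD {did:7.0f}")

run(selection_bias=0)                          # all three recover ~1800
run(selection_bias=3500)                       # naive ~5300; DiD still ~1800
run(selection_bias=3500, trend_violation=800)  # DiD now off by ~800
```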
Check your understanding: A researcher in 1985 runs an OLS regression of earnings on a training dummy with controls for age and education, finds a positive, significant coefficient, and concludes that 'training increases earnings.' From the perspective of the credibility revolution, what is the fundamental problem with this conclusion?
Resolving the Training Mystery
We can now tell the complete story of our running example:
The question: Does job training increase earnings?
LaLonde (1986) showed that the answer depends entirely on your method. Using the NSW experimental data, the true effect was clear. But when he applied standard non-experimental methods to observational data, the estimates were all over the map. This divergence was the crisis.
Dehejia and Wahba (1999) showed that propensity score methods could do better — recovering estimates closer to the experimental benchmark — but only with careful attention to overlap, balance, and comparison group selection.
The credibility revolution's lesson: The reason LaLonde's non-experimental estimates failed was not that econometrics is hopeless. It was that the researchers relied on model-based adjustments without a credible source of exogenous variation. When you have a genuine natural experiment (a lottery, a policy cutoff, a staggered rollout), design-based methods can recover credible causal effects. When you only have observational data, model-based methods can work — but you need to be explicit about assumptions, transparent about specification choices, and honest about what could go wrong.
Where the methods come in: Every method you will learn on this site is a tool for exploiting a specific type of variation:
- Randomized experiments exploit researcher-controlled randomization (the NSW study itself)
- Difference-in-differences exploits differential timing of a policy (regions adopting training programs at different times)
- Regression discontinuity exploits eligibility cutoffs (test score thresholds for program entry)
- Instrumental variables exploit exogenous shifters (lottery-based assignment to training slots)
- Matching and propensity scores exploit rich observable data (the Dehejia-Wahba approach)
The training mystery is not just a pedagogical device. It is a microcosm of the central challenge of empirical social science: separating causation from correlation. Every paper you write, every paper you read, every paper you referee will grapple with this challenge. The tools you will learn on this site are how the field has learned to do it credibly.
Where the Revolution Stands Today
The credibility revolution is not complete. Several active debates are shaping its next phase:
1. External validity. Design-based methods give credible estimates of local effects (for the specific population, place, and time studied). But policymakers need to know whether results generalize. The field is developing tools for transportability and extrapolation.
2. Heterogeneous treatment effects. The average treatment effect may mask enormous variation: training might help some people enormously and do nothing for others. Machine learning methods for causal inference (double/debiased machine learning, causal forests) are tackling this frontier.
3. Mechanisms. Design-based methods are good at answering "does X cause Y?" but less good at "how does X cause Y?" Causal mediation analysis is an active area of methodological development.
4. Reproducibility. Despite replication packages, many published results are difficult to reproduce. Computational reproducibility (can you run the code and get the same numbers?) remains surprisingly challenging.
5. Publication bias. Pre-registration helps, but the incentive to find significant results remains strong. Registered reports — where journals accept papers before results are known — are one response.
Key Takeaways
- Leamer (1983) showed that empirical results were fragile to subjective specification choices; LaLonde (1986) showed that standard non-experimental methods could not reliably recover a known experimental benchmark.
- The field responded in two ways: more careful model-based adjustment (propensity score methods) and, more influentially, design-based natural experiments (IV, DiD, RDD).
- The credibility revolution's core claim: let the source of exogenous variation, not the regression specification, carry the identification argument.
- The transparency movement (pre-registration, replication packages, specification curves, sensitivity analysis) constrains researcher degrees of freedom even within good designs.
- Every method exploits a specific type of variation; knowing which variation you have tells you which tool to reach for.
What Comes Next
You have completed the Foundations sequence. You understand why causal inference matters, what threatens it, how to think about it visually and formally, what tools are available, how to work with data, and how the field learned to demand credibility.
Now it is time to learn the methods themselves. We recommend starting with OLS — the building block for nearly everything else — and then following one of the Learning Paths based on your interests and your research needs.
The training mystery will continue to appear throughout the site. Every method page shows how that method could be applied to questions like ours. By the time you have worked through several methods, you will have deep intuition not just for what each tool does, but for when to reach for it and why to trust (or distrust) its results.
Welcome to the credibility revolution. Now let us do some research.