Chapter 8 of 8
The Credibility Revolution
How empirical economics transformed itself — and what it means for your research.
The Mystery Resolved
We have been carrying a question since F1: Did the job training program actually help people earn more? Along the way, we discovered that naive comparison fails (F1), that selection bias is the fundamental enemy (F3), that we need precise language for what we are estimating (F4), that DAGs can reveal hidden threats (F5), and that we have a whole toolkit of identification strategies (F6). In F7, we learned how to work with data in practice.
Now it is time to close the loop. The story of how researchers actually answered the training question — and how the field transformed its standards for evidence along the way — is the story of the credibility revolution.
This narrative is not just intellectual history. Understanding where the field has been will help you understand where it is going, why reviewers care about the things they care about, and what makes a paper convincing today.
Act I: The Crisis of Confidence
"Let's Take the Con out of Econometrics"
In 1983, Edward Leamer published a paper with one of the most provocative titles in the history of economics. In "Let's Take the Con out of Econometrics" (American Economic Review, 1983), he laid bare a disturbing pattern: empirical researchers claimed objectivity while making dozens of subjective choices about specification, sample, and functional form, and those choices drove the results.
Leamer's central argument: if you changed the set of control variables, the functional form of the regression, or the sample slightly, you could get almost any result you wanted. And researchers, consciously or not, were choosing specifications that produced "interesting" (i.e., statistically significant) results.
"Hardly anyone takes data analysis seriously. Or perhaps more accurately, hardly anyone takes anyone else's data analysis seriously."
— Leamer (1983, p. 37)
The LaLonde Bombshell
Three years later, Robert LaLonde published a paper that turned the training question into a methodological battleground — and provided the most vivid demonstration of the problem Leamer had described.
LaLonde had access to data from the National Supported Work (NSW) Demonstration, a randomized job training program from the 1970s. Because the program was randomized, he knew the true experimental benchmark: the causal effect of training on earnings, estimated by simply comparing the randomly assigned treatment and control groups.
Then he did something clever and devastating. He threw away the experimental control group and replaced it with non-experimental comparison groups drawn from survey data (the Current Population Survey and the Panel Study of Income Dynamics). He applied the standard non-experimental methods of the day — regression, selection models, difference-in-differences — to these observational datasets.
The results were alarming. The non-experimental estimates varied wildly depending on the comparison group and the statistical method. Some estimates were close to the experimental benchmark. Many were not even in the right ballpark. Some had the wrong sign.
LaLonde's conclusion was blunt: standard econometric methods could not reliably recover causal effects from observational data. The non-experimental estimates were sensitive to specification choices in exactly the way Leamer had warned.
Act II: The Response
Dehejia and Wahba: Can We Fix This Problem?
LaLonde's paper could have been the end of the story — a pessimistic conclusion that observational methods are hopeless. Instead, it became a challenge. Rajeev Dehejia and Sadek Wahba took up that challenge in an influential 1999 paper.
Dehejia and Wahba revisited the LaLonde data using newer methods — specifically, propensity score matching and stratification. Their key insight: LaLonde's non-experimental estimates failed in part because the comparison groups were poorly chosen, with individuals who looked nothing like the trainees on observable characteristics. Propensity score methods, by explicitly matching on the probability of treatment, could select a comparison group that was more similar to the treatment group.
Their results were dramatically better than LaLonde's. Using propensity score methods with careful attention to overlap and balance, they could recover estimates close to the experimental benchmark. The message: non-experimental methods can work, but only when applied with care and with attention to the specific threats to identification (Dehejia & Wahba, 1999).
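To make the matching logic concrete, here is a minimal sketch on simulated data. The data-generating process, variable names, and numbers are illustrative assumptions, not the NSW sample or Dehejia and Wahba's exact procedure.

```python
# Propensity score matching sketch on simulated data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
educ = rng.normal(12, 2, n)        # confounder: years of schooling
prior_earn = rng.normal(15, 5, n)  # confounder: pre-program earnings ($000s)

# Selection on observables: lower prior earnings -> more likely to train
p_treat = 1 / (1 + np.exp(0.3 * prior_earn - 4))
d = rng.binomial(1, p_treat)
true_effect = 1.8
earn = 2 + 0.8 * prior_earn + 0.3 * educ + true_effect * d + rng.normal(0, 2, n)

# Naive comparison is biased (here downward: trainees start poorer)
naive = earn[d == 1].mean() - earn[d == 0].mean()

# Step 1: estimate the propensity score P(D = 1 | X)
X = np.column_stack([educ, prior_earn])
pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]

# Step 2: nearest-neighbor matching on the score (with replacement)
treated = np.where(d == 1)[0]
controls = np.where(d == 0)[0]
nearest = controls[np.abs(pscore[controls][None, :]
                          - pscore[treated][:, None]).argmin(axis=1)]
att = (earn[treated] - earn[nearest]).mean()

print(f"true {true_effect:.2f} | naive {naive:.2f} | matched ATT {att:.2f}")
```

The matched estimate recovers the truth here only because treatment depends solely on observables; with an unobserved confounder, no amount of matching would fix it.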
The Rise of Natural Experiments
While the matching debate was unfolding, another response to the credibility crisis was gaining momentum: the turn toward natural experiments.
The idea: instead of trying to statistically adjust observational data to look like an experiment (the model-based approach), find real-world situations where something like randomization actually happened (the design-based approach). This shift was driven by a generation of economists who demonstrated, through landmark studies, that credible causal evidence could come from clever exploitation of institutional features, policy changes, and historical quirks.
Joshua Angrist used Vietnam-era draft lottery numbers as an instrument to study the effect of military service on earnings. The lottery was genuinely random, so it provided exogenous variation in military service — a classic instrumental variables design (Angrist, 1990).
David Card and Alan Krueger compared employment in fast-food restaurants across the New Jersey–Pennsylvania border before and after New Jersey raised its minimum wage. This comparison is one of the most famous difference-in-differences designs in economics (Card & Krueger, 1994).
Angrist and Krueger used quarter of birth as an instrument for schooling (compulsory schooling laws interact with school-entry cutoffs, so students born in different quarters can drop out with different amounts of completed schooling), estimating the returns to education (Angrist & Krueger, 1991).
These studies, and dozens like them, demonstrated a new way of doing empirical economics: start with a source of exogenous variation, build a research design around it, and let the design — not the regression specification — carry the identification argument.
Act III: The Credibility Revolution Takes Hold
Angrist and Pischke Name the Movement
By 2010, the shift was unmistakable. Joshua Angrist and Jörn-Steffen Pischke gave it a name in their influential Journal of Economic Perspectives article: "The Credibility Revolution in Empirical Economics."
Their argument: empirical economics had undergone a transformation. The field had moved away from structural estimation with strong modeling assumptions and toward research designs that relied on transparent sources of identifying variation. The key methods of this revolution — instrumental variables, difference-in-differences, regression discontinuity — all shared a common logic: identify a credible source of exogenous variation and build the analysis around it (Angrist & Pischke, 2010).
The title was deliberate: it echoed Leamer's 1983 paper, framing the credibility revolution as the field's answer to Leamer's challenge. Where Leamer had shown that results were fragile to specification choices, the credibility revolution's core claim was that design-based methods are more robust because the source of variation — not the choice of controls or functional form — drives the results.
The 2021 Nobel Prize
The credibility revolution received its ultimate institutional validation in 2021, when David Card, Joshua Angrist, and Guido Imbens were awarded the Nobel Prize in Economics. Card was recognized for his empirical contributions to labor economics (including the minimum wage study), while Angrist and Imbens were recognized for their methodological contributions to the analysis of causal relationships — particularly the LATE framework and the formalization of how natural experiments can be used for causal inference.
Act IV: The Transparency Movement
The credibility revolution did not stop at better research designs. A second wave of reforms — the transparency movement — addressed a complementary problem: even with good designs, researchers still had enough "degrees of freedom" in their analysis to (consciously or unconsciously) find the results they wanted.
Pre-Registration
The idea is simple: before you look at the data, write down your hypotheses, your empirical specification, your sample, and your statistical tests. Post this plan in a public registry. Then analyze the data. This pre-commitment eliminates (or at least constrains) the temptation to fish for significant results.
The American Economic Association launched its RCT Registry in 2013, and pre-registration has become increasingly common — and increasingly expected — in top journals.
Replication
If a result is real, other researchers should be able to reproduce it. The replication movement has pushed journals to require data and code availability (replication packages) and has funded systematic replication efforts.
Specification Curves and Robustness
Rather than reporting one "preferred" specification, researchers increasingly report the distribution of estimates across all reasonable specifications. Tools like specification curve analysis (Simonsohn et al., 2020) and multiverse analysis formalize this practice.
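A minimal sketch of the idea, on simulated data with an illustrative set of candidate controls: estimate the treatment coefficient under every subset of controls and inspect the whole distribution rather than a single preferred estimate.

```python
# Specification curve sketch: one estimate per control set (illustrative).
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
controls = {name: rng.normal(size=n)
            for name in ["age", "educ", "tenure", "urban"]}
# Treatment is confounded by educ, so specifications omitting it are biased
d = rng.binomial(1, 1 / (1 + np.exp(-controls["educ"])))
y = 1.0 * d + 0.5 * controls["educ"] + rng.normal(size=n)

names = list(controls)
results = []
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        X = sm.add_constant(np.column_stack(
            [d] + [controls[c] for c in subset]))
        beta = sm.OLS(y, X).fit().params[1]  # coefficient on treatment
        results.append((beta, subset))

for beta, subset in sorted(results):         # the "specification curve"
    print(f"{beta:6.3f}  controls: {', '.join(subset) or 'none'}")
```

Sorting the 16 estimates makes the pattern visible at a glance: every specification that includes educ clusters near the true effect of 1.0, and every one that omits it is inflated.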
Sensitivity Analysis
How much unobserved confounding would it take to overturn your result? Methods like the Oster (2019) test and the Rosenbaum bounds framework give formal answers to this question. These sensitivity checks are now expected in many top journals.
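For intuition, here is a sketch of the coefficient-movement calculation behind Oster's approach, using the widely cited approximation beta* = beta_c − delta · (beta_u − beta_c) · (Rmax − R2_c) / (R2_c − R2_u). The numbers below are hypothetical; real applications should use a vetted implementation (for example, Stata's psacalc).

```python
# Sketch of the bias-adjusted coefficient from Oster (2019), using the
# widely cited approximation. Inputs: beta_u, r2_u from the regression
# without controls; beta_c, r2_c with controls; r2_max bounds the
# R-squared a regression including unobservables could reach.
def oster_beta_star(beta_u, r2_u, beta_c, r2_c, r2_max, delta=1.0):
    """Adjusted effect if selection on unobservables is `delta` times as
    strong as selection on observables."""
    return beta_c - delta * (beta_u - beta_c) * (r2_max - r2_c) / (r2_c - r2_u)

# Hypothetical numbers: adding controls moves the coefficient from 2.0 to
# 1.6 and R-squared from 0.10 to 0.30; take r2_max = 1.3 * 0.30 = 0.39
# (Oster's suggested heuristic).
print(oster_beta_star(beta_u=2.0, r2_u=0.10, beta_c=1.6, r2_c=0.30,
                      r2_max=0.39))  # -> 1.42: still positive at delta = 1
```

The logic: if adding observable controls barely moves the coefficient while substantially raising R-squared, it would take implausibly strong selection on unobservables to drive the effect to zero.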
Pre-Revolution vs. Post-Revolution: A Comparison
What does the credibility revolution look like in practice? Here is a stylized comparison of how the same research question might be approached before and after.
Question: Does job training increase earnings?
Pre-Revolution Approach (circa 1980s)
- Collect survey data on trainees and non-trainees.
- Run an OLS regression of earnings on a training dummy plus a set of controls (education, age, etc.).
- Report the coefficient on training as "the effect of training."
- If the coefficient is not significant, try different control variables, different samples, or different functional forms until something works.
- No discussion of the source of identifying variation. No sensitivity analysis. No pre-registered hypothesis.
Post-Revolution Approach (circa 2020s)
- Identify a source of exogenous variation. Perhaps the training program was randomly assigned (experiment), or it was rolled out in stages across regions (DiD), or eligibility depended on a test score cutoff (RDD).
- State the identifying assumptions explicitly. For DiD: parallel trends. For RDD: continuity at the cutoff. Draw the DAG.
- Pre-register the analysis plan (if prospective).
- Implement the design-appropriate estimator with proper standard errors.
- Show diagnostics: pre-trends plot (DiD), density test (RDD), first-stage F-statistic (IV). A code sketch of these two steps follows this list.
- Report sensitivity analysis: How robust is the result to violations of the key assumption?
- Show specification curves: How stable is the estimate across reasonable alternative specifications?
- Provide a replication package with data and code.
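To make the estimation and diagnostic steps concrete, here is a sketch of a simple (non-staggered) DiD on simulated panel data. Every variable name, date, and magnitude below is an illustrative assumption, not a prescription.

```python
# DiD estimation + diagnostics sketch on simulated panel data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame([(r, t) for r in range(20) for t in range(2010, 2018)],
                  columns=["region", "year"])
df["treated_region"] = (df["region"] < 10).astype(int)  # 10 regions adopt
df["post"] = (df["year"] >= 2014).astype(int)           # rollout in 2014
df["earnings"] = (20 + 0.5 * (df["year"] - 2010)        # common trend
                  + 2.0 * df["treated_region"]          # level difference
                  + 1.5 * df["treated_region"] * df["post"]  # true effect 1.5
                  + rng.normal(0, 1, len(df)))

# Design-appropriate estimator: two-way fixed effects with the DiD
# interaction, standard errors clustered at the level of treatment.
m = smf.ols("earnings ~ treated_region:post + C(region) + C(year)",
            data=df).fit(cov_type="cluster",
                         cov_kwds={"groups": df["region"]})
print(m.params["treated_region:post"], m.bse["treated_region:post"])

# Diagnostic: inspect pre-period group means; roughly parallel movement
# supports (but can never prove) the parallel-trends assumption.
pre = df[df["post"] == 0]
print(pre.groupby(["year", "treated_region"])["earnings"].mean().unstack())
```

Note the design choice: the identification argument lives in the interaction term and the clustering level, not in which controls happen to be included.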
The LaLonde Challenge
The credibility revolution in one simulation. See how the same treatment effect can be recovered or completely missed depending on your identification strategy. The experiment gives you the truth; the naive comparison adds selection bias; a well-chosen natural experiment (DiD) can recover the truth even from observational data.
Try the following experiments (a code sketch that reproduces them follows this list):
- Set selection bias to zero. All three methods recover the true effect — when there is no selection, even naive comparison works.
- Increase selection bias to $3,500. The naive estimate is dramatically inflated, but the experiment still gives you the truth. The DiD estimate depends on whether parallel trends hold.
- Add a parallel trends violation. Now even DiD is biased — the "natural experiment" is not as clean as it seemed. The lesson: every method has assumptions, and violating those assumptions undermines the estimate.
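If you do not have the interactive tool in front of you, the following sketch reproduces the three experiments in code. The dollar figures and the data-generating process are illustrative assumptions, not the tool's actual implementation.

```python
# The LaLonde Challenge in miniature: one true effect, three estimators.
import numpy as np

def run(selection_bias=0.0, trend_violation=0.0, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    true_effect = 1800.0                 # assumed true training effect ($)

    # Observational world: people who train would have earned more anyway,
    # by `selection_bias` dollars (the part the naive comparison misses).
    d = rng.binomial(1, 0.5, n)
    base = 15000 + selection_bias * d
    y_pre = base + rng.normal(0, 1000, n)
    y_post = (base + 1000                     # common time trend
              + trend_violation * d           # parallel-trends violation
              + true_effect * d
              + rng.normal(0, 1000, n))

    naive = y_post[d == 1].mean() - y_post[d == 0].mean()
    did = ((y_post[d == 1].mean() - y_pre[d == 1].mean())
           - (y_post[d == 0].mean() - y_pre[d == 0].mean()))

    # Experimental world: training assigned by coin flip, baselines match.
    d_exp = rng.binomial(1, 0.5, n)
    y_exp = 15000 + true_effect * d_exp + rng.normal(0, 1000, n)
    experiment = y_exp[d_exp == 1].mean() - y_exp[d_exp == 0].mean()

    print(f"experiment {experiment:7.0f} | naive {naive:7.0f} | "
          f"DiD {did:7.0f}")

run(selection_bias=0)                          # all three recover ~1800
run(selection_bias=3500)                       # naive ~5300; DiD still ~1800
run(selection_bias=3500, trend_violation=800)  # DiD now off by ~800
```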
Check your understanding: A researcher in 1985 runs an OLS regression of earnings on a training dummy with controls for age and education, finds a positive, significant coefficient, and concludes that 'training increases earnings.' From the perspective of the credibility revolution, what is the fundamental problem with this conclusion?
Resolving the Training Mystery
We can now tell the complete story of our running example:
The question: Does job training increase earnings?
LaLonde (1986) showed that the answer depends entirely on your method. Using the NSW experimental data, the true effect was clear. But when he applied standard non-experimental methods to observational data, the estimates were all over the map. This divergence was the crisis.
Dehejia and Wahba (1999) showed that propensity score methods could do better — recovering estimates closer to the experimental benchmark — but only with careful attention to overlap, balance, and comparison group selection.
The credibility revolution's lesson: The reason LaLonde's non-experimental estimates failed was not that econometrics is hopeless. It was that the researchers relied on model-based adjustments without a credible source of exogenous variation. When you have a genuine natural experiment (a lottery, a policy cutoff, a staggered rollout), design-based methods can recover credible causal effects. When you only have observational data, model-based methods can work — but you need to be explicit about assumptions, transparent about specification choices, and honest about what could go wrong.
Where the methods come in: Every method you will learn on this site is a tool for exploiting a specific type of variation:
- Randomized experiments exploit researcher-controlled randomization (the NSW study itself)
- Difference-in-differences exploits differential timing of a policy (regions adopting training programs at different times)
- Regression discontinuity exploits eligibility cutoffs (test score thresholds for program entry)
- Instrumental variables exploit exogenous shifters (lottery-based assignment to training slots)
- Matching and propensity scores exploit rich observable data (the Dehejia-Wahba approach)
The training mystery is not just a pedagogical device. It is a microcosm of the central challenge of empirical social science: separating causation from correlation. Every paper you write, every paper you read, every paper you referee will grapple with this challenge. The tools you will learn on this site are how the field has learned to do it credibly.
Where the Revolution Stands Today
The credibility revolution is not complete. Several active debates are shaping its next phase:
1. External validity. Design-based methods give credible estimates of local effects (for the specific population, place, and time studied). But policymakers need to know whether results generalize. The field is developing tools for transportability and extrapolation.
2. Heterogeneous treatment effects. The average treatment effect may mask enormous variation: training might help some people enormously and do nothing for others. Machine learning methods for causal inference (double/debiased machine learning, causal forests) are tackling this frontier.
3. Mechanisms. Design-based methods are good at answering "does X cause Y?" but less good at "how does X cause Y?" Causal mediation analysis is an active area of methodological development.
4. Reproducibility. Despite replication packages, many published results are difficult to reproduce. Computational reproducibility (can you run the code and get the same numbers?) remains surprisingly challenging.
5. Publication bias. Pre-registration helps, but the incentive to find significant results remains strong. Registered reports — where journals accept papers before results are known — are one response.
Key Takeaways
- Leamer (1983) showed that empirical results were fragile to subjective specification choices; LaLonde (1986) showed that standard non-experimental methods could not reliably recover a known experimental benchmark.
- The field responded in two ways: more careful model-based adjustment (propensity score methods) and, more influentially, design-based natural experiments (IV, DiD, RDD).
- The credibility revolution's core claim: let the source of exogenous variation, not the regression specification, carry the identification argument.
- The transparency movement (pre-registration, replication packages, specification curves, sensitivity analysis) constrains researcher degrees of freedom even within good designs.
- Every method exploits a specific type of variation; knowing which variation you have tells you which tool to reach for.
What Comes Next
You have completed the Foundations sequence. You understand why causal inference matters, what threatens it, how to think about it visually and formally, what tools are available, how to work with data, and how the field learned to demand credibility.
Now it is time to learn the methods themselves. We recommend starting with OLS — the building block for nearly everything else — and then following one of the Learning Paths based on your interests and your research needs.
The training mystery will continue to appear throughout the site. Every method page shows how that method could be applied to questions like ours. By the time you have worked through several methods, you will have deep intuition not just for what each tool does, but for when to reach for it and why to trust (or distrust) its results.
Welcome to the credibility revolution. Now let us do some research.