Pre-Analysis Plans & Pre-Registration
Commit to your analysis before seeing the results — the antidote to the garden of forking paths.
When to Use Pre-Registration
Pre-register any time you have researcher degrees of freedom — choices about outcomes, specifications, sample restrictions, or subgroups — that could influence results. This includes: randomized controlled trials (register before data collection), quasi-experimental studies where the data source and policy change are known (register before analysis), studies with multiple outcomes or subgroup analyses, and any project where you want to credibly distinguish confirmatory from exploratory findings. Registration is most valuable before data collection, but registering before analysis still provides meaningful constraint.
Why It Matters
Pre-registration separates confirmatory from exploratory analysis. Without it, readers cannot distinguish genuine findings from results that emerged through specification searching. A time-stamped analysis plan demonstrates that your results were not reverse-engineered from the data, which is why top journals in economics and political science now routinely require or reward pre-registration for experimental studies.
Binding Your Own Hands
Here is a confession that no one in academia enjoys making: given a dataset, a smart researcher can almost always find something that looks significant. Not through fraud. Not through conscious manipulation. Just through the ordinary, human process of exploring data, trying different specifications, and — without realizing it — gravitating toward the results that tell the most interesting story.
This temptation is not a moral failing. It is a structural problem with how hypothesis testing works. The p-value only means what it claims to mean (the probability of seeing data this extreme under the null) if the analysis was specified before looking at the data. The moment you let the data influence your analytic choices — which variables to include, how to define the sample, which outcomes to emphasize — the p-value loses its interpretation.
Pre-analysis plans are the solution. They are documents, written and time-stamped before you analyze your data, that specify exactly what you plan to do. They bind your hands against inadvertent data mining and give your results a credibility that post-hoc analyses cannot match.
If this idea sounds constraining, it should. The constraint is the point.
The Garden of Forking Paths
Andrew Gelman and Eric Loken coined a vivid metaphor for this problem: the garden of forking paths.
At every stage of analysis, you face choices:
- How do you define the treatment? (Binary? Continuous? Dosage?)
- Which control variables do you include?
- How do you handle outliers? (Winsorize? Trim? Log-transform?)
- Which sample restrictions do you impose? (Age cutoffs? Time windows?)
- Which outcomes do you emphasize?
- Which subgroups do you examine?
- How do you cluster standard errors?
Each choice is a fork. Each fork leads to a different result. If you make these choices after seeing the data — even subconsciously — you are walking a path through the garden that the data itself has selected. The resulting p-value no longer reflects the probability you think it does.
The forking paths problem is distinct from p-hacking. P-hacking implies intent. The forking paths problem arises even with the most careful intentions, because researchers naturally gravitate toward specifications that "make sense" — and "makes sense" is often a synonym for "gives interesting results."
The numbers are stark even without deliberate p-hacking. With 81 candidate specifications (for example, three choices at each of four forks), the probability that at least one yields p < 0.05 under the null is 0.98.
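This kind of inflation is easy to check yourself. The following Python sketch (illustrative; it treats the candidate specifications as independent tests of a true null, the worst case) compares the analytic benchmark against a quick simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_any_significant(n_specs, alpha=0.05, n_sims=2000):
    # Under the null, each test's p-value is uniform on [0, 1].
    # Worst case: treat the candidate specifications as independent tests
    # and ask how often the *minimum* p-value clears the threshold.
    pvals = rng.uniform(size=(n_sims, n_specs))
    return float((pvals.min(axis=1) < alpha).mean())

# Three choices at each of four forks -> 3**4 = 81 candidate specifications
analytic = 1 - (1 - 0.05) ** 81   # ~0.98
simulated = prob_any_significant(81)
print(round(analytic, 3), round(simulated, 3))
```

Real specifications computed on the same data are correlated rather than independent, so the true inflation is somewhat smaller, but the qualitative lesson survives: enough forks nearly guarantee a "significant" result under the null.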
What Goes in a Pre-Analysis Plan
A good pre-analysis plan (PAP) has the following components. Not every study needs all of them, but the more you can specify in advance, the stronger your credibility.
1. Research Question and Hypotheses
State your hypotheses clearly, in directional terms when possible. "We hypothesize that the program will increase test scores" is better than "we will examine the effect of the program on test scores."
2. Data Description
- What data will you use? (Survey, administrative, experimental)
- What is the sample? (Who is included, who is excluded, and why?)
- What is the unit of observation?
- When was/will the data be collected?
3. Variable Definitions
- Treatment variable: How is treatment defined? What constitutes treatment and control?
- Primary outcome(s): The main outcome variable(s) you will analyze. Be precise about construction (e.g., "math test score, standardized to have mean zero and standard deviation one in the control group").
- Secondary outcomes: Outcomes you will examine but that are not the main focus.
- Control variables: Which covariates will you include and why?
4. Estimation Strategy
- What is the estimating equation? Write it out explicitly (e.g., Y_i = α + β·Treat_i + X_i′γ + ε_i).
- What standard errors will you use? (Robust? Clustered? At what level?)
- What is the unit of randomization vs. unit of analysis?
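As a sketch of what the pre-specified estimation might look like once the data arrive, here is an illustrative Python/statsmodels version (the variable names and data are hypothetical; the pre-registered covariates and HC1 robust errors are the kind of choices a PAP would fix in advance):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Hypothetical data standing in for the pre-registered study
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),
    "age": rng.integers(22, 60, n),
    "female": rng.integers(0, 2, n),
})
df["y"] = 0.3 * df["treat"] + 0.01 * df["age"] + rng.standard_normal(n)

# ITT regression with pre-specified covariates and
# heteroskedasticity-robust (HC1) standard errors
fit = smf.ols("y ~ treat + age + female", data=df).fit(cov_type="HC1")
print(fit.params["treat"], fit.bse["treat"])
```

The point of writing this down in the PAP is that covariates, error structure, and estimand are fixed before the first regression is run, not after.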
5. Subgroup Analyses
- Which subgroups will you examine? (By gender, age, baseline characteristics?)
- Are these exploratory or confirmatory?
6. Multiple Testing
- How will you handle multiple outcomes? (Bonferroni? FDR? Romano-Wolf? Index construction?)
- Which outcomes are grouped into families?
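Of the corrections listed above, the Holm step-down adjustment is simple enough to sketch directly (illustrative Python; the family of p-values is made up):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values for a family of m tests.
    Controls the familywise error rate with less of a penalty than
    plain Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Step-down multiplier shrinks with rank: m, m-1, ..., 1
        p_adj = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, p_adj)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# Hypothetical family of three outcome p-values
print(holm_adjust([0.01, 0.04, 0.03]))  # -> approximately [0.03, 0.06, 0.06]
```

Note how the raw p = 0.03 and p = 0.04 both fail at the 5% level after adjustment, which is exactly the discipline a pre-specified family of outcomes imposes.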
7. Sample Size and Power Calculations
- What is your expected sample size?
- What is the minimum detectable effect?
- What assumptions underlie the power calculation?
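For a binary outcome, the minimum detectable effect can be sketched with a standard two-arm formula (illustrative Python; the 50% control mean and 500-per-arm sample are assumptions for the example, and the control-group variance is used as an approximation for both arms):

```python
from math import sqrt
from statistics import NormalDist

def mde_proportion(p_control, n_per_arm, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect for a two-arm comparison
    of proportions, using the control-group variance for both arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    se = sqrt(2 * p_control * (1 - p_control) / n_per_arm)
    return (z_alpha + z_power) * se

# Hypothetical inputs: 50% control mean, 500 participants per arm
print(round(mde_proportion(0.50, 500), 3))
```

Stating the formula and its inputs in the PAP lets readers audit whether the study was adequately powered for the effects it claims to test.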
8. Missing Data and Attrition
- How will you handle missing data? (Complete cases? Imputation?)
- What is your plan if attrition is differential?
- Will you compute Lee bounds?
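The logic of Lee bounds can be sketched in a few lines (illustrative Python; the data, response rates, and effect size are hypothetical, and this is a bare-bones version of the trimming idea, not a full implementation):

```python
import numpy as np

def lee_bounds(y_treat, y_control, resp_treat, resp_control):
    """Illustrative Lee (2009)-style bounds under differential attrition.
    Assumes the treatment arm has the higher response rate; trims the
    excess share q from the top (lower bound) and the bottom (upper
    bound) of the observed treated outcomes."""
    q = (resp_treat - resp_control) / resp_treat
    y = np.sort(np.asarray(y_treat))
    k = int(round(q * len(y)))
    lower = y[:len(y) - k].mean() - np.mean(y_control)  # drop top q share
    upper = y[k:].mean() - np.mean(y_control)           # drop bottom q share
    return lower, upper

rng = np.random.default_rng(2)
y_t = rng.normal(0.3, 1, 900)   # 900 observed of 1,000 assigned (90%)
y_c = rng.normal(0.0, 1, 800)   # 800 observed of 1,000 assigned (80%)
lo, hi = lee_bounds(y_t, y_c, resp_treat=0.9, resp_control=0.8)
print(round(lo, 2), round(hi, 2))
```

The width of the interval grows with the attrition differential, which is why pre-specifying a threshold (and the bounding procedure itself) belongs in the plan.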
9. Deviations Protocol
- Under what circumstances would you deviate from the plan?
- How will you flag deviations in the paper?
Where to Register
Three major registries dominate the social sciences:
AEA RCT Registry
- Best for: Randomized controlled trials in economics
- Features: Time-stamped, publicly searchable, can embargo until publication
- Cost: Free
- Accepted by: All top economics journals
EGAP Registry
- Best for: Experiments and observational studies in political science and policy
- Features: Strong community of practice, design-based studies
- Cost: Free
OSF (Open Science Framework)
- Best for: Any study in any field, including observational studies
- Features: Most flexible, integrates with GitHub, supports pre-prints and data hosting
- Cost: Free
How to Report Pre-Registration
Once you have pre-registered your study, you need to reference it properly in your paper and handle the inevitable deviations transparently.
Referencing the Pre-Registration in Your Paper
Papers based on a pre-registered study typically include the registration details. In the methods section or a footnote, provide:
- The registry name and registration number (e.g., "AEA RCT Registry #AEARCTR-0005432")
- The date of registration
- A URL or DOI linking to the pre-analysis plan
- Whether the PAP was registered before data collection, before data analysis, or after data collection but before analysis
Template language for the methods section:
This study was pre-registered on [Registry Name] on [Date] (Registration #[Number]; [URL]). The pre-analysis plan specified the primary outcomes, estimation strategy, and subgroup analyses reported below. All deviations from the pre-analysis plan are explicitly noted.
Handling Deviations from the Plan
You will almost certainly deviate from your pre-analysis plan. This deviation is normal and expected. What matters is transparency. For each deviation:
- Report the pre-registered analysis first. In most settings, show what you originally planned, even if you now believe a different specification is better.
- Present the revised analysis alongside it. Show both versions so readers can assess the sensitivity.
- Explain why you deviated. Common reasons include discovering data quality issues, a variable being unavailable, or a pre-specified model failing to converge. State the reason clearly.
- Flag the deviation explicitly. Use a footnote, a dedicated paragraph, or a table note to mark each change from the plan.
Reporting Pre-Registered and Exploratory Analyses Together
A well-structured paper contains both types of analyses, clearly distinguished:
- Confirmatory analyses are those specified in the PAP. Present these as your primary results. They carry the strongest evidentiary weight because they were chosen before seeing the data.
- Exploratory analyses are everything else — additional subgroups, alternative specifications, new outcomes you discovered along the way. These results are valuable and merit reporting, but they must be clearly labeled as exploratory.
A clean structure looks like this:
- Main results section: Report the pre-registered primary specification and primary outcomes.
- Additional results or extensions: Report exploratory analyses, clearly labeled. Use language like "In exploratory analyses not included in our pre-analysis plan, we find that..."
- Robustness section: Show that results hold under alternative specifications, including both pre-registered robustness checks and additional ones.
Template language for exploratory findings:
The following analyses were not included in our pre-analysis plan and should be interpreted as exploratory. We report them for transparency and to motivate future pre-registered investigations.
How to Do It: Code
While pre-registration is primarily a planning exercise, several tools help structure and document your pre-analysis plan:
# DeclareDesign helps you formally declare your research design
# before collecting data
library(DeclareDesign)

design <- declare_model(
  N = 500, U = rnorm(N),
  potential_outcomes(Y ~ 0.3 * Z + U)
) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, m = 250)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

# Diagnose expected power before running the study
diagnose_design(design, sims = 500)

A Template Walkthrough
Here is a concrete example of what a pre-analysis plan looks like for a hypothetical study evaluating a job training program:
Title: The Effect of WorkReady Job Training on Employment and Earnings
Hypotheses: H1: WorkReady increases employment rates at 12 months (primary). H2: WorkReady increases quarterly earnings at 12 months (primary). H3: WorkReady increases employment rates at 24 months (secondary).
Design: Randomized controlled trial. 2,000 applicants randomized 1:1 to treatment/control.
Primary specification: Y_i = α + β·Treat_i + X_i′γ + ε_i, where X_i includes age, gender, education, and baseline earnings (all pre-specified). Standard errors: robust (HC1). Estimand: intent-to-treat (ITT).
Multiple testing: Primary outcomes (H1, H2) will be adjusted using Holm step-down. Secondary outcome (H3) reported with unadjusted p-values but flagged as secondary.
Subgroups: We will examine heterogeneity by gender and by baseline education (above/below median). These subgroup analyses are labeled as exploratory.
Attrition: If differential attrition exceeds 5 percentage points, we will compute Lee bounds.
Power: With N = 2,000, we can detect a 4-percentage-point increase in employment (from a control mean of 55%) with 80% power at the 5% level.
The Promises and Perils
Olken (2015) offers a thoughtful assessment of both the benefits and costs of pre-analysis plans.
Promises:
- Eliminates (or greatly reduces) data mining and specification searching
- Distinguishes confirmatory from exploratory analyses
- Increases the credibility of significant results
- Provides a clear record of what was planned vs. what was discovered
Perils:
- Can be excessively rigid, preventing researchers from learning from the data
- May discourage creative exploration that leads to genuine discoveries
- A badly written PAP can lock you into a bad specification
- Some researchers write vague PAPs that do not actually constrain anything
The resolution is straightforward: pre-register your confirmatory analyses, clearly label any deviations, and conduct exploratory analyses freely — just call them exploratory. A paper can contain both pre-registered and exploratory results — this is the norm, not the exception. The key is transparency about which is which.
Concept Check
A researcher pre-registers a study with one primary outcome (test scores) and five secondary outcomes. The primary outcome shows p = 0.04 and two secondary outcomes show p = 0.03 each. What should the researcher report?
Paper Library
Foundational (5)
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The Preregistration Revolution.
Nosek and colleagues made the case for widespread adoption of pre-registration, arguing that it distinguishes confirmatory from exploratory analyses, reduces publication bias, and increases the credibility of empirical research. This paper helped catalyze the pre-registration movement across the social sciences.
Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., Glennerster, R., Green, D. P., Humphreys, M., Imbens, G., Laitin, D., Madon, T., Nelson, L., Nosek, B. A., Petersen, M., Sedlmayr, R., Simmons, J. P., Simonsohn, U., & Van der Laan, M. (2014). Promoting Transparency in Social Science Research.
A coalition of leading social scientists called for greater transparency in research, including pre-registration of studies and analysis plans, open data, and replication. This short but influential piece in Science helped establish the norms and infrastructure for pre-registration in social science.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.
Simmons, Nelson, and Simonsohn demonstrated how researcher degrees of freedom in data collection and analysis can inflate false-positive rates dramatically. Their paper, which proposed disclosure requirements and pre-registration as solutions, was one of the catalysts for the replication crisis and pre-registration movement.
Gelman, A., & Loken, E. (2013). The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No 'Fishing Expedition' or 'p-Hacking' and the Research Hypothesis Was Posited Ahead of Time.
Gelman and Loken argued that even without deliberate p-hacking, the multitude of defensible analytical choices creates a 'garden of forking paths' that inflates false-positive rates. This influential working paper provided a key intellectual motivation for pre-registration by showing that researcher degrees of freedom are unavoidable without pre-commitment.
Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science.
Gelman and Loken summarized the statistical crisis in science, emphasizing how researcher degrees of freedom and the garden of forking paths lead to unreliable findings. This accessible piece extended their 2013 working paper and reinforced the case for pre-registration as a solution to the replication crisis.
Application (5)
Olken, B. A. (2015). Promises and Perils of Pre-Analysis Plans.
Olken provided a balanced assessment of pre-analysis plans in development economics, discussing both benefits (reduced specification searching, increased credibility) and costs (loss of flexibility, difficulty specifying analyses in advance). This paper is essential reading for understanding the practical tradeoffs of pre-registration.
Coffman, L. C., & Niederle, M. (2015). Pre-Analysis Plans Have Limited Upside, Especially Where Replications Are Feasible.
Coffman and Niederle offered a skeptical perspective on pre-analysis plans, arguing that their benefits are limited when replication is feasible and that rigid adherence to pre-specified analyses can prevent researchers from learning from the data. This paper provides important counterarguments in the pre-registration debate.
Aguinis, H., Ramani, R. S., & Alabduljader, N. (2018). What You See Is What You Get? Enhancing Methodological Transparency in Management Research.
Aguinis, Ramani, and Alabduljader reviewed methodological transparency in management research and advocated for pre-registration, open data, and open materials. They documented the extent of undisclosed analytical flexibility in management studies and proposed concrete steps for improvement.
Christensen, G., & Miguel, E. (2018). Transparency, Reproducibility, and the Credibility of Economics Research.
Christensen and Miguel surveyed the transparency and reproducibility landscape in economics, documenting the growing adoption of pre-registration through the AEA RCT Registry and other platforms. They presented evidence on the prevalence of specification searching and publication bias, and made the case that pre-registration combined with pre-analysis plans substantially improves the credibility of empirical findings.
Casey, K., Glennerster, R., & Miguel, E. (2012). Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.
Casey, Glennerster, and Miguel provided one of the most prominent examples of a pre-analysis plan in economics, pre-registering their analysis of a community-driven development program in Sierra Leone. They demonstrated both the benefits of pre-commitment and the practical challenges of adhering to a pre-specified plan.
Survey (1)
Haven, T. L., & Van Grootel, L. (2019). Preregistering Qualitative Research.
Haven and Van Grootel explored extending pre-registration to qualitative research, discussing what elements of qualitative studies can and should be pre-registered. This paper broadens the pre-registration conversation beyond quantitative experimental designs.