Specification Curve Analysis
How much do your results depend on the specific analytical choices you made? Explore the full space of defensible specifications.
When to Use Specification Curve Analysis
Use specification curve analysis whenever your analysis involves choices that could plausibly affect your results — different outcome measures, control variable sets, sample restrictions, functional forms, or standard error specifications. It is especially valuable for observational studies where there is no single "correct" specification, and for pre-registered studies where you want to demonstrate that your headline result is not an artifact of one particular set of choices.
Why It Matters
When a paper reports a single specification, readers have no way of knowing whether that combination of choices is the one that happened to produce the most favorable result. Specification curve analysis makes this transparent by showing all defensible specifications simultaneously, letting readers judge whether the finding is robust or fragile. It is increasingly expected in top journals as a complement to pre-registration and sensitivity analysis.
The Problem of Researcher Degrees of Freedom
Every empirical analysis involves dozens of choices. Which sample do you use? Which control variables do you include? How do you define the outcome? Do you winsorize outliers at the 1st or 5th percentile? Do you use OLS or a nonlinear model? Do you cluster standard errors at the state level or the county level?
Each of these choices is defensible. But different choices lead to different results. And when a paper reports a single specification — the one the authors settled on after weeks of analysis — the reader has no way of knowing whether that particular combination of choices is the one that happened to produce the most favorable result.
This selectivity is the problem of researcher degrees of freedom. It is distinct from outright fraud and even from intentional p-hacking: it arises simply because the space of reasonable analytic choices is large, and reporting only one point in that space is inherently selective.
Specification curve analysis is the systematic method for exploring the entire space of defensible choices and presenting the results transparently.
Two Frameworks, One Idea
Specification Curve Analysis (SCA)
Simonsohn et al. (2020) introduced SCA as a method for assessing the robustness of a finding across all reasonable specifications. The key insight: rather than presenting one "preferred" specification with a few robustness checks in an appendix, run every defensible specification and show the full distribution of results.
Multiverse Analysis
Steegen et al. (2016) proposed multiverse analysis, which emphasizes the data processing choices that precede statistical analysis — how variables are constructed, how missing data are handled, how the sample is defined.
How It Differs from Sensitivity Analysis for Unobservables
It is important to distinguish specification curve analysis from the sensitivity analysis covered in the sensitivity analysis practice page. They address different questions:
| | Specification Curve / Multiverse | Sensitivity Analysis (Oster, Cinelli-Hazlett) |
|---|---|---|
| Asks | "Is the result robust across defensible analytic choices?" | "How strong would an unobserved confounder need to be to explain away the result?" |
| Source of concern | Researcher degrees of freedom | Omitted variable bias |
| Varies | Observed specification choices | Hypothetical unobserved variables |
| Output | Distribution of estimates across specifications | Robustness values, bias-adjusted estimates |
A result can be robust to specification choices but fragile to unobserved confounding (or vice versa). Ideally, both dimensions are assessed.
Building a Specification Curve: Step by Step
Step 1: Define the Analytic Universe
List all defensible choices along each dimension. "Defensible" means a reasonable researcher could justify this choice on methodological grounds — not that you personally prefer it.
| Dimension | Example Options |
|---|---|
| Outcome | Log earnings, level earnings, employment indicator |
| Sample | Full sample, ages 25–55 only, drop outliers above 99th pctile |
| Controls | None, demographics only, demographics + baseline outcomes |
| Functional form | OLS, Poisson, log-linear |
| Fixed effects | None, year FE, state FE, state-by-year FE |
| Standard errors | Robust (HC1), clustered by state, clustered by individual |
If you have 3 outcome definitions, 3 sample definitions, 3 control sets, 2 functional forms, 4 FE structures, and 2 SE choices, that is 3 × 3 × 3 × 2 × 4 × 2 = 432 specifications.
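The size of the analytic universe follows from taking the Cartesian product of the option lists. A quick sketch (the option names are illustrative, not from the original analysis):

```python
from itertools import product

# Hypothetical option lists mirroring the table above
outcomes = ["log_earnings", "earnings", "employed"]
samples = ["full", "ages_25_55", "trim_99"]
controls = ["none", "demographics", "demographics_baseline"]
forms = ["ols", "poisson"]
fixed_effects = ["none", "year", "state", "state_by_year"]
std_errors = ["hc1", "cluster_state"]

# Every combination of one choice per dimension is one specification
universe = list(product(outcomes, samples, controls, forms,
                        fixed_effects, std_errors))
print(len(universe))  # 3 * 3 * 3 * 2 * 4 * 2 = 432
```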
Step 2: Run All Specifications
This step is computationally intensive but straightforward. Loop over all combinations, estimate each model, and store the coefficient of interest, its standard error, and the p-value. Keep track of which choices produced each result.
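This loop can be sketched with plain numpy on synthetic data. The variable names, the small three-dimension universe, and the classical (non-robust) standard errors below are all simplifications for illustration:

```python
import numpy as np
from itertools import product

# Synthetic data standing in for the real dataset
rng = np.random.default_rng(0)
n = 500
treatment = rng.integers(0, 2, n).astype(float)
age = rng.uniform(20, 65, n)
gender = rng.integers(0, 2, n).astype(float)
earnings = 30 + 2.0 * treatment + 0.1 * age + rng.normal(0, 5, n)

def ols(y, X_cols):
    """OLS with intercept; returns coefficients and classical standard errors."""
    X = np.column_stack([np.ones(len(y))] + X_cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# One option list per dimension of the analytic universe
outcome_opts = {"earnings": earnings, "log_earnings": np.log(earnings)}
control_opts = {"none": [], "demographics": [age, gender]}
sample_opts = {"full": np.ones(n, dtype=bool),
               "prime_age": (age >= 25) & (age <= 55)}

# Loop over all combinations and store the treatment coefficient and SE
results = []
for (o_name, y), (c_name, ctrls), (s_name, mask) in product(
        outcome_opts.items(), control_opts.items(), sample_opts.items()):
    beta, se = ols(y[mask], [treatment[mask]] + [c[mask] for c in ctrls])
    results.append({"outcome": o_name, "controls": c_name, "sample": s_name,
                    "estimate": beta[1], "se": se[1]})  # index 1 = treatment

print(len(results))  # 2 * 2 * 2 = 8 specifications
```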
Step 3: Plot the Specification Curve
The canonical specification curve figure has two panels:
Top panel: Point estimates (and optionally confidence intervals) sorted from smallest to largest. A horizontal line at zero shows where effects flip sign.
Bottom panel: A grid showing which analytic choices were active for each specification. Each row corresponds to one dimension (outcome, controls, sample, etc.), and dots or highlighting indicates the active choice. This layout lets readers see which choices drive the variation in results.
The bottom panel is often more informative than the top. It reveals patterns like: "the estimate is always large when we use log earnings and always small when we use levels" — which tells you something important about the nature of the result.
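The two-panel layout can be sketched in matplotlib; the estimates and choice dimensions below are made up purely to show the structure:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical estimates and choice grid, just to illustrate the layout
rng = np.random.default_rng(1)
dims = {"outcome": ["levels", "log"], "controls": ["none", "full"],
        "sample": ["full", "prime age"]}
specs = [{"outcome": o, "controls": c, "sample": s,
          "estimate": rng.normal(0.10, 0.06)}
         for o in dims["outcome"] for c in dims["controls"] for s in dims["sample"]]
specs.sort(key=lambda s: s["estimate"])  # sorted smallest to largest

fig, (top, bottom) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 5), gridspec_kw={"height_ratios": [2, 1]})

xs = np.arange(len(specs))
top.scatter(xs, [s["estimate"] for s in specs])
top.axhline(0, color="grey", linewidth=0.8)  # effects flip sign here
top.set_ylabel("point estimate")

# Bottom panel: one row per (dimension, option); a dot marks the active choice
rows = [(d, o) for d, opts in dims.items() for o in opts]
for i, spec in enumerate(specs):
    for j, (d, o) in enumerate(rows):
        if spec[d] == o:
            bottom.scatter(i, j, color="black", s=12)
bottom.set_yticks(range(len(rows)))
bottom.set_yticklabels([f"{d}: {o}" for d, o in rows])
bottom.set_xlabel("specification (sorted by estimate)")
fig.tight_layout()
```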
Step 4: Joint Inference
Simonsohn et al. (2020) propose a permutation-based joint test. Under the null of no effect:
- Randomly permute the treatment variable
- Re-run all specifications on the permuted data
- Compute a summary statistic (e.g., the median estimate across specs, or the share of specs with positive estimates)
- Repeat many times to build a null distribution
- Compare the observed summary statistic to this distribution
This joint test asks: "Is the overall pattern of results across all specifications more extreme than what you would expect by chance alone?"
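A minimal version of this permutation test can be written with plain numpy; the data, the tiny four-spec universe, and the summary statistic below are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_perm = 300, 200

# Toy data with a true treatment effect of zero
treat = rng.integers(0, 2, n).astype(float)
x1 = rng.normal(size=n)
y = 0.5 * x1 + rng.normal(size=n)

def t_stat(yy, tr, controls):
    """t-statistic on the treatment coefficient from OLS with intercept."""
    X = np.column_stack([np.ones(len(yy)), tr] + controls)
    beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
    resid = yy - X @ beta
    cov = (resid @ resid / (len(yy) - X.shape[1])) * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

# A tiny universe: 2 outcome transforms x 2 control sets = 4 specifications
spec_universe = [(outcome, controls)
                 for outcome in (y, np.exp(y)) for controls in ([], [x1])]

def share_significant(tr):
    """Summary statistic: share of specs with |t| > 1.96."""
    return np.mean([abs(t_stat(yy, tr, c)) > 1.96 for yy, c in spec_universe])

observed = share_significant(treat)

# Null distribution: permute treatment, re-run all specs each time
null = np.array([share_significant(rng.permutation(treat))
                 for _ in range(n_perm)])
p_joint = (np.sum(null >= observed) + 1) / (n_perm + 1)
```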
In words: individual specification-level p-values suffer from a multiple testing problem, so the joint test instead asks whether the overall pattern across all specifications is consistent with the null, treating the entire specification curve as the unit of inference.
Suppose you run 432 specifications. Even under the null of no true effect, you expect about 22 of them (5% of 432) to be significant at the 5% level. Simply reporting "78% of specifications are significant" is meaningless without a reference distribution — a point closely related to the multiple testing problem.
The permutation-based joint test constructs this reference distribution. For each permutation of the treatment variable, you compute all 432 specifications and record summary statistics (median estimate, share significant, share positive). The distribution of these summary statistics under the null tells you what to expect by chance. If 78% of your actual specifications are significant but the null distribution says you would expect at most 15%, the joint test rejects.
This approach sidesteps the multiple testing problem entirely, because the unit of inference is the curve, not any individual specification.
Interactive: Exploring the Specification Space
Specification Curve Explorer
Explore how different analytic choices affect the estimated treatment effect. Toggle between outcome definitions, control sets, and sample restrictions. Watch the point estimate shift as you change each dimension. The full specification curve on the right shows all possible combinations — your current choice is highlighted.
Set the true effect to zero and watch how many specifications still appear significant due to chance. This exercise illustrates why joint inference is essential.
Researcher degrees of freedom: With just 5 binary choices, there are 32 possible specifications. Estimates range from 1.15 to 2.56. Specification curve analysis makes this variation transparent rather than hiding it behind a single “preferred” specification.
How to Do It: Code
library(specr)
# Create a subsetting variable for the specification space
df$age_group <- ifelse(df$age >= 25 & df$age <= 55, "prime_age", "other")
# Define the specification space (specr >= 1.0 API)
specs <- setup(
data = df,
y = c("earnings", "log_earnings"), # outcome options
x = "treatment", # treatment (fixed)
model = c("lm"), # model type
controls = c( # control sets
"age + gender",
"age + gender + education",
"age + gender + education + baseline_earnings"
),
subsets = list(age_group = c("prime_age")) # named list: variable = levels
)
# Run all specifications
results <- specr(specs)
# Summary statistics
summary(results)
# The canonical specification curve plot (estimates + choice indicators)
library(patchwork)
plot(results)

How to Report a Specification Curve
A well-reported specification curve section includes:
- A clear description of the analytic universe. What choices did you vary, and why is each option defensible?
- The total number of specifications.
- The specification curve figure with both the estimate panel and the choice indicator panel.
- Summary statistics. What fraction of specifications produce a positive estimate? What fraction are significant? What is the median estimate and its interquartile range?
- Joint inference. The p-value from the permutation test.
- Discussion of what drives variation. Which choices matter most?
Example:
Figure 3 displays the specification curve across 432 specifications varying the outcome definition (3 options), sample restrictions (3 options), control variables (3 options), functional form (2 options), fixed effects (4 options), and standard error clustering (2 options). Across all specifications, 94% produce positive estimates and 78% are statistically significant at the 5% level. The median estimate is 0.12 (IQR: 0.08–0.18). The joint permutation test, based on 1,000 permutations, rejects the null of no effect (p < 0.001). The primary source of variation is the outcome definition: specifications using log earnings yield systematically larger effects than those using levels.
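Assuming the estimates and p-values from Step 2 are stored in arrays (the values below are synthetic stand-ins), the summary statistics in a report like this are one-liners:

```python
import numpy as np

# Synthetic stand-ins for the 432 stored estimates and p-values
rng = np.random.default_rng(2)
estimates = rng.normal(0.12, 0.05, 432)
p_values = rng.uniform(0.0, 0.2, 432)

share_positive = np.mean(estimates > 0)
share_significant = np.mean(p_values < 0.05)
median_est = np.median(estimates)
q25, q75 = np.percentile(estimates, [25, 75])

print(f"{share_positive:.0%} positive, {share_significant:.0%} significant at 5%, "
      f"median = {median_est:.2f} (IQR: {q25:.2f}-{q75:.2f})")
```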
Common Mistakes
Concept Check
Your specification curve shows that 95% of specifications produce a statistically significant positive effect of a job training program. However, all specifications use the same observational data with no instrument or experiment. A colleague argues that the result is 'clearly causal' because it is robust across so many specifications. Is this reasoning correct?
Paper Library
Foundational (4)
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification Curve Analysis.
Simonsohn, Simmons, and Nelson introduced specification curve analysis, which systematically runs all reasonable specifications of a model and displays the distribution of estimates. This approach replaces selective reporting of specifications with a comprehensive view of how results depend on analytical choices.
Leamer, E. E. (1983). Let's Take the Con Out of Econometrics.
Leamer's classic paper argued that the sensitivity of empirical results to specification choices undermines the credibility of econometric evidence. He proposed extreme bounds analysis, an early form of systematic robustness testing that anticipated modern specification curve analysis by several decades.
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A Manifesto for Reproducible Science.
This manifesto identified threats to reproducible science, including analytical flexibility and specification searching, and proposed solutions including pre-registration and multiverse analysis. It provided the broader scientific reform context within which specification curve analysis emerged as a practical tool.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis.
Steegen and colleagues introduced multiverse analysis, which examines how results vary across the full set of defensible data processing and analytical decisions. This approach is closely related to specification curve analysis and emphasizes transparency about the garden of forking paths in data analysis.
Application (5)
Young, C., & Holsteen, K. (2017). Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.
Young and Holsteen developed a computational framework for systematically exploring model uncertainty by running thousands of plausible specifications. Their approach is one of the earliest implementations of what would become known as specification curve or multiverse analysis, applied to sociological research.
Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis.
Rohrer, Egloff, and Schmukle applied specification curve analysis to the long-debated question of whether birth order affects personality traits. By running all defensible specifications, they showed that most previously reported birth-order effects disappear, demonstrating the method's power to resolve contested empirical questions.
Goldfarb, B., & King, A. A. (2016). Scientific Apophenia in Strategic Management Research: Significance Tests & Mistaken Inference.
Goldfarb and King documented the problem of apophenia (finding patterns in noise) in strategic management research, driven partly by selective reporting of favorable specifications. They argued for multiverse-style robustness checks, making the case for specification curve analysis in management.
Orben, A., & Przybylski, A. K. (2019). The Association between Adolescent Well-Being and Digital Technology Use.
Orben and Przybylski applied specification curve analysis to the hotly debated question of whether digital technology use harms adolescent well-being, running over 20,000 specifications across three large datasets. They found that technology use has a negligible negative association with well-being, far smaller than commonly assumed, demonstrating how specification curve analysis can bring clarity to contested empirical questions by mapping the full space of defensible analytical choices.
Masicampo, E. J., & Lalande, D. (2012). A Peculiar Prevalence of p Values Just Below .05.
Masicampo and Lalande documented a suspicious clustering of p-values just below the .05 threshold in psychology journals, providing empirical evidence of publication bias and specification searching. Their findings motivate the use of specification curve analysis as a tool for assessing the robustness of results across analytical choices.