MethodAtlas
Robustness Stage

Specification Curve Analysis

How much do your results depend on the specific analytical choices you made? Explore the full space of defensible specifications.

When to Use Specification Curve Analysis

Use specification curve analysis whenever your analysis involves choices that could plausibly affect your results — different outcome measures, control variable sets, sample restrictions, functional forms, or standard error specifications. It is especially valuable for observational studies where there is no single "correct" specification, and for pre-registered studies where you want to demonstrate that your headline result is not an artifact of one particular set of choices.


Why It Matters

When a paper reports a single specification, readers have no way of knowing whether that combination of choices is the one that happened to produce the most favorable result. Specification curve analysis makes this transparent by showing all defensible specifications simultaneously, letting readers judge whether the finding is robust or fragile. It is increasingly expected in top journals as a complement to pre-registration and sensitivity analysis.


The Problem of Researcher Degrees of Freedom

Every empirical analysis involves dozens of choices. Which sample do you use? Which control variables do you include? How do you define the outcome? Do you winsorize outliers at the 1st or 5th percentile? Do you use OLS or a nonlinear model? Do you cluster standard errors at the state level or the county level?

Each of these choices is defensible. But different choices lead to different results. And when a paper reports a single specification — the one the authors settled on after weeks of analysis — the reader has no way of knowing whether that particular combination of choices is the one that happened to produce the most favorable result.

This selectivity is the problem of researcher degrees of freedom. It is distinct from outright fraud or even intentional p-hacking. It arises because the space of reasonable analytic choices is large, and reporting only one point in that space is inherently selective.

Specification curve analysis is the systematic method for exploring the entire space of defensible choices and presenting the results transparently (Simonsohn et al., 2020).

Two Frameworks, One Idea

Specification Curve Analysis (SCA)

Simonsohn et al. (2020) introduced SCA as a method for assessing the robustness of a finding across all reasonable specifications. The key insight: rather than presenting one "preferred" specification with a few robustness checks in an appendix, run every defensible specification and show the full distribution of results.

Multiverse Analysis

Steegen et al. (2016) proposed multiverse analysis, which emphasizes the data processing choices that precede statistical analysis — how variables are constructed, how missing data are handled, how the sample is defined.


How It Differs from Sensitivity Analysis for Unobservables

It is important to distinguish specification curve analysis from sensitivity analysis for unobservables, covered on its own practice page. The two address different questions:

| | Specification Curve / Multiverse | Sensitivity Analysis (Oster, Cinelli–Hazlett) |
|---|---|---|
| Asks | "Is the result robust across defensible analytic choices?" | "How strong would an unobserved confounder need to be to explain away the result?" |
| Source of concern | Researcher degrees of freedom | Omitted variable bias |
| Varies | Observed specification choices | Hypothetical unobserved variables |
| Output | Distribution of estimates across specifications | Robustness values, bias-adjusted estimates |

A result can be robust to specification choices but fragile to unobserved confounding (or vice versa). Ideally, both dimensions are assessed.


Building a Specification Curve: Step by Step

Step 1: Define the Analytic Universe

List all defensible choices along each dimension. "Defensible" means a reasonable researcher could justify this choice on methodological grounds — not that you personally prefer it.

| Dimension | Example Options |
|---|---|
| Outcome | Log earnings, level earnings, employment indicator |
| Sample | Full sample, ages 25–55 only, drop outliers above 99th percentile |
| Controls | None, demographics only, demographics + baseline outcomes |
| Functional form | OLS, Poisson, log-linear |
| Fixed effects | None, year FE, state FE, state-by-year FE |
| Standard errors | Robust (HC1), clustered by state, clustered by individual |

If you have 3 outcome definitions, 3 sample definitions, 3 control sets, 2 functional forms, 4 FE structures, and 2 SE choices, the full universe is 3 × 3 × 3 × 2 × 4 × 2 = 432 specifications.
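Enumerating such a universe is a one-liner with a Cartesian product. A minimal Python sketch, using illustrative option names (not tied to any real dataset):

```python
from itertools import product

# Illustrative analytic universe: each key is a dimension, each list its options
universe = {
    "outcome": ["log_earnings", "earnings", "employed"],
    "sample": ["full", "ages_25_55", "drop_above_p99"],
    "controls": ["none", "demographics", "demographics_plus_baseline"],
    "functional_form": ["ols", "poisson"],
    "fixed_effects": ["none", "year", "state", "state_by_year"],
    "standard_errors": ["hc1", "cluster_state"],
}

# Every combination of one option per dimension is one specification
specs = [dict(zip(universe, combo)) for combo in product(*universe.values())]
print(len(specs))  # 3 * 3 * 3 * 2 * 4 * 2 = 432
```

Storing each specification as a dictionary makes the choice-indicator panel of the figure straightforward to build later.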

Step 2: Run All Specifications

This step is computationally intensive but straightforward. Loop over all combinations, estimate each model, and store the coefficient of interest, its standard error, and the p-value. Keep track of which choices produced each result.

Step 3: Plot the Specification Curve

The canonical specification curve figure has two panels:

Top panel: Point estimates (and optionally confidence intervals) sorted from smallest to largest. A horizontal line at zero shows where effects flip sign.

Bottom panel: A grid showing which analytic choices were active for each specification. Each row corresponds to one dimension (outcome, controls, sample, etc.), and dots or highlighting indicates the active choice. This layout lets readers see which choices drive the variation in results.

The bottom panel is often more informative than the top. It reveals patterns like: "the estimate is always large when we use log earnings and always small when we use levels" — which tells you something important about the nature of the result.
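The two-panel layout can be assembled in a few lines of matplotlib. A sketch on fabricated estimates and choice indicators (the layout, not the numbers, is the point here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n_specs = 48
estimates = np.sort(rng.normal(0.10, 0.08, n_specs))  # sorted point estimates
choices = rng.integers(0, 2, (3, n_specs))            # fabricated choice indicators
dims = ["log outcome", "full controls", "restricted sample"]

fig, (top, bottom) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 5),
    gridspec_kw={"height_ratios": [2, 1]})

# Top panel: estimates sorted smallest to largest; zero line marks sign flips
top.plot(np.arange(n_specs), estimates, "k.")
top.axhline(0, color="grey", linewidth=1)
top.set_ylabel("Estimate")

# Bottom panel: tick marks show which choices are active per specification
for row, name in enumerate(dims):
    active = np.flatnonzero(choices[row])
    bottom.plot(active, np.full(active.size, row), "|", markersize=10)
bottom.set_yticks(range(len(dims)))
bottom.set_yticklabels(dims)
bottom.set_xlabel("Specification (sorted by estimate)")

fig.savefig("spec_curve.png", dpi=100)
```

Sharing the x-axis between panels is what lets readers trace a cluster of large estimates down to the choices that produced it.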

Step 4: Joint Inference

Simonsohn et al. (2020) propose a permutation-based joint test. Under the null of no effect:

  1. Randomly permute the treatment variable
  2. Re-run all specifications on the permuted data
  3. Compute a summary statistic (e.g., the median estimate across specs, or the share of specs with positive estimates)
  4. Repeat many times to build a null distribution
  5. Compare the observed summary statistic to this distribution

This joint test asks: "Is the overall pattern of results across all specifications more extreme than what you would expect by chance alone?"

In plain terms: individual specification-level p-values suffer from multiple testing. The joint test instead asks whether the overall pattern across all specifications is consistent with the null, treating the entire specification curve as the unit of inference.

Suppose you run 432 specifications. Even under the null of no true effect, you expect about 432 × 0.05 ≈ 22 specifications to be significant at the 5% level. Simply reporting "78% of specifications are significant" is meaningless without a reference distribution — a point closely related to the multiple testing problem.

The permutation-based joint test constructs this reference distribution. For each permutation of the treatment variable, you compute all 432 specifications and record summary statistics (median estimate, share significant, share positive). The distribution of these summary statistics under the null tells you what to expect by chance. If 78% of your actual specifications are significant but the null distribution says you would expect at most 15%, the joint test rejects.

This approach sidesteps the multiple testing problem entirely, because the unit of inference is the curve, not any individual specification.
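A toy version of the permutation test in Python, using difference in means as the estimator and a deliberately tiny four-specification universe (all numbers and the winsorization/sample dimensions are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
treat = rng.integers(0, 2, n)
age = rng.normal(40, 10, n)
y = 0.25 * treat + rng.normal(0, 1, n)  # simulated outcome with a true effect

def run_all_specs(outcome, treatment):
    """Toy 4-spec universe: {full, prime-age sample} x {raw, winsorized outcome}."""
    masks = [np.ones(n, dtype=bool), (age >= 25) & (age <= 55)]
    lo, hi = np.quantile(outcome, [0.01, 0.99])
    outcomes = [outcome, np.clip(outcome, lo, hi)]
    ests = []
    for mask in masks:
        for out in outcomes:
            ests.append(out[mask][treatment[mask] == 1].mean()
                        - out[mask][treatment[mask] == 0].mean())
    return np.array(ests)

observed_median = np.median(run_all_specs(y, treat))

# Null distribution: permute treatment, re-run the entire universe each time
null_medians = np.array([
    np.median(run_all_specs(y, rng.permutation(treat)))
    for _ in range(999)])

# One-sided joint p-value: share of permuted curves at least as extreme
p_joint = (1 + np.sum(null_medians >= observed_median)) / (1 + len(null_medians))
print(f"joint p-value: {p_joint:.3f}")
```

The key design point is that each permutation re-runs all specifications, so the null distribution of the summary statistic inherits the correlation structure across specifications rather than treating them as independent tests.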


Interactive: Exploring the Specification Space

Interactive Simulation

Specification Curve Explorer

Explore how different analytic choices affect the estimated treatment effect. Toggle between outcome definitions, control sets, and sample restrictions. Watch the point estimate shift as you change each dimension. The full specification curve on the right shows all possible combinations — your current choice is highlighted.

[Interactive widget: parameter sliders set the simulated true effect and the number of options per specification dimension; the estimates and cross-specification curve update live.]

Set the true effect to zero and watch how many specifications still appear significant due to chance. This exercise illustrates why joint inference is essential.

Interactive Simulation

Specification Curve Explorer

Toggle specification choices to see how researcher degrees of freedom produce a wide range of estimates. Each combination of controls, sample, and outcome transformation yields a different point estimate.

[Interactive widget: specification curve sorted by estimate, with choice indicators for controls (age, education, income), outlier exclusion, and log transformation of the outcome. Summary for the displayed simulation: 32/32 specifications significant at 5%; estimate range 1.15–2.56; median estimate 2.02.]

Researcher degrees of freedom: With just 5 binary choices, there are 32 possible specifications. Estimates range from 1.15 to 2.56. Specification curve analysis makes this variation transparent rather than hiding it behind a single “preferred” specification.


How to Do It: Code

library(specr)

# Create a subsetting variable for the specification space
df$age_group <- ifelse(df$age >= 25 & df$age <= 55, "prime_age", "other")

# Define the specification space (specr >= 1.0 API)
specs <- setup(
  data = df,
  y = c("earnings", "log_earnings"),          # outcome options
  x = "treatment",                            # treatment (fixed)
  model = c("lm"),                            # model type
  controls = c(                               # control sets
    "age + gender",
    "age + gender + education",
    "age + gender + education + baseline_earnings"
  ),
  subsets = list(age_group = c("prime_age"))  # named list: variable = levels
)

# Run all specifications
results <- specr(specs)

# Summary statistics
summary(results)

# The canonical specification curve plot (estimates + choice indicators)
library(patchwork)
plot(results)
Requires: specr

How to Report a Specification Curve

A well-reported specification curve section includes:

  1. A clear description of the analytic universe. What choices did you vary, and why is each option defensible?
  2. The total number of specifications.
  3. The specification curve figure with both the estimate panel and the choice indicator panel.
  4. Summary statistics. What fraction of specifications produce a positive estimate? What fraction are significant? What is the median estimate and its interquartile range?
  5. Joint inference. The p-value from the permutation test.
  6. Discussion of what drives variation. Which choices matter most?

Example:

Figure 3 displays the specification curve across 432 specifications varying the outcome definition (3 options), sample restrictions (3 options), control variables (3 options), functional form (2 options), fixed effects (4 options), and standard error clustering (2 options). Across all specifications, 94% produce positive estimates and 78% are statistically significant at the 5% level. The median estimate is 0.12 (IQR: 0.08–0.18). The joint permutation test, based on 1,000 permutations, rejects the null of no effect (p < 0.001). The primary source of variation is the outcome definition: specifications using log earnings yield systematically larger effects than those using levels.
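The summary statistics in item 4 are one-liners once the per-specification estimates and p-values are stored as arrays. A sketch with fabricated arrays:

```python
import numpy as np

rng = np.random.default_rng(2)
estimates = rng.normal(0.12, 0.05, 432)  # fabricated: one estimate per spec
pvalues = rng.uniform(0.0, 0.2, 432)     # fabricated: one p-value per spec

share_positive = (estimates > 0).mean()
share_significant = (pvalues < 0.05).mean()
median_est = np.median(estimates)
q25, q75 = np.quantile(estimates, [0.25, 0.75])

print(f"{share_positive:.0%} positive, {share_significant:.0%} significant at 5%")
print(f"median estimate {median_est:.2f} (IQR: {q25:.2f}-{q75:.2f})")
```

Note that the share significant is descriptive only; as discussed above, the joint permutation test is what turns it into an inferential statement.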


Common Mistakes


Concept Check

Your specification curve shows that 95% of specifications produce a statistically significant positive effect of a job training program. However, all specifications use the same observational data with no instrument or experiment. A colleague argues that the result is 'clearly causal' because it is robust across so many specifications. Is this reasoning correct?


Paper Library

Foundational (4)

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification Curve Analysis.

Nature Human BehaviourDOI: 10.1038/s41562-020-0912-z

Simonsohn, Simmons, and Nelson introduced specification curve analysis, which systematically runs all reasonable specifications of a model and displays the distribution of estimates. This approach replaces selective reporting of specifications with a comprehensive view of how results depend on analytical choices.

Leamer, E. E. (1983). Let's Take the Con Out of Econometrics.

American Economic Review

Leamer's classic paper argued that the sensitivity of empirical results to specification choices undermines the credibility of econometric evidence. He proposed extreme bounds analysis, an early form of systematic robustness testing that anticipated modern specification curve analysis by several decades.

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A Manifesto for Reproducible Science.

Nature Human BehaviourDOI: 10.1038/s41562-016-0021

This manifesto identified threats to reproducible science, including analytical flexibility and specification searching, and proposed solutions including pre-registration and multiverse analysis. It provided the broader scientific reform context within which specification curve analysis emerged as a practical tool.

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis.

Perspectives on Psychological ScienceDOI: 10.1177/1745691616658637

Steegen and colleagues introduced multiverse analysis, which examines how results vary across the full set of defensible data processing and analytical decisions. This approach is closely related to specification curve analysis and emphasizes transparency about the garden of forking paths in data analysis.

Application (5)

Young, C., & Holsteen, K. (2017). Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.

Sociological Methods & ResearchDOI: 10.1177/0049124115610347

Young and Holsteen developed a computational framework for systematically exploring model uncertainty by running thousands of plausible specifications. Their approach is one of the earliest implementations of what would become known as specification curve or multiverse analysis, applied to sociological research.

Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis.

Psychological ScienceDOI: 10.1177/0956797617723726

Rohrer, Egloff, and Schmukle applied specification curve analysis to the long-debated question of whether birth order affects personality traits. By running all defensible specifications, they showed that most previously reported birth-order effects disappear, demonstrating the method's power to resolve contested empirical questions.

Goldfarb, B., & King, A. A. (2016). Scientific Apophenia in Strategic Management Research: Significance Tests & Mistaken Inference.

Strategic Management JournalDOI: 10.1002/smj.2459

Goldfarb and King documented the problem of apophenia (finding patterns in noise) in strategic management research, driven partly by selective reporting of favorable specifications. They argued for multiverse-style robustness checks, making the case for specification curve analysis in management.

Orben, A., & Przybylski, A. K. (2019). The Association between Adolescent Well-Being and Digital Technology Use.

Nature Human BehaviourDOI: 10.1038/s41562-018-0506-1

Orben and Przybylski applied specification curve analysis to the hotly debated question of whether digital technology use harms adolescent well-being, running over 20,000 specifications across three large datasets. They found that technology use has a negligible negative association with well-being, far smaller than commonly assumed, demonstrating how specification curve analysis can bring clarity to contested empirical questions by mapping the full space of defensible analytical choices.

Masicampo, E. J., & Lalande, D. (2012). A Peculiar Prevalence of p Values Just Below .05.

Quarterly Journal of Experimental PsychologyDOI: 10.1080/17470218.2012.711335

Masicampo and Lalande documented a suspicious clustering of p-values just below the .05 threshold in psychology journals, providing empirical evidence of publication bias and specification searching. Their findings motivate the use of specification curve analysis as a tool for assessing the robustness of results across analytical choices.