MethodAtlas
Robustness Stage

Specification Curve Analysis

How much do your results depend on the specific analytical choices you made? Explore the full space of defensible specifications.

When to Use Specification Curve Analysis

Use specification curve analysis whenever your analysis involves choices that could plausibly affect your results — different outcome measures, control variable sets, sample restrictions, functional forms, or standard error specifications. It is especially valuable for observational studies where there is no single "correct" specification, and for pre-registered studies where you want to demonstrate that your headline result is not an artifact of one particular set of choices.


Why It Matters

When a paper reports a single specification, readers have no way of knowing whether that combination of choices is the one that happened to produce the most favorable result. Specification curve analysis makes this transparent by showing all defensible specifications simultaneously, letting readers judge whether the finding is robust or fragile. It is increasingly expected in top journals as a complement to pre-registration and sensitivity analysis.


The Problem of Researcher Degrees of Freedom

Every empirical analysis involves dozens of choices. Which sample do you use? Which control variables do you include? How do you define the outcome? Do you winsorize outliers at the 1st or 5th percentile? Do you use OLS or a nonlinear model? Do you cluster standard errors at the state level or the county level?

Each of these choices is defensible. But different choices lead to different results. And when a paper reports a single specification — the one the authors settled on after weeks of analysis — the reader has no way of knowing whether that particular combination of choices is the one that happened to produce the most favorable result.

This selectivity is the problem of researcher degrees of freedom. It is distinct from outright fraud or even intentional p-hacking. It arises because the space of reasonable analytic choices is large, and reporting only one point in that space is inherently selective.

Specification curve analysis is the systematic method for exploring the entire space of defensible choices and presenting the results transparently (Simonsohn et al., 2020).

Two Frameworks, One Idea

Specification Curve Analysis (SCA)

Simonsohn et al. (2020) introduced SCA as a method for assessing the robustness of a finding across all reasonable specifications. The key insight: rather than presenting one "preferred" specification with a few robustness checks in an appendix, run every defensible specification and show the full distribution of results.

Multiverse Analysis

Steegen et al. (2016) proposed multiverse analysis, which emphasizes the data processing choices that precede statistical analysis — how variables are constructed, how missing data are handled, how the sample is defined.


How It Differs from Sensitivity Analysis for Unobservables

It is important to distinguish specification curve analysis from sensitivity analysis for unobservables, covered on its own practice page. The two address different questions:

| | Specification Curve / Multiverse | Sensitivity Analysis (Oster, Cinelli–Hazlett) |
|---|---|---|
| Asks | "Is the result robust across defensible analytic choices?" | "How strong would an unobserved confounder need to be to explain away the result?" |
| Source of concern | Researcher degrees of freedom | Omitted variable bias |
| Varies | Observed specification choices | Hypothetical unobserved variables |
| Output | Distribution of estimates across specifications | Robustness values, bias-adjusted estimates |

A result can be robust to specification choices but fragile to unobserved confounding (or vice versa). Ideally, both dimensions are assessed.


Building a Specification Curve: Step by Step

Step 1: Define the Analytic Universe

List all defensible choices along each dimension. "Defensible" means a reasonable researcher could justify this choice on methodological grounds — not that you personally prefer it.

| Dimension | Example Options |
|---|---|
| Outcome | Log earnings, level earnings, employment indicator |
| Sample | Full sample, ages 25–55 only, drop outliers above 99th percentile |
| Controls | None, demographics only, demographics + baseline outcomes |
| Functional form | OLS, Poisson, log-linear |
| Fixed effects | None, year FE, state FE, state-by-year FE |
| Standard errors | Robust (HC1), clustered by state, clustered by individual |

If you have 3 outcome definitions, 3 sample definitions, 3 control sets, 2 functional forms, 4 FE structures, and 2 SE choices, the full universe is 3 × 3 × 3 × 2 × 4 × 2 = 432 specifications.
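Enumerating such a universe is a one-liner with a Cartesian product. A minimal Python sketch, using illustrative option names (not tied to any real dataset):

```python
from itertools import product

# Illustrative analytic universe: each key is a dimension, each list its options
universe = {
    "outcome": ["log_earnings", "earnings", "employed"],
    "sample": ["full", "ages_25_55", "drop_above_p99"],
    "controls": ["none", "demographics", "demographics_plus_baseline"],
    "functional_form": ["ols", "poisson"],
    "fixed_effects": ["none", "year", "state", "state_by_year"],
    "standard_errors": ["hc1", "cluster_state"],
}

# Every combination of one option per dimension is one specification
specs = [dict(zip(universe, combo)) for combo in product(*universe.values())]
print(len(specs))  # 3 * 3 * 3 * 2 * 4 * 2 = 432
```

Storing each specification as a dictionary makes the choice-indicator panel of the figure straightforward to build later.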

Step 2: Run All Specifications

This step is computationally intensive but straightforward. Loop over all combinations, estimate each model, and store the coefficient of interest, its standard error, and the p-value. Keep track of which choices produced each result.

Step 3: Plot the Specification Curve

The canonical specification curve figure has two panels:

Top panel: Point estimates (and optionally confidence intervals) sorted from smallest to largest. A horizontal line at zero shows where effects flip sign.

Bottom panel: A grid showing which analytic choices were active for each specification. Each row corresponds to one dimension (outcome, controls, sample, etc.), and dots or highlighting indicates the active choice. This layout lets readers see which choices drive the variation in results.

The bottom panel is often more informative than the top. It reveals patterns like: "the estimate is always large when we use log earnings and always small when we use levels" — which tells you something important about the nature of the result.
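The two-panel layout can be assembled in a few lines of matplotlib. A sketch on fabricated estimates and choice indicators (the layout, not the numbers, is the point here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n_specs = 48
estimates = np.sort(rng.normal(0.10, 0.08, n_specs))  # sorted point estimates
choices = rng.integers(0, 2, (3, n_specs))            # fabricated choice indicators
dims = ["log outcome", "full controls", "restricted sample"]

fig, (top, bottom) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 5),
    gridspec_kw={"height_ratios": [2, 1]})

# Top panel: estimates sorted smallest to largest; zero line marks sign flips
top.plot(np.arange(n_specs), estimates, "k.")
top.axhline(0, color="grey", linewidth=1)
top.set_ylabel("Estimate")

# Bottom panel: tick marks show which choices are active per specification
for row, name in enumerate(dims):
    active = np.flatnonzero(choices[row])
    bottom.plot(active, np.full(active.size, row), "|", markersize=10)
bottom.set_yticks(range(len(dims)))
bottom.set_yticklabels(dims)
bottom.set_xlabel("Specification (sorted by estimate)")

fig.savefig("spec_curve.png", dpi=100)
```

Sharing the x-axis between panels is what lets readers trace a cluster of large estimates down to the choices that produced it.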

Step 4: Joint Inference

Simonsohn et al. (2020) propose a permutation-based joint test. Under the null of no effect:

  1. Randomly permute the treatment variable
  2. Re-run all specifications on the permuted data
  3. Compute a summary statistic (e.g., the median estimate across specs, or the share of specs with positive estimates)
  4. Repeat many times to build a null distribution
  5. Compare the observed summary statistic to this distribution

This joint test asks: "Is the overall pattern of results across all specifications more extreme than what you would expect by chance alone?"

In plain terms: individual specification-level p-values suffer from multiple testing. The joint test instead asks whether the overall pattern across all specifications is consistent with the null, treating the entire specification curve as the unit of inference.

Suppose you run 432 specifications. Even under the null of no true effect, you expect about 432 × 0.05 ≈ 22 specifications to be significant at the 5% level. Simply reporting "78% of specifications are significant" is meaningless without a reference distribution — a point closely related to the multiple testing problem.

The permutation-based joint test constructs this reference distribution. For each permutation of the treatment variable, you compute all 432 specifications and record summary statistics (median estimate, share significant, share positive). The distribution of these summary statistics under the null tells you what to expect by chance. If 78% of your actual specifications are significant but the null distribution says you would expect at most 15%, the joint test rejects.

This approach sidesteps the multiple testing problem entirely, because the unit of inference is the curve, not any individual specification.
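A toy version of the permutation test in Python, using difference in means as the estimator and a deliberately tiny four-specification universe (all numbers and the winsorization/sample dimensions are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
treat = rng.integers(0, 2, n)
age = rng.normal(40, 10, n)
y = 0.25 * treat + rng.normal(0, 1, n)  # simulated outcome with a true effect

def run_all_specs(outcome, treatment):
    """Toy 4-spec universe: {full, prime-age sample} x {raw, winsorized outcome}."""
    masks = [np.ones(n, dtype=bool), (age >= 25) & (age <= 55)]
    lo, hi = np.quantile(outcome, [0.01, 0.99])
    outcomes = [outcome, np.clip(outcome, lo, hi)]
    ests = []
    for mask in masks:
        for out in outcomes:
            ests.append(out[mask][treatment[mask] == 1].mean()
                        - out[mask][treatment[mask] == 0].mean())
    return np.array(ests)

observed_median = np.median(run_all_specs(y, treat))

# Null distribution: permute treatment, re-run the entire universe each time
null_medians = np.array([
    np.median(run_all_specs(y, rng.permutation(treat)))
    for _ in range(999)])

# One-sided joint p-value: share of permuted curves at least as extreme
p_joint = (1 + np.sum(null_medians >= observed_median)) / (1 + len(null_medians))
print(f"joint p-value: {p_joint:.3f}")
```

The key design point is that each permutation re-runs all specifications, so the null distribution of the summary statistic inherits the correlation structure across specifications rather than treating them as independent tests.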


Interactive: Exploring the Specification Space

Interactive Simulation

Specification Curve Explorer

Explore how different analytic choices affect the estimated treatment effect. Toggle between outcome definitions, control sets, and sample restrictions. Watch the point estimate shift as you change each dimension. The full specification curve on the right shows all possible combinations — your current choice is highlighted.

[Interactive widget: parameter sliders set the simulated true effect and the number of options per specification dimension; the estimates and cross-specification curve update live.]

Set the true effect to zero and watch how many specifications still appear significant due to chance. This exercise illustrates why joint inference is essential.

Interactive Simulation

Specification Curve Explorer

Toggle specification choices to see how researcher degrees of freedom produce a wide range of estimates. Each combination of controls, sample, and outcome transformation yields a different point estimate.

[Interactive widget: specification curve sorted by estimate, with choice indicators for controls (age, education, income), outlier exclusion, and log transformation of the outcome. Summary for the displayed simulation: 32/32 specifications significant at 5%; estimate range 1.15–2.56; median estimate 2.02.]

Researcher degrees of freedom: With just 5 binary choices, there are 32 possible specifications. Estimates range from 1.15 to 2.56. Specification curve analysis makes this variation transparent rather than hiding it behind a single “preferred” specification.


How to Do It: Code

library(specr)

# Create a subsetting variable for the specification space
df$age_group <- ifelse(df$age >= 25 & df$age <= 55, "prime_age", "other")

# Define the specification space (specr >= 1.0 API)
specs <- setup(
  data = df,
  y = c("earnings", "log_earnings"),          # outcome options
  x = "treatment",                            # treatment (fixed)
  model = c("lm"),                            # model type
  controls = c(                               # control sets
    "age + gender",
    "age + gender + education",
    "age + gender + education + baseline_earnings"
  ),
  subsets = list(age_group = c("prime_age"))  # named list: variable = levels
)

# Run all specifications
results <- specr(specs)

# Summary statistics
summary(results)

# The canonical specification curve plot (estimates + choice indicators)
library(patchwork)
plot(results)
Requires: specr

How to Report a Specification Curve

A well-reported specification curve section includes:

  1. A clear description of the analytic universe. What choices did you vary, and why is each option defensible?
  2. The total number of specifications.
  3. The specification curve figure with both the estimate panel and the choice indicator panel.
  4. Summary statistics. What fraction of specifications produce a positive estimate? What fraction are significant? What is the median estimate and its interquartile range?
  5. Joint inference. The p-value from the permutation test.
  6. Discussion of what drives variation. Which choices matter most?

Example:

Figure 3 displays the specification curve across 432 specifications varying the outcome definition (3 options), sample restrictions (3 options), control variables (3 options), functional form (2 options), fixed effects (4 options), and standard error clustering (2 options). Across all specifications, 94% produce positive estimates and 78% are statistically significant at the 5% level. The median estimate is 0.12 (IQR: 0.08–0.18). The joint permutation test, based on 1,000 permutations, rejects the null of no effect (p < 0.001). The primary source of variation is the outcome definition: specifications using log earnings yield systematically larger effects than those using levels.
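The summary statistics in item 4 are one-liners once the per-specification estimates and p-values are stored as arrays. A sketch with fabricated arrays:

```python
import numpy as np

rng = np.random.default_rng(2)
estimates = rng.normal(0.12, 0.05, 432)  # fabricated: one estimate per spec
pvalues = rng.uniform(0.0, 0.2, 432)     # fabricated: one p-value per spec

share_positive = (estimates > 0).mean()
share_significant = (pvalues < 0.05).mean()
median_est = np.median(estimates)
q25, q75 = np.quantile(estimates, [0.25, 0.75])

print(f"{share_positive:.0%} positive, {share_significant:.0%} significant at 5%")
print(f"median estimate {median_est:.2f} (IQR: {q25:.2f}-{q75:.2f})")
```

Note that the share significant is descriptive only; as discussed above, the joint permutation test is what turns it into an inferential statement.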


Common Mistakes


Concept Check

Your specification curve shows that 95% of specifications produce a statistically significant positive effect of a job training program. However, all specifications use the same observational data with no instrument or experiment. A colleague argues that the result is 'clearly causal' because it is robust across so many specifications. Is this reasoning correct?


Paper Library

Foundational (4)

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification Curve Analysis.

Nature Human BehaviourDOI: 10.1038/s41562-020-0912-z

Simonsohn, Simmons, and Nelson introduced specification curve analysis, which systematically runs all reasonable specifications of a model and displays the distribution of estimates. This approach replaces selective reporting of specifications with a comprehensive view of how results depend on analytical choices.

Leamer, E. E. (1983). Let's Take the Con Out of Econometrics.

American Economic Review

Leamer's classic paper argued that the sensitivity of empirical results to specification choices undermines the credibility of econometric evidence. He proposed extreme bounds analysis, an early form of systematic robustness testing that anticipated modern specification curve analysis by several decades.

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A Manifesto for Reproducible Science.

Nature Human BehaviourDOI: 10.1038/s41562-016-0021

This manifesto identified threats to reproducible science, including analytical flexibility and specification searching, and proposed solutions including pre-registration and multiverse analysis. It provided the broader scientific reform context within which specification curve analysis emerged as a practical tool.

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis.

Perspectives on Psychological ScienceDOI: 10.1177/1745691616658637

Steegen and colleagues introduced multiverse analysis, which examines how results vary across the full set of defensible data processing and analytical decisions. This approach is closely related to specification curve analysis and emphasizes transparency about the garden of forking paths in data analysis.

Application (5)

Young, C., & Holsteen, K. (2017). Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.

Sociological Methods & ResearchDOI: 10.1177/0049124115610347

Young and Holsteen developed a computational framework for systematically exploring model uncertainty by running thousands of plausible specifications. Their approach is one of the earliest implementations of what would become known as specification curve or multiverse analysis, applied to sociological research.

Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis.

Psychological ScienceDOI: 10.1177/0956797617723726

Rohrer, Egloff, and Schmukle applied specification curve analysis to the long-debated question of whether birth order affects personality traits. By running all defensible specifications, they showed that most previously reported birth-order effects disappear, demonstrating the method's power to resolve contested empirical questions.

Goldfarb, B., & King, A. A. (2016). Scientific Apophenia in Strategic Management Research: Significance Tests & Mistaken Inference.

Strategic Management JournalDOI: 10.1002/smj.2459

Goldfarb and King documented the problem of apophenia (finding patterns in noise) in strategic management research, driven partly by selective reporting of favorable specifications. They argued for multiverse-style robustness checks, making the case for specification curve analysis in management.

Orben, A., & Przybylski, A. K. (2019). The Association between Adolescent Well-Being and Digital Technology Use.

Nature Human BehaviourDOI: 10.1038/s41562-018-0506-1

Orben and Przybylski applied specification curve analysis to the hotly debated question of whether digital technology use harms adolescent well-being, running over 20,000 specifications across three large datasets. They found that technology use has a negligible negative association with well-being, far smaller than commonly assumed, demonstrating how specification curve analysis can bring clarity to contested empirical questions by mapping the full space of defensible analytical choices.

Masicampo, E. J., & Lalande, D. (2012). A Peculiar Prevalence of p Values Just Below .05.

Quarterly Journal of Experimental PsychologyDOI: 10.1080/17470218.2012.711335

Masicampo and Lalande documented a suspicious clustering of p-values just below the .05 threshold in psychology journals, providing empirical evidence of publication bias and specification searching. Their findings motivate the use of specification curve analysis as a tool for assessing the robustness of results across analytical choices.