Specification Curve Analysis
How much do your results depend on the specific analytical choices you made? Explore the full space of defensible specifications.
When to Use Specification Curve Analysis
Use specification curve analysis whenever your analysis involves choices that could plausibly affect your results — different outcome measures, control variable sets, sample restrictions, functional forms, or standard error specifications. It is especially valuable for observational studies where there is no single "correct" specification, and for pre-registered studies where you want to demonstrate that your headline result is not an artifact of one particular set of choices.
Why It Matters
When a paper reports a single specification, readers have no way of knowing whether that combination of choices is the one that happened to produce the most favorable result. Specification curve analysis makes this transparent by showing all defensible specifications simultaneously, letting readers judge whether the finding is robust or fragile. It is increasingly expected in top journals as a complement to pre-registration and sensitivity analysis.
The Problem of Researcher Degrees of Freedom
Every empirical analysis involves dozens of choices. Which sample do you use? Which control variables do you include? How do you define the outcome? Do you winsorize outliers at the 1st or 5th percentile? Do you use OLS or a nonlinear model? Do you cluster standard errors at the state level or the county level?
Each of these choices is defensible. But different choices lead to different results. And when a paper reports a single specification — the one the authors settled on after weeks of analysis — the reader has no way of knowing whether that particular combination of choices is the one that happened to produce the most favorable result.
This selectivity is the problem of researcher degrees of freedom. It is distinct from outright fraud or even intentional p-hacking. It arises because the space of reasonable analytic choices is large, and reporting only one point in that space is inherently selective.
Specification curve analysis is the systematic method for exploring the entire space of defensible choices and presenting the results transparently (Simonsohn et al., 2020).
Two Frameworks, One Idea
Specification Curve Analysis (SCA)
Simonsohn et al. (2020) introduced SCA as a method for assessing the robustness of a finding across all reasonable specifications. The key insight: rather than presenting one "preferred" specification with a few robustness checks in an appendix, run every defensible specification and show the full distribution of results.
Multiverse Analysis
Steegen et al. (2016) proposed multiverse analysis, which emphasizes the data processing choices that precede statistical analysis — how variables are constructed, how missing data are handled, how the sample is defined.
How It Differs from Sensitivity Analysis for Unobservables
It is important to distinguish specification curve analysis from the sensitivity analysis covered in the sensitivity analysis practice page. They address different questions:
| | Specification Curve / Multiverse | Sensitivity Analysis (Oster, Cinelli-Hazlett) |
|---|---|---|
| Asks | "Is the result robust across defensible analytic choices?" | "How strong would an unobserved confounder need to be to explain away the result?" |
| Source of concern | Researcher degrees of freedom | Omitted variable bias |
| Varies | Observed specification choices | Hypothetical unobserved variables |
| Output | Distribution of estimates across specifications | Robustness values, bias-adjusted estimates |
A result can be robust to specification choices but fragile to unobserved confounding (or vice versa). Ideally, both dimensions are assessed.
Building a Specification Curve: Step by Step
Step 1: Define the Analytic Universe
List all defensible choices along each dimension. "Defensible" means a reasonable researcher could justify this choice on methodological grounds — not that you personally prefer it.
| Dimension | Example Options |
|---|---|
| Outcome | Log earnings, level earnings, employment indicator |
| Sample | Full sample, ages 25–55 only, drop outliers above 99th pctile |
| Controls | None, demographics only, demographics + baseline outcomes |
| Functional form | OLS, Poisson, log-linear |
| Fixed effects | None, year FE, state FE, state-by-year FE |
| Standard errors | Robust (HC1), clustered by state, clustered by individual |
If you have 3 outcome definitions, 3 sample definitions, 3 control sets, 2 functional forms, 4 FE structures, and 2 SE choices, that combination is 3 × 3 × 3 × 2 × 4 × 2 = 432 specifications.
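The arithmetic can be checked directly in R by enumerating the universe; the dimension names below are illustrative placeholders, not variables from the later code:

```r
# Enumerate every combination of analytic choices
universe <- expand.grid(
  outcome  = 1:3,  # three outcome definitions
  sample   = 1:3,  # three sample definitions
  controls = 1:3,  # three control sets
  form     = 1:2,  # two functional forms
  fe       = 1:4,  # four fixed-effects structures
  se       = 1:2   # two standard-error choices
)
nrow(universe)  # 432
```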
Step 2: Run All Specifications
This step is computationally intensive but straightforward. Loop over all combinations, estimate each model, and store the coefficient of interest, its standard error, and the p-value. Keep track of which choices produced each result.
Step 3: Plot the Specification Curve
The canonical specification curve figure has two panels:
Top panel: Point estimates (and optionally confidence intervals) sorted from smallest to largest. A horizontal line at zero shows where effects flip sign.
Bottom panel: A grid showing which analytic choices were active for each specification. Each row corresponds to one dimension (outcome, controls, sample, etc.), and dots or highlighting indicates the active choice. This layout lets readers see which choices drive the variation in results.
The bottom panel is often more informative than the top. It reveals patterns like: "the estimate is always large when we use log earnings and always small when we use levels" — which tells you something important about the nature of the result.
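Patterns like this can also be checked numerically. A minimal sketch, assuming a results data frame with one row per specification and columns `estimate`, `p.value`, and `outcome` (the shape produced by the manual loop in the code section below):

```r
library(dplyr)

# Summarise estimates within each level of one analytic dimension
# to see which choices drive the variation in results
results %>%
  group_by(outcome) %>%
  summarise(
    median_est = median(estimate),      # typical effect for this choice
    share_sig  = mean(p.value < 0.05)   # share of significant specs
  )
```

Repeating this for each dimension (controls, sample, fixed effects) gives a quick numerical counterpart to the bottom panel.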
Step 4: Joint Inference
Simonsohn et al. (2020) propose a permutation-based joint test. Under the null of no effect:
- Randomly permute the treatment variable
- Re-run all specifications on the permuted data
- Compute a summary statistic (e.g., the median estimate across specs, or the share of specs with positive estimates)
- Repeat many times to build a null distribution
- Compare the observed summary statistic to this distribution
This joint test asks: "Is the overall pattern of results across all specifications more extreme than what you would expect by chance alone?"
In words: individual specification-level p-values suffer from multiple testing. The joint test instead asks whether the overall pattern across all specifications is consistent with the null, treating the entire specification curve as the unit of inference.
Suppose you run 432 specifications. Even under the null of no true effect, you expect about 22 specifications (5% of 432) to be significant at the 5% level. Simply reporting "78% of specifications are significant" is meaningless without a reference distribution — a point closely related to the multiple testing problem.
The permutation-based joint test constructs this reference distribution. For each permutation of the treatment variable, you compute all 432 specifications and record summary statistics (median estimate, share significant, share positive). The distribution of these summary statistics under the null tells you what to expect by chance. If 78% of your actual specifications are significant but the null distribution says you would expect at most 15%, the joint test rejects.
This approach sidesteps the multiple testing problem entirely, because the unit of inference is the curve, not any individual specification.
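The same permutation logic works for any summary statistic of the curve. A hedged sketch using the share of significant specifications, assuming the `run_specs()` helper and data frame `df` from the code section below:

```r
# Null distribution of the *share significant* summary statistic.
# run_specs() returns one row per specification with a p.value column.
set.seed(42)
null_share_sig <- replicate(1000, {
  df_perm <- df
  df_perm$treatment <- sample(df_perm$treatment)  # break treatment assignment
  mean(run_specs(df_perm)$p.value < 0.05)
})

observed_share <- mean(run_specs(df)$p.value < 0.05)
# One-sided p-value: how often does a null curve look this "significant"?
p_share <- mean(null_share_sig >= observed_share)
```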
How to Do It: Code
# --- Option 1: Using the specr package (>= 1.0 API) ---
# specr automates specification curve construction and plotting
library(specr)
# --- Step 1: Prepare the data ---
# Create a subgroup variable for subsetting specifications
df$age_group <- ifelse(df$age >= 25 & df$age <= 55, "prime_age", "other")
# --- Step 2: Define the analytic universe ---
# setup() enumerates all combinations of outcomes, controls, and subsets
specs <- setup(
  data = df,
  y = c("earnings", "log_earnings"),          # two outcome definitions
  x = "treatment",                            # treatment variable (fixed)
  model = c("lm"),                            # model type(s)
  controls = c(
    "age + gender",                           # minimal controls
    "age + gender + education",               # moderate controls
    "age + gender + education + baseline_earnings"  # full controls
  ),
  subsets = list(age_group = c("prime_age"))  # optional subsample
)
# --- Step 3: Run all specifications and inspect results ---
# specr() estimates each combination and collects coefficients + SEs
results <- specr(specs)
summary(results) # shows distribution of estimates across specs
# --- Step 4: Plot the specification curve ---
# patchwork arranges the estimate panel and choice-indicator panel
library(patchwork)
plot(results) # top = sorted estimates, bottom = active choices
# ============================================================
# --- Option 2: Manual loop + joint permutation test ---
# Use this when you need full control or specr does not fit your design
library(broom); library(dplyr); library(purrr)
# --- Step 1: Define each dimension of the analytic universe ---
# Outcomes: different ways to measure the dependent variable
outcomes <- c("earnings", "log_earnings", "employed")
# Control sets: from no controls to a full covariate set
control_sets <- list(
  none = character(0),
  demo = c("age", "gender"),
  full = c("age", "gender", "education", "baseline_earnings")
)
# Sample definitions: full sample vs. prime-age restriction
samples <- list(
  full  = function(d) d,
  prime = function(d) filter(d, age >= 25, age <= 55)
)
# --- Step 2: Run every combination and extract treatment coefficient ---
run_specs <- function(data) {
  # expand.grid creates one row per specification
  specs <- expand.grid(
    outcome = outcomes,
    controls = names(control_sets),
    sample = names(samples),
    stringsAsFactors = FALSE
  )
  # Iterate over specifications and collect tidy regression output
  map_dfr(seq_len(nrow(specs)), function(i) {
    s <- specs[i, ]
    d <- samples[[s$sample]](data)
    covs <- control_sets[[s$controls]]
    # Build the regression formula dynamically
    fml <- reformulate(c("treatment", covs), response = s$outcome)
    fit <- lm(fml, data = d)
    # Extract only the treatment row from the coefficient table
    tidy(fit) %>%
      filter(term == "treatment") %>%
      mutate(outcome = s$outcome, controls = s$controls, sample = s$sample)
  })
}
# Run all specifications on the observed data
observed <- run_specs(df)
# --- Step 3: Joint permutation test (Simonsohn et al. 2020) ---
# Randomly permute treatment to build a null distribution
# of the median estimate across all specifications
set.seed(42)
null_medians <- replicate(1000, {
  df_perm <- df
  df_perm$treatment <- sample(df_perm$treatment)  # break treatment assignment
  perm_results <- run_specs(df_perm)
  median(perm_results$estimate)                   # summary statistic under the null
})
# p-value: fraction of null medians as extreme as the observed median
p_joint <- mean(abs(null_medians) >= abs(median(observed$estimate)))
How to Report a Specification Curve
A well-reported specification curve section includes:
- A clear description of the analytic universe. What choices did you vary, and why is each option defensible?
- The total number of specifications.
- The specification curve figure with both the estimate panel and the choice indicator panel.
- Summary statistics. What fraction of specifications produce a positive estimate? What fraction are significant? What is the median estimate and its interquartile range?
- Joint inference. The p-value from the permutation test.
- Discussion of what drives variation. Which choices matter most?
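The summary statistics in such a write-up can be computed in a few lines. A sketch assuming the `observed` data frame produced by `run_specs()` in the code section above, with columns `estimate` and `p.value`:

```r
library(dplyr)

# Summary statistics for reporting the specification curve
observed %>%
  summarise(
    n_specs        = n(),                       # total specifications
    share_positive = mean(estimate > 0),        # fraction positive
    share_sig      = mean(p.value < 0.05),      # fraction significant
    median_est     = median(estimate),          # central estimate
    iqr_lo         = quantile(estimate, 0.25),  # interquartile range
    iqr_hi         = quantile(estimate, 0.75)
  )
```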
Example:
Figure 3 displays the specification curve across 432 specifications varying the outcome definition (3 options), sample restrictions (3 options), control variables (3 options), functional form (2 options), fixed effects (4 options), and standard error clustering (2 options). Across all specifications, 94% produce positive estimates and 78% are statistically significant at the 5% level. The median estimate is 0.12 (IQR: 0.08–0.18). The joint permutation test, based on 1,000 permutations, rejects the null of no effect (p < 0.001). The primary source of variation is the outcome definition: specifications using log earnings yield systematically larger effects than those using levels.
Concept Check
Your specification curve shows that 95% of specifications produce a statistically significant positive effect of a job training program. However, all specifications use the same observational data with no instrument or experiment. A colleague argues that the result is 'clearly causal' because it is robust across so many specifications. Is this reasoning correct?
Paper Library
Foundational (5)
Leamer, E. E. (1983). Let's Take the Con Out of Econometrics.
Leamer's classic paper argues that the sensitivity of empirical results to specification choices undermines the credibility of econometric evidence. He proposes extreme bounds analysis, an early form of systematic robustness testing that anticipates modern specification curve analysis by several decades.
Munafo, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A Manifesto for Reproducible Science.
Munafo, Nosek, and colleagues identify threats to reproducible science and propose a broad reform agenda spanning methods, reporting, reproducibility practices, evaluation, and incentives. The article is a general reproducibility manifesto that provides the broader scientific reform context motivating robustness-analysis approaches.
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification Curve Analysis.
Simonsohn, Simmons, and Nelson introduce specification curve analysis, which systematically runs all reasonable specifications of a model and displays the distribution of estimates. This approach replaces selective reporting of specifications with a comprehensive view of how results depend on analytical choices.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis.
Steegen and colleagues introduce multiverse analysis, which examines how results vary across the full set of defensible data processing and analytical decisions. This approach is closely related to specification curve analysis and emphasizes transparency about the garden of forking paths in data analysis.
Young, C., & Holsteen, K. (2017). Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.
Young and Holsteen develop a computational framework for systematically exploring model uncertainty by running thousands of plausible specifications. Their approach is one of the earliest implementations of what would become known as specification curve or multiverse analysis, applied to sociological research.
Application (4)
Goldfarb, B., & King, A. A. (2016). Scientific Apophenia in Strategic Management Research: Significance Tests & Mistaken Inference.
Goldfarb and King use distributional matching and posterior predictive checks to estimate that 24-40% of significant coefficients in strategic management research would become insignificant if studies were repeated. They document the problem of apophenia (finding patterns in noise) and offer practical suggestions for reducing false and inflated findings at both the individual and field level.
Masicampo, E. J., & Lalande, D. (2012). A Peculiar Prevalence of p Values Just Below .05.
Masicampo and Lalande document a suspicious clustering of p-values just below the .05 threshold in psychology journals, providing empirical evidence of publication bias and researcher degrees of freedom. They discuss potential sources of this pattern and its implications for the credibility of published findings in the social sciences.
Orben, A., & Przybylski, A. K. (2019). The Association between Adolescent Well-Being and Digital Technology Use.
Orben and Przybylski apply specification curve analysis to the hotly debated question of whether digital technology use harms adolescent well-being, running over 20,000 specifications across three large datasets. They find that technology use has a negligible negative association with well-being, far smaller than commonly assumed, demonstrating how specification curve analysis can bring clarity to contested empirical questions by mapping the full space of defensible analytical choices.
Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis.
Rohrer, Egloff, and Schmukle apply specification curve analysis to the long-debated question of whether birth order affects personality traits. By running all defensible specifications, they show that most previously reported birth-order effects disappear, demonstrating the method's power to resolve contested empirical questions.