MethodAtlas
Practice·Estimation Stage·8 min read

Multiple Hypothesis Testing

Testing many hypotheses inflates false positives. Bonferroni, Holm, BH-FDR, and Romano-Wolf corrections.

When to Use Multiple Testing Corrections

Apply multiple testing corrections whenever your analysis involves more than one hypothesis test within a family of related tests — multiple outcomes, multiple subgroups, multiple specifications, or multiple time horizons addressing the same research question. The correction is needed regardless of whether the tests were pre-registered. Common settings include: multi-arm experiments, heterogeneity analyses, event study pre-trend tests, and specification curve analyses.


The Problem of Too Many Tests

Imagine you run an experiment evaluating a tutoring program. You measure its effect on math scores, reading scores, attendance, behavior, self-confidence, parental involvement, peer relationships, and teacher ratings. That list is eight outcomes. Even if the program does absolutely nothing, with eight independent tests at the 5% level, you have roughly a 34% chance of finding at least one "significant" result.

If this probability does not alarm you, it should.

This inflation is the multiple testing problem, and it is everywhere in empirical research. Any time you test more than one hypothesis — multiple outcomes, multiple subgroups, multiple specifications, multiple time periods — the probability of finding at least one spurious "significant" result rises rapidly with the number of tests. Without correction, your paper's headline finding might be nothing more than a statistical accident.


Why Testing Many Hypotheses Inflates False Positives

Under the null hypothesis of no effect, each individual test has a 5% probability of a false positive (a Type I error). For $m$ independent tests, the probability of at least one false positive — the familywise error rate (FWER) — is:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

For $\alpha = 0.05$:

| Number of tests ($m$) | Probability of at least one false positive |
|---|---|
| 1 | 5% |
| 5 | 23% |
| 10 | 40% |
| 20 | 64% |
| 50 | 92% |
| 100 | 99.4% |

With 20 tests, you are more likely than not to find a false positive. With 100 tests, it is virtually certain.
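The arithmetic behind the table is easy to reproduce. A one-line R function gives the familywise error rate for any number of independent tests:

```r
# FWER = 1 - (1 - alpha)^m for m independent tests at level alpha
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

round(fwer(c(1, 5, 10, 20, 50, 100)), 3)
# 0.050 0.226 0.401 0.642 0.923 0.994
```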


Two Philosophies: FWER vs. FDR

Before choosing a correction method, you need to decide what you are trying to control.

Familywise Error Rate (FWER)

Control the probability of making even one false positive across all tests. FWER control is the strictest standard. You use FWER control when even a single false claim is unacceptable — clinical trials, policy decisions, confirmatory analyses.

False Discovery Rate (FDR)

Control the expected proportion of false positives among all rejected hypotheses. If you reject 20 hypotheses at an FDR level of $q = 0.05$, you expect about 1 of those 20 to be a false discovery. FDR control is less conservative and more powerful. You use FDR control when you are doing exploratory analysis, testing many outcomes, or when some false positives are tolerable.


The Methods

1. Bonferroni Correction

The simplest approach. Divide your significance threshold by the number of tests:

$$\alpha_{\text{adj}} = \frac{\alpha}{m}$$

Equivalently, multiply each p-value by $m$ and compare to $\alpha$.

Pros: Dead simple. Valid under any dependence structure.
Cons: Extremely conservative, especially with many correlated tests. If your 20 outcomes all measure similar things (and are thus correlated), Bonferroni overcorrects badly.
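The two formulations of Bonferroni are interchangeable, as a quick R check confirms:

```r
p <- c(0.003, 0.012, 0.041)
m <- length(p)

p <= 0.05 / m            # compare raw p-values to the adjusted threshold
pmin(p * m, 1) <= 0.05   # or scale the p-values and keep alpha = 0.05
# both give: TRUE TRUE FALSE
```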

2. Holm Step-Down Correction

A strict improvement over Bonferroni that is always at least as powerful, yet still controls FWER.

Procedure:

  1. Sort your $m$ p-values from smallest to largest: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
  2. Starting from the smallest, reject $H_{(k)}$ if $p_{(k)} \leq \frac{\alpha}{m - k + 1}$
  3. Stop at the first non-rejection; do not reject any remaining hypotheses.

The threshold becomes less stringent as you work through the sorted list, giving you more power than Bonferroni.
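In practice `p.adjust(p, method = "holm")` does this for you, but the procedure is short enough to spell out. The sketch below is illustrative, not a replacement:

```r
# Hand-rolled Holm step-down, for illustration only;
# in practice use p.adjust(p, method = "holm")
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                          # indices, smallest p first
  thresholds <- alpha / (m - seq_len(m) + 1)
  pass <- p[ord] <= thresholds
  # stop at the first non-rejection
  k <- if (all(pass)) m else which(!pass)[1] - 1
  reject <- rep(FALSE, m)
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE
  reject
}

p <- c(0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430)
holm_reject(p)
# TRUE FALSE FALSE FALSE FALSE FALSE FALSE (only the smallest p survives)
```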

3. Benjamini-Hochberg (BH) FDR Correction

Controls the false discovery rate rather than the familywise error rate (Benjamini & Hochberg, 1995).

Procedure:

  1. Sort p-values from smallest to largest: $p_{(1)} \leq \cdots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot q$, where $q$ is the desired FDR level
  3. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$

The procedure is substantially more powerful than Bonferroni or Holm, but it controls a weaker error rate.
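Again, `p.adjust(p, method = "BH")` handles this for you, but a direct implementation of the step-up rule is instructive:

```r
# Hand-rolled Benjamini-Hochberg, for illustration only;
# in practice use p.adjust(p, method = "BH")
bh_reject <- function(p, q = 0.05) {
  m <- length(p)
  ord <- order(p)
  # largest k with p_(k) <= (k/m) * q
  pass <- p[ord] <= seq_len(m) / m * q
  k <- if (any(pass)) max(which(pass)) else 0
  reject <- rep(FALSE, m)
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE
  reject
}

p <- c(0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430)
bh_reject(p)
# TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```

Note that BH rejects two hypotheses here where Holm rejects only one: the price of controlling the weaker error rate is paid back in power.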

4. Anderson (2008) Sharpened q-Values

Anderson (2008) adapted the BH procedure for the economics context, producing "sharpened" q-values that account for the share of true null hypotheses in the family of tests.

The key insight: the standard BH procedure assumes all nulls could be true. If you can estimate the share of true nulls ($\pi_0$), you can sharpen the threshold and gain power. Anderson provides a Stata do-file and algorithm for computing these sharpened q-values.
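Anderson's q-values build on the two-stage adaptive procedure of Benjamini, Krieger, and Yekutieli (2006): a first BH pass estimates the number of true nulls, then BH is re-run at a sharpened level. The sketch below shows only that core idea; the function names are invented for illustration, and this is not Anderson's actual algorithm:

```r
# Two-stage sharpened FDR control (Benjamini, Krieger & Yekutieli 2006).
# A minimal sketch of the idea behind Anderson's sharpened q-values,
# not a substitute for his Stata do-file.
bh_k <- function(p, level) {
  m <- length(p)
  pass <- sort(p) <= seq_len(m) / m * level
  if (any(pass)) max(which(pass)) else 0
}

sharpened_reject <- function(p, q = 0.05) {
  m  <- length(p)
  q1 <- q / (1 + q)                 # stage 1: BH at q / (1 + q)
  r1 <- bh_k(p, q1)
  if (r1 == 0) return(rep(FALSE, m))
  m0_hat <- m - r1                  # estimated number of true nulls
  if (m0_hat == 0) return(rep(TRUE, m))
  k <- bh_k(p, q1 * m / m0_hat)     # stage 2: BH at the sharpened level
  reject <- rep(FALSE, m)
  reject[order(p)[seq_len(k)]] <- TRUE
  reject
}
```

The fewer true nulls the first stage suggests, the more generous the second-stage threshold becomes; with $\pi_0$ close to 1 the procedure collapses back to ordinary BH.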

5. Romano-Wolf Stepdown

The most widely recommended method for FWER control in applied economics. Romano and Wolf (2005) developed a stepdown procedure that uses resampling (bootstrap or randomization) to capture the joint dependence structure of the test statistics.

Why it matters: Unlike Bonferroni and Holm, Romano-Wolf accounts for the correlation between test statistics. If your outcomes are highly correlated (as is typical), the effective number of independent tests is much smaller than the nominal count, and Romano-Wolf exploits this to give you more power.

Procedure (simplified):

  1. Compute all test statistics from the original data
  2. Bootstrap (or permute) the data many times, recomputing all test statistics each time
  3. Use the joint distribution of the bootstrapped statistics to compute adjusted p-values
  4. Apply a stepdown algorithm that sequentially removes rejected hypotheses

How to Do It: Code

# Raw p-values from multiple tests
p_values <- c(0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430)

# Bonferroni
p.adjust(p_values, method = "bonferroni")

# Holm
p.adjust(p_values, method = "holm")

# Benjamini-Hochberg FDR
p.adjust(p_values, method = "BH")

# Compare all methods
data.frame(
  raw = p_values,
  bonferroni = p.adjust(p_values, method = "bonferroni"),
  holm = p.adjust(p_values, method = "holm"),
  bh = p.adjust(p_values, method = "BH")
)

Romano-Wolf Stepdown

Anderson's sharpened q-values are computed with the Stata do-file he provides; in R, the wildrwolf package implements the Romano-Wolf correction:

# Requires: wildrwolf (install.packages("wildrwolf"))
library(wildrwolf)

# Romano-Wolf stepdown adjusted p-values
# Fit models for each outcome
models <- list(
  lm(outcome1 ~ treatment + x1 + x2, data = df),
  lm(outcome2 ~ treatment + x1 + x2, data = df),
  lm(outcome3 ~ treatment + x1 + x2, data = df),
  lm(outcome4 ~ treatment + x1 + x2, data = df)
)

# Romano-Wolf adjusted p-values via wild bootstrap
rw_result <- rwolf(
  models = models,
  param = "treatment",
  B = 1000,
  seed = 12345
)
print(rw_result)



How to Report Multiple Testing Corrections

A good reporting practice includes:

  1. State how many tests you run and how you group them into families.
  2. Report both unadjusted and adjusted p-values (or q-values) side by side.
  3. Specify the method and why you chose it (FWER vs. FDR, accounting for correlation or not).
  4. Define families clearly. Not every test in your paper needs to be in the same family. Group tests that address the same question or the same set of outcomes. Primary outcomes are one family; secondary outcomes may be another; subgroup analyses may be a third. When designing your study, a pre-analysis plan that specifies family groupings in advance adds credibility.

Example table format:

| Outcome | Coefficient | SE | Raw p-value | Romano-Wolf p-value | BH q-value |
|---|---|---|---|---|---|
| Math scores | 0.15 | 0.05 | 0.003 | 0.012 | 0.009 |
| Reading scores | 0.08 | 0.04 | 0.041 | 0.089 | 0.054 |
| Attendance | 0.03 | 0.02 | 0.089 | 0.210 | 0.104 |


Concept Check

You test the effect of a program on 10 independent outcomes, all of which are truly zero. You use alpha = 0.05 with no correction. What is the approximate probability of finding at least one significant result?


Paper Library

Foundational (6)

Anderson, M. L. (2008). Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.

Journal of the American Statistical Association. DOI: 10.1198/016214508000000841

Anderson proposes using summary index tests and familywise error rate corrections to address multiple inference in program evaluation. Reanalyzing the Abecedarian, Perry Preschool, and Early Training Projects, he finds that girls garner substantial short- and long-term benefits from early interventions, but there are no significant long-term benefits for boys after correcting for multiple testing.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Journal of the Royal Statistical Society: Series B. DOI: 10.1111/j.2517-6161.1995.tb02031.x

Benjamini and Hochberg introduce the false discovery rate (FDR) as an alternative to family-wise error rate control. Their step-up procedure for controlling FDR is less conservative than Bonferroni while still providing meaningful protection against false positives, and has become the standard in many fields.

Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità.

Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze

Bonferroni develops the classical correction for multiple comparisons, which controls the family-wise error rate by dividing the significance level by the number of tests. While conservative, the Bonferroni correction remains widely used due to its simplicity and broad applicability.

Clarke, D., Romano, J. P., & Wolf, M. (2020). The Romano-Wolf Multiple-Hypothesis Correction in Stata.

Clarke, Romano, and Wolf develop a Stata implementation of the Romano-Wolf stepwise multiple testing correction, which controls the family-wise error rate while accounting for the dependence structure among test statistics via resampling. This correction is more powerful than Bonferroni or Holm procedures when test statistics are correlated, which is the typical case in applied research with related outcomes. The rwolf command provides applied researchers with an accessible tool for rigorous multiple hypothesis testing.

Romano, J. P., & Wolf, M. (2005). Stepwise Multiple Testing as Formalized Data Snooping.

Romano and Wolf develop a stepwise multiple testing procedure that controls the family-wise error rate while being less conservative than Bonferroni by resampling from the joint distribution of test statistics. Their method accounts for the correlation structure among tests and is widely used in economics.

Westfall, P. H., & Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment.

Wiley

Westfall and Young develop resampling-based methods for multiple testing that account for the dependence structure among test statistics. Their permutation-based step-down procedure is less conservative than Bonferroni and becomes a standard reference for multiple testing adjustments in applied research.

Application (2)

Casey, K., Glennerster, R., & Miguel, E. (2012). Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.

Quarterly Journal of Economics. DOI: 10.1093/qje/qje027

Casey, Glennerster, and Miguel pre-registered their analysis plan for a community-driven development program in Sierra Leone and apply multiple testing corrections (including the Westfall-Young step-down procedure and family-wise error rate adjustments) across outcome families. This paper is one of the most prominent examples of rigorous multiple testing adjustment in a field experiment, demonstrating that many individually significant effects lose significance after correction.

Haushofer, J., & Shapiro, J. (2016). The Short-Term Impact of Unconditional Cash Transfers to the Poor: Experimental Evidence from Kenya.

Quarterly Journal of Economics. DOI: 10.1093/qje/qjw025

Haushofer and Shapiro evaluate GiveDirectly's unconditional cash transfer program in Kenya, testing effects across many outcome domains including consumption, assets, food security, health, and psychological well-being. They apply FWER corrections with bootstrapped p-values across outcome families, providing a model for how to handle multiple testing transparently in large-scale randomized evaluations. A 2017 erratum (QJE 132(4): 2057–2060) corrected the FWER-adjusted p-values in Tables I and II, which had used insufficient bootstrap iterations.

Survey (1)

List, J. A., Shaikh, A. M., & Xu, Y. (2019). Multiple Hypothesis Testing in Experimental Economics.

Experimental Economics. DOI: 10.1007/s10683-018-09597-5

List, Shaikh, and Xu provide practical guidance on addressing multiple hypothesis testing in experimental economics. They compare various correction methods including Bonferroni, Holm, and FDR procedures, and demonstrate their application to field experiments with multiple outcome variables.