MethodAtlas
Estimation/Reporting Stage

Multiple Hypothesis Testing

Testing many hypotheses inflates false positives. Bonferroni, Holm, BH-FDR, and Romano-Wolf corrections.

When to Use Multiple Testing Corrections

Apply multiple testing corrections whenever your analysis involves more than one hypothesis test — multiple outcomes, multiple subgroups, multiple specifications, or multiple time periods. The correction is needed regardless of whether the tests were pre-registered. Common settings include: multi-arm experiments, heterogeneity analyses, event study pre-trend tests, and specification curve analyses.


The Problem of Too Many Tests

Imagine you run an experiment evaluating a tutoring program. You measure its effect on math scores, reading scores, attendance, behavior, self-confidence, parental involvement, peer relationships, and teacher ratings. That list is eight outcomes. Even if the program does absolutely nothing, with eight independent tests at the 5% level, you have roughly a 34% chance of finding at least one "significant" result.

If this probability does not alarm you, it should.

This inflation is the multiple testing problem, and it is everywhere in empirical research. Any time you test more than one hypothesis — multiple outcomes, multiple subgroups, multiple specifications, multiple time periods — the probability of finding at least one spurious "significant" result rises rapidly with the number of tests. Without correction, your paper's headline finding might be nothing more than a statistical accident.


Why Testing Many Hypotheses Inflates False Positives

Under the null hypothesis of no effect, each individual test has a 5% probability of a false positive (a Type I error). For m independent tests, the probability of at least one false positive — the familywise error rate (FWER) — is:

\text{FWER} = 1 - (1 - \alpha)^m

For α = 0.05:

Number of tests (m)    Probability of at least one false positive
1                      5%
5                      23%
10                     40%
20                     64%
50                     92%
100                    99.4%

With 20 tests, you are more likely than not to find a false positive. With 100 tests, it is virtually certain.
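The numbers in the table follow directly from the FWER formula. A minimal sketch (Python here for illustration; the section's worked examples use R):

```python
# FWER for m independent tests, each run at significance level alpha:
# FWER = 1 - (1 - alpha)^m
def fwer(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20, 50, 100):
    print(f"{m:>3} tests: FWER = {fwer(m):.1%}")
```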


Two Philosophies: FWER vs. FDR

Before choosing a correction method, you need to decide what you are trying to control.

Familywise Error Rate (FWER)

Control the probability of making even one false positive across all tests. FWER control is the strictest standard. You use FWER control when even a single false claim is unacceptable — clinical trials, policy decisions, confirmatory analyses.

False Discovery Rate (FDR)

Control the expected proportion of false positives among all rejected hypotheses. If you reject 20 hypotheses and allow FDR = 0.05, you expect about 1 of those 20 to be a false discovery. FDR is less conservative and more powerful. You use FDR control when you are doing exploratory analysis, testing many outcomes, or when some false positives are tolerable.


The Methods

1. Bonferroni Correction

The simplest approach. Divide your significance threshold by the number of tests:

\alpha_{\text{adj}} = \frac{\alpha}{m}

Equivalently, multiply each p-value by m and compare to α.

Pros: Dead simple. Valid under any dependence structure.
Cons: Extremely conservative, especially with many correlated tests. If your 20 outcomes are all measuring similar things (and thus correlated), Bonferroni overcorrects badly.
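The adjustment itself is one line. A sketch of the arithmetic (Python; the raw p-values are illustrative):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

raw = [0.003, 0.012, 0.041, 0.052]   # illustrative raw p-values
print(bonferroni(raw))               # with m = 4, each p-value is quadrupled
```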

2. Holm Step-Down Correction

A strict improvement over Bonferroni that is always at least as powerful, yet still controls FWER.

Procedure:

  1. Sort your m p-values from smallest to largest: p_(1) ≤ p_(2) ≤ … ≤ p_(m)
  2. Starting from the smallest, reject H_(k) if p_(k) ≤ α / (m − k + 1)
  3. Stop at the first non-rejection; do not reject any remaining hypotheses.

The threshold becomes less stringent as you work through the sorted list, giving you more power than Bonferroni.
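The three steps above can be sketched as follows (Python, illustrative):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: returns reject/keep decisions in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):                   # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):             # threshold alpha / (m - k + 1), k = rank + 1
            reject[i] = True
        else:
            break                                      # first non-rejection: stop
    return reject

print(holm([0.041, 0.003, 0.012, 0.30]))
# 0.003 passes 0.05/4, 0.012 passes 0.05/3, but 0.041 fails 0.05/2 = 0.025
```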

3. Benjamini-Hochberg (BH) FDR Correction

Controls the false discovery rate rather than the familywise error rate.

(Benjamini & Hochberg, 1995)

Procedure:

  1. Sort p-values from smallest to largest: p_(1) ≤ … ≤ p_(m)
  2. Find the largest k such that p_(k) ≤ (k/m) · q, where q is the desired FDR level
  3. Reject all hypotheses H_(1), …, H_(k)

The BH procedure is substantially more powerful than Bonferroni or Holm, but it controls a weaker error rate.
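The step-up rule can be sketched as follows (Python, illustrative; R's p.adjust with method = "BH" yields the same rejections via adjusted p-values):

```python
def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject all hypotheses up to the largest k
    with p_(k) <= (k/m) * q; returns decisions in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * q:
            k_max = k                        # remember the largest passing k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

print(bh_reject([0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430]))
# only 0.003 and 0.012 clear their thresholds (1/7)*0.05 and (2/7)*0.05
```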

4. Anderson (2008) Sharpened q-Values

Anderson (2008) adapted the BH procedure for the economics context, producing "sharpened" q-values that account for the share of true null hypotheses in the family of tests.

(Anderson, 2008)

The key insight: the standard BH procedure assumes all nulls could be true. If you can estimate the share of true nulls (π₀), you can sharpen the threshold and gain power. Anderson provides a Stata do-file and algorithm for computing these sharpened q-values.
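Anderson's do-file is the canonical implementation. As an illustrative re-implementation (not his code), each sharpened q-value can be computed as the smallest FDR level q at which the two-stage procedure of Benjamini, Krieger, and Yekutieli (2006), whose first stage estimates the share of true nulls, rejects the hypothesis:

```python
def bh_count(sorted_p, level):
    """Number of BH rejections at a given level (input: sorted p-values)."""
    m = len(sorted_p)
    k_max = 0
    for k, p in enumerate(sorted_p, start=1):
        if p <= k / m * level:
            k_max = k
    return k_max

def sharpened_qvalues(pvals):
    """Sharpened q-value per hypothesis: the smallest q (on a 0.001 grid)
    at which the two-stage BKY procedure rejects it. Illustrative sketch."""
    m = len(pvals)
    sp = sorted(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [1.0] * m
    for i in range(1000, 0, -1):                       # q = 1.000 down to 0.001
        q = i / 1000
        q1 = q / (1 + q)                               # stage 1: BH at q / (1 + q)
        r1 = bh_count(sp, q1)
        if r1 == 0:
            continue
        q2 = 1.0 if r1 == m else q1 * m / (m - r1)     # stage 2: rescale by estimated true-null share
        r2 = bh_count(sp, q2)
        for j in order[:r2]:
            qvals[j] = q                               # overwritten as q shrinks: ends at smallest rejecting q
    return qvals

print(sharpened_qvalues([0.003, 0.012, 0.041, 0.052]))
```

Because the grid search reruns BH for every q, this sketch is O(1000·m log m); Anderson's do-file follows the same grid-search logic.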

5. Romano-Wolf Stepdown

Among the most widely recommended methods for FWER control in applied economics. Romano and Wolf (2005) developed a stepdown procedure that uses resampling (bootstrap or randomization) to capture the joint dependence structure of the test statistics.

(Romano & Wolf, 2005)

Why it matters: Unlike Bonferroni and Holm, Romano-Wolf accounts for the correlation between test statistics. If your outcomes are highly correlated (as is typical), the effective number of independent tests is much smaller than the nominal count, and Romano-Wolf exploits this to give you more power.

Procedure (simplified):

  1. Compute all test statistics from the original data
  2. Bootstrap (or permute) the data many times, recomputing all test statistics each time
  3. Use the joint distribution of the bootstrapped statistics to compute adjusted p-values
  4. Apply a stepdown algorithm that sequentially removes rejected hypotheses
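The stepdown logic can be sketched as follows. This is a simplified illustration in the spirit of Westfall-Young / Romano-Wolf, not the published algorithm: it takes the observed statistics and a matrix of null-bootstrap statistics as given, whereas in real applications the bootstrap draws must be generated so that all nulls hold (e.g. by recentering outcomes), which the rwolf Stata command of Clarke, Romano, and Wolf (2020) automates.

```python
def stepdown_adjusted_pvalues(t_obs, t_boot):
    """Stepdown max-statistic adjusted p-values (simplified sketch).
    t_obs: m observed absolute test statistics.
    t_boot: B bootstrap draws, each a list of m statistics computed under the null."""
    m = len(t_obs)
    B = len(t_boot)
    order = sorted(range(m), key=lambda i: -t_obs[i])    # most significant first
    adj = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        remaining = order[step:]                         # hypotheses not yet stepped past
        # share of bootstrap draws whose max statistic over the remaining
        # hypotheses reaches the observed statistic: this max is what captures
        # the correlation between tests
        p = sum(max(b[j] for j in remaining) >= t_obs[i] for b in t_boot) / B
        running_max = max(running_max, p)                # enforce monotone adjusted p-values
        adj[i] = running_max
    return adj

# Tiny deterministic example: 2 hypotheses, 4 "bootstrap" draws
print(stepdown_adjusted_pvalues([3.0, 1.0],
                                [[0.5, 0.4], [3.5, 0.2], [1.2, 1.1], [0.1, 2.0]]))
# -> [0.25, 0.5]
```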

How to Do It: Code

R: standard corrections with p.adjust

# Raw p-values from multiple tests
p_values <- c(0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430)

# Bonferroni
p.adjust(p_values, method = "bonferroni")

# Holm
p.adjust(p_values, method = "holm")

# Benjamini-Hochberg FDR
p.adjust(p_values, method = "BH")

# Compare all methods
data.frame(
  raw        = p_values,
  bonferroni = p.adjust(p_values, method = "bonferroni"),
  holm       = p.adjust(p_values, method = "holm"),
  bh         = p.adjust(p_values, method = "BH")
)

Stata: Anderson sharpened q-values

* Download Michael Anderson's do-file from his website
* Input: a variable containing raw p-values
* The do-file computes sharpened q-values

* Example usage (after setting up the q-value do-file):
clear
set obs 4
gen pval = .
replace pval = 0.003 in 1
replace pval = 0.012 in 2
replace pval = 0.041 in 3
replace pval = 0.052 in 4
do "fdr_sharpened_qvalues.do"

Interactive: Watching False Discoveries Accumulate

Interactive Simulation

False Discovery Rate Simulator

Run multiple independent hypothesis tests when the true effect is zero. Watch how the number of false positives grows with the number of tests. Toggle corrections on and off to see how Bonferroni, Holm, and BH protect you.

Set the share of true effects to zero and run 20 tests. You will almost certainly find at least one "significant" result. Now turn on corrections and watch the false positives disappear.

Interactive Exercise

P-Hacking Arcade

You have a dataset where the true treatment effect is exactly zero. Try different analysis specifications to find a "significant" result (p < 0.05). Each combination of controls, sample restrictions, and outcome transformations constitutes one specification.


Why this matters: When researchers test multiple specifications and report only the "best" one, the reported p-value no longer reflects the true Type I error rate. With 18 possible specifications, the probability that at least one yields p < 0.05 under the null is 1 − (0.95)^18 ≈ 60.3%. Corrections like Bonferroni, Holm, or Benjamini-Hochberg account for this multiplicity.


How to Report Multiple Testing Corrections

A good reporting practice includes:

  1. State how many tests you run and how you group them into families.
  2. Report both unadjusted and adjusted p-values (or q-values) side by side.
  3. Specify the method and why you chose it (FWER vs. FDR, accounting for correlation or not).
  4. Define families clearly. Not every test in your paper needs to be in the same family. Group tests that address the same question or the same set of outcomes. Primary outcomes are one family; secondary outcomes may be another; subgroup analyses may be a third. When designing your study, a pre-analysis plan that specifies family groupings in advance adds credibility.

Example table format:

Outcome          Coefficient   SE     Raw p-value   Romano-Wolf p-value   BH q-value
Math scores      0.15          0.05   0.003         0.012                 0.009
Reading scores   0.08          0.04   0.041         0.089                 0.054
Attendance       0.03          0.02   0.089         0.210                 0.104


Concept Check

You test the effect of a program on 10 independent outcomes, all of which are truly zero. You use α = 0.05 with no correction. What is the approximate probability of finding at least one significant result?


Paper Library

Foundational (5)

Bonferroni, C. (1936). Teoria Statistica delle Classi e Calcolo delle Probabilita.

Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze

Bonferroni developed the classical correction for multiple comparisons, which controls the family-wise error rate by dividing the significance level by the number of tests. While conservative, the Bonferroni correction remains widely used due to its simplicity and broad applicability.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Journal of the Royal Statistical Society: Series B. DOI: 10.1111/j.2517-6161.1995.tb02031.x

Benjamini and Hochberg introduced the false discovery rate (FDR) as an alternative to family-wise error rate control. Their step-up procedure for controlling FDR is less conservative than Bonferroni while still providing meaningful protection against false positives, and has become the standard in many fields.

Romano, J. P., & Wolf, M. (2005). Stepwise Multiple Testing as Formalized Data Snooping.

Econometrica

Romano and Wolf developed a stepwise multiple testing procedure that controls the family-wise error rate while being less conservative than Bonferroni by resampling from the joint distribution of test statistics. Their method accounts for the correlation structure among tests and is widely used in economics.

Anderson, M. L. (2008). Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.

Journal of the American Statistical Association. DOI: 10.1198/016214508000000841

Anderson proposed using index tests and the Westfall-Young step-down procedure to address multiple testing in program evaluation. He demonstrated that many previously reported significant gender differences in early childhood interventions disappeared after proper multiple testing corrections.

Westfall, P. H., & Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment.

Wiley

Westfall and Young developed resampling-based methods for multiple testing that account for the dependence structure among test statistics. Their permutation-based step-down procedure is less conservative than Bonferroni and became a standard reference for multiple testing adjustments in applied research.

Application (5)

List, J. A., Shaikh, A. M., & Xu, Y. (2019). Multiple Hypothesis Testing in Experimental Economics.

Experimental Economics. DOI: 10.1007/s10683-018-09597-5

List, Shaikh, and Xu provided practical guidance on addressing multiple hypothesis testing in experimental economics. They compared various correction methods including Bonferroni, Holm, and FDR procedures, and demonstrated their application to field experiments with multiple outcome variables.

Hollenbeck, J. R., & Wright, P. M. (2017). Harking, Sharking, and Tharking: Making the Case for Post Hoc Analysis of Scientific Data.

Journal of Management. DOI: 10.1177/0149206316679487

Hollenbeck and Wright discussed the multiple testing problem in management research in the context of post hoc analyses. They argued for transparency about data exploration while maintaining statistical rigor, emphasizing the importance of adjusting for multiple comparisons when testing is exploratory.

Casey, K., Glennerster, R., & Miguel, E. (2012). Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.

Quarterly Journal of Economics. DOI: 10.1093/qje/qje027

Casey, Glennerster, and Miguel pre-registered their analysis plan for a community-driven development program in Sierra Leone and applied multiple testing corrections (including the Westfall-Young step-down procedure and family-wise error rate adjustments) across outcome families. This paper is one of the most prominent examples of rigorous multiple testing adjustment in a field experiment, demonstrating that many individually significant effects lose significance after correction.

Haushofer, J., & Shapiro, J. (2016). The Short-Term Impact of Unconditional Cash Transfers to the Poor: Experimental Evidence from Kenya.

Quarterly Journal of Economics. DOI: 10.1093/qje/qjw025

Haushofer and Shapiro evaluated GiveDirectly's unconditional cash transfer program in Kenya, testing effects across many outcome domains including consumption, assets, food security, health, and psychological well-being. They rigorously applied FDR corrections (Benjamini-Hochberg) across outcome families, providing a model for how to handle multiple testing transparently in large-scale randomized evaluations.

Clarke, D., Romano, J. P., & Wolf, M. (2020). The Romano-Wolf Multiple-Hypothesis Correction in Stata.

The Stata Journal

Clarke, Romano, and Wolf developed a Stata implementation of the Romano-Wolf stepwise multiple testing correction, providing applied researchers with an accessible tool for controlling the family-wise error rate while accounting for the dependence structure among test statistics.