MethodAtlas
Estimation/Reporting Stage

Multiple Hypothesis Testing

Testing many hypotheses inflates false positives. Bonferroni, Holm, BH-FDR, and Romano-Wolf corrections.

When to Use Multiple Testing Corrections

Apply multiple testing corrections whenever your analysis involves more than one hypothesis test — multiple outcomes, multiple subgroups, multiple specifications, or multiple time periods. The correction is needed regardless of whether the tests were pre-registered. Common settings include: multi-arm experiments, heterogeneity analyses, event study pre-trend tests, and specification curve analyses.


The Problem of Too Many Tests

Imagine you run an experiment evaluating a tutoring program. You measure its effect on math scores, reading scores, attendance, behavior, self-confidence, parental involvement, peer relationships, and teacher ratings. That list is eight outcomes. Even if the program does absolutely nothing, with eight independent tests at the 5% level, you have roughly a 34% chance of finding at least one "significant" result.

If this probability does not alarm you, it should.

This inflation is the multiple testing problem, and it is everywhere in empirical research. Any time you test more than one hypothesis — multiple outcomes, multiple subgroups, multiple specifications, multiple time periods — the probability of finding at least one spurious "significant" result rises rapidly with the number of tests. Without correction, your paper's headline finding might be nothing more than a statistical accident.


Why Testing Many Hypotheses Inflates False Positives

Under the null hypothesis of no effect, each individual test has a 5% probability of a false positive (a Type I error). For m independent tests, the probability of at least one false positive — the familywise error rate (FWER) — is:

\text{FWER} = 1 - (1 - \alpha)^m

For α = 0.05:

Number of tests (m)    Probability of at least one false positive
1                      5%
5                      23%
10                     40%
20                     64%
50                     92%
100                    99.4%

With 20 tests, you are more likely than not to find a false positive. With 100 tests, it is virtually certain.
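The numbers in the table follow directly from the FWER formula. A minimal sketch (Python here for illustration; the section's worked examples use R):

```python
# FWER for m independent tests, each run at significance level alpha:
# FWER = 1 - (1 - alpha)^m
def fwer(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20, 50, 100):
    print(f"{m:>3} tests: FWER = {fwer(m):.1%}")
```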


Two Philosophies: FWER vs. FDR

Before choosing a correction method, you need to decide what you are trying to control.

Familywise Error Rate (FWER)

Control the probability of making even one false positive across all tests. FWER control is the strictest standard. You use FWER control when even a single false claim is unacceptable — clinical trials, policy decisions, confirmatory analyses.

False Discovery Rate (FDR)

Control the expected proportion of false positives among all rejected hypotheses. If you reject 20 hypotheses and allow FDR = 0.05, you expect about 1 of those 20 to be a false discovery. FDR is less conservative and more powerful. You use FDR control when you are doing exploratory analysis, testing many outcomes, or when some false positives are tolerable.


The Methods

1. Bonferroni Correction

The simplest approach. Divide your significance threshold by the number of tests:

\alpha_{\text{adj}} = \frac{\alpha}{m}

Equivalently, multiply each p-value by m and compare to α.

Pros: Dead simple. Valid under any dependence structure.
Cons: Extremely conservative, especially with many correlated tests. If your 20 outcomes are all measuring similar things (and thus correlated), Bonferroni overcorrects badly.
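The adjustment itself is one line. A sketch of the arithmetic (Python; the raw p-values are illustrative):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

raw = [0.003, 0.012, 0.041, 0.052]   # illustrative raw p-values
print(bonferroni(raw))               # with m = 4, each p-value is quadrupled
```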

2. Holm Step-Down Correction

A strict improvement over Bonferroni that is always at least as powerful, yet still controls FWER.

Procedure:

  1. Sort your m p-values from smallest to largest: p_(1) ≤ p_(2) ≤ … ≤ p_(m)
  2. Starting from the smallest, reject H_(k) if p_(k) ≤ α / (m − k + 1)
  3. Stop at the first non-rejection; do not reject any remaining hypotheses.

The threshold becomes less stringent as you work through the sorted list, giving you more power than Bonferroni.
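The three steps above can be sketched as follows (Python, illustrative):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: returns reject/keep decisions in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):                   # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):             # threshold alpha / (m - k + 1), k = rank + 1
            reject[i] = True
        else:
            break                                      # first non-rejection: stop
    return reject

print(holm([0.041, 0.003, 0.012, 0.30]))
# 0.003 passes 0.05/4, 0.012 passes 0.05/3, but 0.041 fails 0.05/2 = 0.025
```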

3. Benjamini-Hochberg (BH) FDR Correction

Controls the false discovery rate rather than the familywise error rate.

(Benjamini & Hochberg, 1995)

Procedure:

  1. Sort p-values from smallest to largest: p_(1) ≤ … ≤ p_(m)
  2. Find the largest k such that p_(k) ≤ (k/m) · q, where q is the desired FDR level
  3. Reject all hypotheses H_(1), …, H_(k)

The BH procedure is substantially more powerful than Bonferroni or Holm, but it controls a weaker error rate.
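The step-up rule can be sketched as follows (Python, illustrative; R's p.adjust with method = "BH" yields the same rejections via adjusted p-values):

```python
def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject all hypotheses up to the largest k
    with p_(k) <= (k/m) * q; returns decisions in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * q:
            k_max = k                        # remember the largest passing k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

print(bh_reject([0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430]))
# only 0.003 and 0.012 clear their thresholds (1/7)*0.05 and (2/7)*0.05
```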

4. Anderson (2008) Sharpened q-Values

Anderson (2008) adapted the BH procedure for the economics context, producing "sharpened" q-values that account for the share of true null hypotheses in the family of tests.

(Anderson, 2008)

The key insight: the standard BH procedure assumes all nulls could be true. If you can estimate the share of true nulls (π₀), you can sharpen the threshold and gain power. Anderson provides a Stata do-file and algorithm for computing these sharpened q-values.
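Anderson's do-file is the canonical implementation. As an illustrative re-implementation (not his code), each sharpened q-value can be computed as the smallest FDR level q at which the two-stage procedure of Benjamini, Krieger, and Yekutieli (2006), whose first stage estimates the share of true nulls, rejects the hypothesis:

```python
def bh_count(sorted_p, level):
    """Number of BH rejections at a given level (input: sorted p-values)."""
    m = len(sorted_p)
    k_max = 0
    for k, p in enumerate(sorted_p, start=1):
        if p <= k / m * level:
            k_max = k
    return k_max

def sharpened_qvalues(pvals):
    """Sharpened q-value per hypothesis: the smallest q (on a 0.001 grid)
    at which the two-stage BKY procedure rejects it. Illustrative sketch."""
    m = len(pvals)
    sp = sorted(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [1.0] * m
    for i in range(1000, 0, -1):                       # q = 1.000 down to 0.001
        q = i / 1000
        q1 = q / (1 + q)                               # stage 1: BH at q / (1 + q)
        r1 = bh_count(sp, q1)
        if r1 == 0:
            continue
        q2 = 1.0 if r1 == m else q1 * m / (m - r1)     # stage 2: rescale by estimated true-null share
        r2 = bh_count(sp, q2)
        for j in order[:r2]:
            qvals[j] = q                               # overwritten as q shrinks: ends at smallest rejecting q
    return qvals

print(sharpened_qvalues([0.003, 0.012, 0.041, 0.052]))
```

Because the grid search reruns BH for every q, this sketch is O(1000·m log m); Anderson's do-file follows the same grid-search logic.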

5. Romano-Wolf Stepdown

Among the most widely recommended methods for FWER control in applied economics. Romano and Wolf (2005) developed a stepdown procedure that uses resampling (bootstrap or randomization) to capture the joint dependence structure of the test statistics.

(Romano & Wolf, 2005)

Why it matters: Unlike Bonferroni and Holm, Romano-Wolf accounts for the correlation between test statistics. If your outcomes are highly correlated (as is typical), the effective number of independent tests is much smaller than the nominal count, and Romano-Wolf exploits this to give you more power.

Procedure (simplified):

  1. Compute all test statistics from the original data
  2. Bootstrap (or permute) the data many times, recomputing all test statistics each time
  3. Use the joint distribution of the bootstrapped statistics to compute adjusted p-values
  4. Apply a stepdown algorithm that sequentially removes rejected hypotheses
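The stepdown logic can be sketched as follows. This is a simplified illustration in the spirit of Westfall-Young / Romano-Wolf, not the published algorithm: it takes the observed statistics and a matrix of null-bootstrap statistics as given, whereas in real applications the bootstrap draws must be generated so that all nulls hold (e.g. by recentering outcomes), which the rwolf Stata command of Clarke, Romano, and Wolf (2020) automates.

```python
def stepdown_adjusted_pvalues(t_obs, t_boot):
    """Stepdown max-statistic adjusted p-values (simplified sketch).
    t_obs: m observed absolute test statistics.
    t_boot: B bootstrap draws, each a list of m statistics computed under the null."""
    m = len(t_obs)
    B = len(t_boot)
    order = sorted(range(m), key=lambda i: -t_obs[i])    # most significant first
    adj = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        remaining = order[step:]                         # hypotheses not yet stepped past
        # share of bootstrap draws whose max statistic over the remaining
        # hypotheses reaches the observed statistic: this max is what captures
        # the correlation between tests
        p = sum(max(b[j] for j in remaining) >= t_obs[i] for b in t_boot) / B
        running_max = max(running_max, p)                # enforce monotone adjusted p-values
        adj[i] = running_max
    return adj

# Tiny deterministic example: 2 hypotheses, 4 "bootstrap" draws
print(stepdown_adjusted_pvalues([3.0, 1.0],
                                [[0.5, 0.4], [3.5, 0.2], [1.2, 1.1], [0.1, 2.0]]))
# -> [0.25, 0.5]
```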

How to Do It: Code

R: standard corrections with p.adjust

# Raw p-values from multiple tests
p_values <- c(0.003, 0.012, 0.041, 0.052, 0.089, 0.210, 0.430)

# Bonferroni
p.adjust(p_values, method = "bonferroni")

# Holm
p.adjust(p_values, method = "holm")

# Benjamini-Hochberg FDR
p.adjust(p_values, method = "BH")

# Compare all methods
data.frame(
  raw        = p_values,
  bonferroni = p.adjust(p_values, method = "bonferroni"),
  holm       = p.adjust(p_values, method = "holm"),
  bh         = p.adjust(p_values, method = "BH")
)

Stata: Anderson sharpened q-values

* Download Michael Anderson's do-file from his website
* Input: a variable containing raw p-values
* The do-file computes sharpened q-values

* Example usage (after setting up the q-value do-file):
clear
set obs 4
gen pval = .
replace pval = 0.003 in 1
replace pval = 0.012 in 2
replace pval = 0.041 in 3
replace pval = 0.052 in 4
do "fdr_sharpened_qvalues.do"

Interactive: Watching False Discoveries Accumulate

Interactive Simulation

False Discovery Rate Simulator

Run multiple independent hypothesis tests when the true effect is zero. Watch how the number of false positives grows with the number of tests. Toggle corrections on and off to see how Bonferroni, Holm, and BH protect you.

Set the share of true effects to zero and run 20 tests. You will almost certainly find at least one "significant" result. Now turn on corrections and watch the false positives disappear.

Interactive Exercise

P-Hacking Arcade

You have a dataset where the true treatment effect is exactly zero. Try different analysis specifications to find a "significant" result (p < 0.05). Each combination of controls, sample restrictions, and outcome transformations constitutes one specification.


Why this matters: When researchers test multiple specifications and report only the "best" one, the reported p-value no longer reflects the true Type I error rate. With 18 possible specifications, the probability that at least one yields p < 0.05 under the null is 1 − (0.95)^18 ≈ 60.3%. Corrections like Bonferroni, Holm, or Benjamini-Hochberg account for this multiplicity.


How to Report Multiple Testing Corrections

A good reporting practice includes:

  1. State how many tests you run and how you group them into families.
  2. Report both unadjusted and adjusted p-values (or q-values) side by side.
  3. Specify the method and why you chose it (FWER vs. FDR, accounting for correlation or not).
  4. Define families clearly. Not every test in your paper needs to be in the same family. Group tests that address the same question or the same set of outcomes. Primary outcomes are one family; secondary outcomes may be another; subgroup analyses may be a third. When designing your study, a pre-analysis plan that specifies family groupings in advance adds credibility.

Example table format:

Outcome          Coefficient   SE     Raw p-value   Romano-Wolf p-value   BH q-value
Math scores      0.15          0.05   0.003         0.012                 0.009
Reading scores   0.08          0.04   0.041         0.089                 0.054
Attendance       0.03          0.02   0.089         0.210                 0.104


Concept Check

You test the effect of a program on 10 independent outcomes, all of which are truly zero. You use α = 0.05 with no correction. What is the approximate probability of finding at least one significant result?


Paper Library

Foundational (5)

Bonferroni, C. (1936). Teoria Statistica delle Classi e Calcolo delle Probabilita.

Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze

Bonferroni developed the classical correction for multiple comparisons, which controls the family-wise error rate by dividing the significance level by the number of tests. While conservative, the Bonferroni correction remains widely used due to its simplicity and broad applicability.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Journal of the Royal Statistical Society: Series B. DOI: 10.1111/j.2517-6161.1995.tb02031.x

Benjamini and Hochberg introduced the false discovery rate (FDR) as an alternative to family-wise error rate control. Their step-up procedure for controlling FDR is less conservative than Bonferroni while still providing meaningful protection against false positives, and has become the standard in many fields.

Romano, J. P., & Wolf, M. (2005). Stepwise Multiple Testing as Formalized Data Snooping.

Econometrica

Romano and Wolf developed a stepwise multiple testing procedure that controls the family-wise error rate while being less conservative than Bonferroni by resampling from the joint distribution of test statistics. Their method accounts for the correlation structure among tests and is widely used in economics.

Anderson, M. L. (2008). Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.

Journal of the American Statistical Association. DOI: 10.1198/016214508000000841

Anderson proposed using index tests and the Westfall-Young step-down procedure to address multiple testing in program evaluation. He demonstrated that many previously reported significant gender differences in early childhood interventions disappeared after proper multiple testing corrections.

Westfall, P. H., & Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment.

Wiley

Westfall and Young developed resampling-based methods for multiple testing that account for the dependence structure among test statistics. Their permutation-based step-down procedure is less conservative than Bonferroni and became a standard reference for multiple testing adjustments in applied research.

Application (5)

List, J. A., Shaikh, A. M., & Xu, Y. (2019). Multiple Hypothesis Testing in Experimental Economics.

Experimental Economics. DOI: 10.1007/s10683-018-09597-5

List, Shaikh, and Xu provided practical guidance on addressing multiple hypothesis testing in experimental economics. They compared various correction methods including Bonferroni, Holm, and FDR procedures, and demonstrated their application to field experiments with multiple outcome variables.

Hollenbeck, J. R., & Wright, P. M. (2017). Harking, Sharking, and Tharking: Making the Case for Post Hoc Analysis of Scientific Data.

Journal of Management. DOI: 10.1177/0149206316679487

Hollenbeck and Wright discussed the multiple testing problem in management research in the context of post hoc analyses. They argued for transparency about data exploration while maintaining statistical rigor, emphasizing the importance of adjusting for multiple comparisons when testing is exploratory.

Casey, K., Glennerster, R., & Miguel, E. (2012). Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.

Quarterly Journal of Economics. DOI: 10.1093/qje/qje027

Casey, Glennerster, and Miguel pre-registered their analysis plan for a community-driven development program in Sierra Leone and applied multiple testing corrections (including the Westfall-Young step-down procedure and family-wise error rate adjustments) across outcome families. This paper is one of the most prominent examples of rigorous multiple testing adjustment in a field experiment, demonstrating that many individually significant effects lose significance after correction.

Haushofer, J., & Shapiro, J. (2016). The Short-Term Impact of Unconditional Cash Transfers to the Poor: Experimental Evidence from Kenya.

Quarterly Journal of Economics. DOI: 10.1093/qje/qjw025

Haushofer and Shapiro evaluated GiveDirectly's unconditional cash transfer program in Kenya, testing effects across many outcome domains including consumption, assets, food security, health, and psychological well-being. They rigorously applied FDR corrections (Benjamini-Hochberg) across outcome families, providing a model for how to handle multiple testing transparently in large-scale randomized evaluations.

Clarke, D., Romano, J. P., & Wolf, M. (2020). The Romano-Wolf Multiple-Hypothesis Correction in Stata.

The Stata Journal

Clarke, Romano, and Wolf developed a Stata implementation of the Romano-Wolf stepwise multiple testing correction, providing applied researchers with an accessible tool for controlling the family-wise error rate while accounting for the dependence structure among test statistics.