Power Analysis & Sample-Size Planning
How large a sample do you need to detect your effect? Power calculations prevent underpowered studies.
When to Use Power Analysis
Conduct a power analysis any time you are designing a study where you will collect new data — randomized experiments, surveys, field trials, or lab studies. It is also valuable when planning a quasi-experimental analysis and you want to know whether your sample is large enough to detect policy-relevant effects. Common settings include: RCTs with individual or cluster randomization, difference-in-differences designs with a known policy change, regression discontinuity designs near a cutoff, and grant proposals (where reviewers will expect a power calculation).
The Question You Must Answer Before Collecting Data
You are designing a study. You have a treatment, an outcome, and a plan. The question that will determine whether your study succeeds or fails — before a single observation is collected — is this: how large does your sample need to be?
Too small, and you will not be able to detect the effect even if it exists. You will waste time, money, and possibly the goodwill of participants. Too large, and you are spending resources that could have been used elsewhere. Power analysis is the tool that gets you to the right number.
And yet, a remarkable fraction of studies in the social sciences are underpowered. Ioannidis et al. (2017) surveyed the economics literature and found that the median statistical power across 64,000 estimates was just 18% — meaning most studies had less than a one-in-five chance of detecting the effects they claimed to be studying.
Low statistical power is not a minor technical problem. Underpowered studies produce noisy estimates that are uninformative. Worse, the significant results that do emerge from underpowered literatures tend to be exaggerated — a phenomenon known as the "winner's curse" or Type M error. If your field is full of underpowered studies, the published estimates systematically overstate the truth.
The Four Ingredients
Every power calculation involves four quantities. Fix any three, and the fourth is determined:
- Significance level ($\alpha$): The probability of a Type I error (false positive). Conventionally 0.05.
- Power ($1 - \beta$): The probability of detecting a true effect — rejecting the null when it is false. Conventionally 0.80 or 0.90.
- Effect size: How large the treatment effect is, usually expressed in standard deviations (Cohen's $d$) or in the natural units of the outcome.
- Sample size ($n$): The number of observations.
The fundamental trade-off: holding everything else fixed, detecting smaller effects requires larger samples.
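This "fix three, solve for the fourth" logic can be checked directly with base R's `power.t.test()`, which solves for whichever quantity is left unspecified (a quick sketch; no packages needed):

```r
# Fix any three of {alpha, power, effect size, n}; power.t.test()
# solves for the fourth.

# Given d = 0.5, alpha = 0.05, power = 0.80: solve for n per group
res_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
res_n$n       # about 64 per group

# Given n = 100 per group: solve for the smallest detectable effect
res_d <- power.t.test(n = 100, sd = 1, sig.level = 0.05, power = 0.80)
res_d$delta   # about 0.40 SD
```

Leaving out `n` makes the function solve for sample size; leaving out `delta` makes it solve for the detectable effect.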
(Cohen, 1988)

The Formulas
Two-Sample Test (Simple RCT)
For a two-sample comparison of means with equal group sizes ($n$ per group, $2n$ total):

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{\frac{2}{n}}$$

Don't worry about the notation yet — here's what this means in words: the MDE formula follows directly from the rejection rule of a two-sided t-test. You need the true effect to shift the sampling distribution far enough from zero that the test rejects with the desired probability.

Under the null ($\delta = 0$), the test statistic $Z = (\bar{Y}_1 - \bar{Y}_0)/(\sigma\sqrt{2/n})$ follows approximately a standard normal. You reject when $|Z| > z_{1-\alpha/2}$.

Under the alternative ($\delta \neq 0$), the test statistic has a non-central distribution centered at $\delta/(\sigma\sqrt{2/n})$. Power is:

$$1 - \beta \approx \Phi\!\left(\frac{\delta}{\sigma\sqrt{2/n}} - z_{1-\alpha/2}\right)$$

Setting this to the desired power and solving:

$$\delta = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{\frac{2}{n}}$$

This expression is the MDE. Solving for $n$ gives the required sample size per group:

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}$$

For $\alpha = 0.05$ and power = 0.80, the critical values are $z_{1-\alpha/2} = 1.96$ and $z_{1-\beta} = 0.84$, giving $z_{1-\alpha/2} + z_{1-\beta} \approx 2.8$.

In terms of the standardized effect size $d = \delta/\sigma$:

$$n \approx \frac{2\,(2.8)^2}{d^2} \approx \frac{15.7}{d^2}$$

To detect $d = 0.2$, you need about 400 per group. For $d = 0.5$, about 64 per group.

The MDE is often more useful than the required sample size, because it answers the practical question directly: "Given the sample I can afford, what is the smallest effect I can reliably detect?"
Power for Cluster-Randomized Designs
When treatment is assigned at the cluster level (schools, villages, clinics), the effective sample size is much smaller than the total number of individuals. The key parameter is the intraclass correlation coefficient (ICC), denoted $\rho$: the fraction of total outcome variance that is between clusters rather than within clusters.

The design effect is:

$$\text{DE} = 1 + (m - 1)\,\rho$$

where $m$ is the average cluster size. The MDE for a cluster-randomized trial with $J$ clusters per arm is:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{\frac{2\,[1 + (m-1)\,\rho]}{J\,m}}$$
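A sketch of how the design effect erodes the effective sample size, using illustrative numbers (ICC of 0.10, 50 students per cluster, 15 clusters per arm, $\sigma = 1$; all assumed for the example):

```r
# Design effect: DE = 1 + (m - 1) * rho
rho <- 0.10; m <- 50
de <- 1 + (m - 1) * rho     # 5.9
# 30 clusters x 50 students = 1500 individuals, but effectively:
n_eff <- (30 * m) / de      # about 254 independent observations

# MDE with J = 15 clusters per arm (normal approximation)
J <- 15; sigma <- 1
z <- qnorm(0.975) + qnorm(0.80)
mde_cluster <- z * sigma * sqrt(2 * de / (J * m))
round(mde_cluster, 2)       # about 0.35 SD, vs. 0.14 if unclustered
```

Even a modest ICC of 0.10 shrinks 1,500 students to roughly 254 effective observations, more than doubling the detectable effect.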
(Duflo et al., 2007)

Power for Difference-in-Differences
Power analysis for DiD designs requires additional considerations:
- Serial correlation. Outcomes measured over time within the same unit are correlated, which affects the effective sample size. Bertrand et al. (2004) showed that ignoring serial correlation leads to dramatically inflated rejection rates.
- Number of pre- and post-periods. More pre-treatment periods improve precision by pinning down the counterfactual trend. McKenzie (2012) shows that ANCOVA specifications exploiting baseline data are almost always more powerful than simple DiD.
- Fraction treated. In DiD the fraction of treated units is often fixed by the policy. Power depends on this fraction — the optimal split is 50/50, and power drops as the split becomes more unequal.
The effective MDE for a DiD with $G$ groups, $T$ time periods, $m$ units per group-period, fraction treated $P$, and ICC $\rho$ is approximately:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{\frac{1}{P(1-P)\,G}\left(\rho + \frac{1-\rho}{m\,T}\right)}$$
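A sketch of this style of calculation under the normal approximation. The helper function and every input below (group counts, periods, ICC) are illustrative assumptions, not values from the text:

```r
# DiD MDE (normal approximation): G groups, T periods, m units per
# group-period, fraction treated P, ICC rho, outcome SD sigma.
did_mde <- function(G, T, m, P, rho, sigma = 1,
                    alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  z * sigma * sqrt((rho + (1 - rho) / (m * T)) / (P * (1 - P) * G))
}

# 50 groups, 10 periods, 30 units per group-period, half treated
did_mde(G = 50, T = 10, m = 30, P = 0.5, rho = 0.05)   # ~0.18 SD

# Power is best at a 50/50 split; a 20/80 split inflates the MDE
did_mde(G = 50, T = 10, m = 30, P = 0.2, rho = 0.05)   # ~0.23 SD
```

The second call shows the fraction-treated point numerically: moving from a 50/50 to a 20/80 split raises the MDE by about a quarter with everything else held fixed.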
Strategies for Improving Power on a Fixed Budget
If your budget is fixed and your sample is limited, several strategies can improve power without adding observations:
- Stratified randomization. Block on strong predictors of the outcome. This blocking ensures balance and reduces residual variance.
- ANCOVA with baseline covariates. Controlling for the pre-treatment value of the outcome dramatically reduces variance. McKenzie (2012) shows this is almost always more powerful than a simple difference in means or a DiD.
- Multiple post-treatment measurements. Averaging over multiple follow-up rounds reduces noise.
- Optimal allocation. If treatment costs differ from control costs, unequal allocation (e.g., 2:1 treatment-to-control) can improve efficiency.
- Reduce attrition. Every lost observation reduces power. Invest in tracking and retention. When differential attrition is unavoidable, Lee bounds can provide valid inference under a monotonicity assumption.
- Choose the right test statistic. A studentized statistic from a regression with covariates can be substantially more powerful than a raw difference in means. When conventional asymptotics are unreliable (e.g., few clusters), randomization inference provides a valid alternative that can be more powerful with the right test statistic.
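The ANCOVA point is easy to demonstrate by simulation. This sketch assumes a baseline measure correlating 0.7 with the endline outcome and a 0.3 SD treatment effect; all numbers are invented for illustration:

```r
# Simulated power: difference in means vs. ANCOVA with a baseline covariate
set.seed(42)
sim_once <- function(n = 100, tau = 0.3, r = 0.7) {
  y0 <- rnorm(2 * n)                           # baseline outcome
  z  <- rep(0:1, each = n)                     # treatment indicator
  y1 <- r * y0 + sqrt(1 - r^2) * rnorm(2 * n) + tau * z
  p_diff   <- summary(lm(y1 ~ z))$coefficients["z", 4]
  p_ancova <- summary(lm(y1 ~ z + y0))$coefficients["z", 4]
  c(diff = p_diff < 0.05, ancova = p_ancova < 0.05)
}
power_est <- rowMeans(replicate(1000, sim_once()))
power_est   # ANCOVA rejects a true effect far more often
```

With these assumed parameters the baseline covariate absorbs about half the outcome variance, so the same sample buys substantially more power at no extra data-collection cost.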
How to Choose the Effect Size
Choosing the effect size is the hardest part, and there is no formula for it:
- Prior studies. What have previous papers found for similar interventions? Beware publication bias — published effects are systematically larger than true effects.
- Pilot data. If you have run a small pilot, use it to estimate the effect size and variance. But pilot estimates are noisy, so treat them as rough guides.
- Policy relevance. What is the smallest effect that would matter for policy? If a job training program must increase earnings by at least $500/year to justify its cost, power for that threshold.
- Cohen's conventions. Small (), medium (), large (). These benchmarks are widely used but often misapplied. They are rules of thumb, not scientific facts.
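The policy-relevance route can be made concrete. A hypothetical sketch: suppose the program must raise earnings by $500/year to justify its cost, and the earnings SD in the target population is $4,000 (both numbers invented for illustration):

```r
# Convert a policy threshold into a standardized effect size, then
# into a required sample (normal approximation, 80% power, alpha = 0.05)
target_raw <- 500        # smallest effect that matters, in $/year
sd_outcome <- 4000       # outcome SD in the study population
d_policy <- target_raw / sd_outcome          # 0.125 SD
n_req <- 2 * (qnorm(0.975) + qnorm(0.80))^2 / d_policy^2
ceiling(n_req)           # about 1005 per group
```

Note how a policy-relevant threshold that sounds modest in dollars translates into a "small" standardized effect, and hence a large required sample.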
Interactive: Sample Size Explorer
Power & Sample Size Calculator
Adjust the effect size, significance level, and desired power to see how the required sample size changes. Watch how the sample size explodes as the effect size shrinks — detecting small effects requires very large studies. Toggle the ICC above zero to see how clustering inflates sample requirements.
Try setting the effect size to 0.1 (a "small" effect in many social science settings). Notice how much sample you need. Now set it to 0.5 and watch the required sample plummet. Then set the ICC to 0.10 and the cluster size to 50 — the required sample jumps dramatically.
Power Calculator

Explore how effect size, sample size, and significance level jointly determine statistical power. The power curve shows the probability of detecting a true effect as a function of sample size.

Results (for d = 0.2, α = 0.05, two-sided two-sample test)

| Target Power | Required n per group | Total N |
|---|---|---|
| 80% | 393 | 786 |
| 90% | 526 | 1,052 |
| 95% | 650 | 1,300 |

Minimum Detectable Effect

With n = 100 per group, the minimum detectable effect at 80% power is d = 0.396. A study of that size aimed at a true effect of d = 0.20 has only 29.3% power, a 70.7% chance of missing the effect; reaching 80% power requires at least n = 393 per group.
How to Do It: Code
Basic Power Analysis
library(pwr)
# Two-sample t-test: sample size for d = 0.2, power = 0.80
pwr.t.test(
d = 0.20, # standardized effect size
sig.level = 0.05,
power = 0.80,
type = "two.sample",
alternative = "two.sided"
)
# Returns n per group
# Compute MDE given a fixed sample size
pwr.t.test(
n = 500, # per group
sig.level = 0.05,
power = 0.80,
type = "two.sample"
)
# Returns d (the MDE in SD units)

Simulation-Based and Cluster Power
library(DeclareDesign)
# Define the full design
design <- declare_model(
N = 500,
U = rnorm(N),
potential_outcomes(Y ~ 0.3 * Z + U)
) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = complete_ra(N, m = 250)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(Y ~ Z, inquiry = "ATE")
# Diagnose power via simulation (flexible for any design)
diagnosis <- diagnose_design(design, sims = 500)
diagnosis

How to Report Power Analysis
A complete power analysis section includes:
- The target effect size and its justification. Where did the number come from? Prior studies? Policy relevance? An MDE argument?
- Key parameters. Standard deviation of the outcome (ideally from pilot data or similar studies), ICC if clustered, significance level, desired power.
- The computed sample size (or MDE, if sample is fixed).
- Sensitivity. Show how the required sample changes under different assumptions about effect size, ICC, or attrition.
Example write-up:
We compute the minimum detectable effect for our main outcome (employment at 12 months) given our sample of 2,000 individuals randomized 1:1 to treatment and control. Based on administrative data, the control group employment rate is 55% with a standard deviation of 0.50. At $\alpha = 0.05$ and 80% power, our MDE is 4.5 percentage points (an 8% increase over the control mean). This MDE is comparable to the 5-percentage-point effects found by Card et al. (2018) in their meta-analysis of active labor market programs, suggesting our study is adequately powered for policy-relevant effects.
Concept Check
You are designing a cluster-randomized trial in 30 schools (15 treatment, 15 control), with 200 students per school. The outcome is a test score with ICC = 0.10. A colleague says: 'We have 6,000 students — that number is a huge sample, we must be well-powered.' What is wrong with this reasoning?
Paper Library
Foundational (5)
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.
Cohen's foundational textbook introduced the concepts of effect size, statistical power, and sample size determination that became standard in the behavioral sciences. He provided power tables and conventions for small, medium, and large effect sizes that remain widely used across disciplines.
Bloom, H. S. (1995). Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.
Bloom introduced the minimum detectable effect (MDE) framework, which reports the smallest effect size a study can reliably detect given its design and sample size. This approach is now the standard way to discuss statistical power in program evaluation and experimental economics.
McKenzie, D. (2012). Beyond Baseline and Follow-Up: The Case for More T in Experiments.
McKenzie showed that collecting multiple rounds of data dramatically increases statistical power in randomized experiments. He demonstrated that ANCOVA with baseline data and difference-in-differences with multiple time periods can substantially reduce the required sample size, which is particularly valuable in development economics.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How Much Should We Trust Differences-in-Differences Estimates?.
Bertrand, Duflo, and Mullainathan showed that standard errors in difference-in-differences designs are often severely underestimated due to serial correlation, leading to dramatically over-rejected null hypotheses. Their paper highlighted the importance of proper inference and power considerations in panel data settings.
Hoenig, J. M., & Heisey, D. M. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.
Hoenig and Heisey demonstrated that post hoc (observed) power calculations are fundamentally flawed because they are a monotone function of the p-value and add no information beyond the test result itself. This paper is essential reading for understanding why power analysis must be conducted before data collection.
Application (4)
Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, H. (2017). The Power of Bias in Economics Research.
Ioannidis, Stanley, and Doucouliagos conducted a large-scale assessment of statistical power in economics research and found that the median power to detect typical effect sizes was only 18%. They documented widespread underpowering and publication bias, highlighting the importance of ex ante power analysis.
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.
Gelman and Carlin extended traditional power analysis by introducing Type S (sign) errors (the probability a significant estimate has the wrong sign) and Type M (magnitude) errors (the expected exaggeration ratio of significant estimates). These concepts provide a richer understanding of what happens in underpowered studies.
Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review.
Aguinis and colleagues reviewed 30 years of moderator analysis in applied psychology and management, finding that most studies were severely underpowered to detect interaction effects. They provided guidelines for computing power for moderated regression, which is highly relevant to management researchers testing contingency hypotheses.
Muralidharan, K., Niehaus, P., & Sukhtankar, S. (2019). Building State Capacity: Evidence from Biometric Smartcards in India.
Muralidharan, Niehaus, and Sukhtankar conducted a large-scale cluster-randomized evaluation of biometric smartcards for welfare payments in India, featuring detailed ex ante power calculations for their primary and secondary outcomes across districts. The paper demonstrates best practices for power analysis in a complex cluster-randomized design, showing how minimum detectable effects were computed and reported to justify the experimental design.
Survey (1)
Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.
Duflo, Glennerster, and Kremer provided a comprehensive toolkit for designing and analyzing randomized experiments in development economics, with extensive coverage of power calculations. They demonstrated how to compute power for cluster-randomized designs, stratified experiments, and spillover designs.