MethodAtlas
Practice·Design Stage·10 min read

Power Analysis & Sample-Size Planning

How large a sample do you need to detect your effect? Power calculations prevent underpowered studies.


When to Use Power Analysis

Conduct a power analysis any time you are designing a study where you will collect new data — randomized experiments, surveys, field trials, or lab studies. It is also valuable when planning a quasi-experimental analysis and you want to know whether your sample is large enough to detect policy-relevant effects. Common settings include: randomized controlled trials (RCTs) with individual or cluster randomization, difference-in-differences designs with a known policy change, regression discontinuity designs near a cutoff, and grant proposals (where reviewers will expect a power calculation).


The Key Question Before Collecting Data

You are designing a study. You have a treatment, an outcome, and a plan. The question that will determine whether your study succeeds or fails — before a single observation is collected — is this one: how large does your sample need to be?

Too small, and you will not be able to detect the effect even if it exists. You will waste time, money, and possibly the goodwill of participants. Too large, and you are spending resources that could have been used elsewhere. Power analysis is the tool that gets you to the right number.

And yet, a remarkable fraction of studies in the social sciences are underpowered. Ioannidis et al. (2017) surveyed the economics literature and found that the median statistical power across approximately 64,000 estimates from 159 empirical areas was just 18%, meaning most studies had less than a one-in-five chance of detecting the effects they claimed to be studying.

Low statistical power is not a minor technical problem. Underpowered studies produce noisy estimates that are uninformative. Worse, the significant results that do emerge from underpowered literatures tend to be exaggerated — a phenomenon known as the "winner's curse" or Type M error. If your field is full of underpowered studies, the published estimates systematically overstate the truth.
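The exaggeration mechanism is easy to see by simulation. A minimal R sketch (the true effect and sample size are illustrative, chosen to give roughly 17% power):

```r
# Simulating the winner's curse: in an underpowered design, the
# estimates that clear the significance bar systematically overstate
# the true effect.
set.seed(42)
true_effect <- 0.2
n <- 50  # per group; roughly 17% power for d = 0.2

est <- replicate(10000, {
  y1 <- rnorm(n, true_effect)
  y0 <- rnorm(n, 0)
  t  <- t.test(y1, y0)
  c(diff = unname(t$estimate[1] - t$estimate[2]), p = t$p.value)
})

sig <- est["diff", est["p", ] < 0.05]  # keep only "significant" estimates
mean(sig)                       # well above the true 0.2
mean(abs(sig)) / true_effect    # Type M exaggeration ratio
```

Conditioning on significance filters out the estimates near the truth and keeps the lucky draws, which is exactly the Type M error Gelman and Carlin (2014) describe.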


The Four Ingredients

Every power calculation involves four quantities. Fix any three, and the fourth is determined:

  1. Significance level ($\alpha$): The probability of a Type I error (false positive). Conventionally 0.05.
  2. Power ($1 - \beta$): The probability of detecting a true effect, i.e., rejecting the null when it is false. Conventionally 0.80 or 0.90.
  3. Effect size: How large the treatment effect is, usually expressed in standard deviations (Cohen's $d$) or in the natural units of the outcome.
  4. Sample size ($N$): The number of observations.

The fundamental trade-off: holding everything else fixed, detecting smaller effects requires larger samples (Cohen, 1988).
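Base R's power.t.test illustrates the trade-off: fix any three of the four quantities (here $\alpha$, power, and the effect size) and it solves for the fourth.

```r
# Solve for n per group given alpha, power, and effect size.
# With sd = 1, delta is the standardized effect (Cohen's d).
small  <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)
medium <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
small$n    # ~394 per group for d = 0.2
medium$n   # ~64 per group for d = 0.5

# Or fix n and solve for power instead:
power.t.test(n = 100, delta = 0.2, sd = 1, sig.level = 0.05)$power  # ~0.29
```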


The Formulas

Two-Sample Test (Simple RCT)

For a two-sample comparison of means with equal group sizes ($n$ per group, $N = 2n$ total):

Don't worry about the notation yet — here's what this means in words: The MDE formula follows directly from the rejection rule of a two-sided t-test. You need the true effect to shift the sampling distribution far enough from zero that the test rejects with the desired probability.

Under the null ($H_0: \mu_T - \mu_C = 0$), the test statistic $T = (\bar{Y}_T - \bar{Y}_C) / \text{SE}$ follows approximately a standard normal. You reject when $|T| > z_{1-\alpha/2}$.

Under the alternative ($H_A: \mu_T - \mu_C = \delta$), the test statistic has a non-central distribution centered at $\delta / \text{SE}$. Power is:

$$1 - \beta \approx \Phi\left(\frac{\delta}{\text{SE}} - z_{1-\alpha/2}\right)$$

Setting this to the desired power and solving:

$$\delta = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \text{SE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{2/n}$$

This expression is the MDE. Solving for $n$:

$$n = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$

For $\alpha = 0.05$ and power $= 0.80$, the critical values are $z_{0.975} \approx 1.96$ and $z_{0.80} \approx 0.84$, giving $(z_{1-\alpha/2} + z_{1-\beta})^2 \approx 7.85$.

In terms of the standardized effect size $d = \delta / \sigma$:

$$n \approx \frac{16}{d^2} \text{ per group}$$

To detect $d = 0.2$, you need about 400 per group. For $d = 0.5$, about 64 per group.

The minimum detectable effect (MDE):

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{2}{n}}$$

The required sample size per group:

$$n = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\text{MDE}^2}$$

The MDE is often more useful than the required sample size, because it answers the practical question directly: "Given the sample I can afford, what is the smallest effect I can reliably detect?"
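As a sketch, the closed-form expressions above take only a few lines of R (normal approximation, so the numbers differ slightly from the t-based power.t.test):

```r
# MDE for a two-arm trial with n per group (normal approximation).
mde <- function(n, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma * sqrt(2 / n)
}

# Required n per group to detect an effect delta.
n_required <- function(delta, sigma = 1, alpha = 0.05, power = 0.80) {
  2 * sigma^2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2
}

mde(n = 500)     # ~0.18 SD: the smallest effect detectable with 500 per group
n_required(0.2)  # ~392 per group, matching the n = 16 / d^2 shortcut
```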


Power for Cluster-Randomized Designs

When treatment is assigned at the cluster level — at the level of schools, villages, or clinics — the effective sample size is much smaller than the total number of individuals. The key parameter is the intracluster correlation coefficient (ICC), denoted $\rho$: the fraction of total outcome variance that is between clusters rather than within clusters.

The design effect (DEFF) is:

$$\text{DEFF} = 1 + (m - 1)\rho$$

where $m$ is the average cluster size. The MDE for a cluster-randomized trial with $J$ clusters per arm is:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{2}{J}} \cdot \sqrt{\rho + \frac{1-\rho}{m}}$$
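A quick numeric sketch of how clustering erodes power, using the formulas above (the ICC and cluster size are illustrative):

```r
# Design effect and effective sample size.
deff <- function(m, icc) 1 + (m - 1) * icc

n_total <- 6000; m <- 200; icc <- 0.10
deff(m, icc)             # 20.9: each observation carries ~1/21 of its nominal information
n_total / deff(m, icc)   # effective N ~ 287, despite 6,000 individuals

# Cluster MDE with J clusters per arm, following the formula above.
mde_cluster <- function(J, m, icc, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma *
    sqrt(2 / J) * sqrt(icc + (1 - icc) / m)
}
mde_cluster(J = 15, m = 200, icc = 0.10)  # ~0.33 SD
```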

Power for Difference-in-Differences

Power analysis for DiD designs requires additional considerations:

  1. Serial correlation. Outcomes measured over time within the same unit are correlated, which affects the effective sample size. Bertrand et al. (2004) showed that ignoring serial correlation leads to dramatically inflated rejection rates.

  2. Number of pre- and post-periods. More pre-treatment periods improve precision by pinning down the counterfactual trend. McKenzie (2012) shows that analysis of covariance (ANCOVA) specifications exploiting baseline data are generally more powerful than simple DiD, especially when baseline autocorrelation is high.

  3. Fraction treated. In DiD the fraction of treated units is often fixed by the policy. Power depends on this fraction — the optimal split is 50/50, and power drops as the split becomes more unequal.

The effective MDE for a DiD with $J$ groups, $T$ time periods, fraction treated $p$, and ICC $\rho$:

$$\text{MDE}_{\text{DiD}} \approx (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{1}{p(1-p) \cdot J} \left[\rho + \frac{1 - \rho}{T}\right]}$$
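The same approach extends to the DiD expression (again a normal-approximation sketch; $J$, $T$, $p$, and $\rho$ as defined above):

```r
# MDE for a DiD with J groups, T periods, fraction p treated, and ICC rho.
mde_did <- function(J, T, p, icc, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma *
    sqrt((icc + (1 - icc) / T) / (p * (1 - p) * J))
}

mde_did(J = 50, T = 10, p = 0.5, icc = 0.3)  # balanced treatment split
mde_did(J = 50, T = 10, p = 0.1, icc = 0.3)  # unequal split: larger MDE, same J and T
```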

Strategies for Improving Power on a Fixed Budget

If your budget is fixed and your sample is limited, several strategies can improve power without adding observations:

  1. Stratified (blocked) randomization on strong predictors of the outcome. Blocking ensures balance and reduces residual variance.
  2. ANCOVA with baseline covariates. Controlling for the pre-treatment value of the outcome dramatically reduces variance. McKenzie (2012) shows that ANCOVA is generally more powerful than a simple difference in means or a DiD.
  3. Multiple post-treatment measurements. Averaging over multiple follow-up rounds reduces noise.
  4. Optimal allocation. If treatment costs differ from control costs, unequal allocation (e.g., 2:1 treatment-to-control) can improve efficiency.
  5. Reduce attrition. Every lost observation reduces power. Invest in tracking and retention. When differential attrition is unavoidable, Lee bounds can provide valid inference under a monotonicity assumption.
  6. Choose the right test statistic. A studentized statistic from a regression with covariates can be substantially more powerful than a raw difference in means. When conventional asymptotics are unreliable (e.g., few clusters), randomization inference provides a valid alternative that can be more powerful with the right test statistic.
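The ANCOVA gain in point 2 is easy to quantify in the one-baseline, one-follow-up case McKenzie (2012) analyzes: controlling for the baseline outcome scales the residual variance by roughly $(1 - r^2)$, where $r$ is the baseline–follow-up autocorrelation, so the MDE shrinks by a factor of $\sqrt{1 - r^2}$. A sketch (the value of $r$ is an assumption):

```r
# MDE with and without a baseline covariate (normal approximation).
mde_post_only <- function(n, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma * sqrt(2 / n)
}
mde_ancova <- function(n, r, ...) mde_post_only(n, ...) * sqrt(1 - r^2)

mde_post_only(500)        # ~0.18 SD with 500 per group
mde_ancova(500, r = 0.7)  # ~0.13 SD: same sample, noticeably smaller MDE
```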

How to Choose the Effect Size

Choosing the effect size is the hardest part, and there is no formula for it:

  1. Prior studies. What have previous papers found for similar interventions? Beware — published effects are systematically larger than true effects.
  2. Pilot data. If you have run a small pilot, use it to estimate the effect size and variance. But pilot estimates are noisy, so treat them as rough guides.
  3. Policy relevance. What is the smallest effect that would matter for policy? If a job training program must increase earnings by at least $500/year to justify its cost, power for that threshold.
  4. Cohen's conventions. Small ($d = 0.2$), medium ($d = 0.5$), large ($d = 0.8$). These benchmarks are widely used but often misapplied. They are rules of thumb, not scientific facts.

Interactive: Sample Size Explorer

Try setting the effect size to 0.1 (a "small" effect in many social science settings). Notice how much sample you need. Now set it to 0.5 and watch the required sample plummet. Then set the ICC to 0.10 and the cluster size to 50 — the required sample jumps dramatically.


How to Do It: Code

Basic Power Analysis

# --- Step 1: Load the pwr package ---
# pwr provides analytic power calculations for common test types
library(pwr)

# --- Step 2: Compute required sample size per group ---
# Given: effect size d = 0.2, alpha = 0.05, power = 0.80
pwr.t.test(
  d = 0.20,          # standardized effect size (Cohen's d)
  sig.level = 0.05,  # Type I error rate
  power = 0.80,      # desired probability of detecting a true effect
  type = "two.sample",
  alternative = "two.sided"
)
# Output: n = required observations per group (treatment and control)

# --- Step 3: Compute MDE given a fixed sample size ---
# Useful when your budget constrains N; answers "what can I detect?"
pwr.t.test(
  n = 500,           # per group (fixed by budget)
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample"
)
# Output: d = minimum detectable effect in SD units

Simulation-Based and Cluster Power

# --- Step 1: Simulation-based power with DeclareDesign ---
# DeclareDesign lets you simulate any research design and diagnose power
library(DeclareDesign)

# Declare the full design: model, estimand, assignment, estimator
design <- declare_model(
  N = 500,                                  # sample size
  U = rnorm(N),                             # individual-level noise
  potential_outcomes(Y ~ 0.3 * Z + U)       # true ATE = 0.3
) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +       # target estimand
  declare_assignment(Z = complete_ra(N, m = 250)) +  # 1:1 randomization
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")          # difference-in-means

# diagnose_design runs sims simulations and reports power, bias, RMSE
diagnosis <- diagnose_design(design, sims = 500)
diagnosis  # check the "Power" column for your test

# --- Step 2: Cluster-RCT power using clusterPower ---
# Analytic power for two-arm cluster-randomized trials.
# crtpwr.2mean solves for whichever argument is left as NA.
library(clusterPower)
crtpwr.2mean(
  alpha = 0.05,
  power = NA,     # left NA: the function solves for power
  m = 20,         # clusters per arm
  n = 50,         # individuals per cluster
  d = 0.30,       # expected raw difference in means
  icc = 0.05,     # intracluster correlation (between-cluster variance share)
  varw = 1        # within-cluster variance
)
# Output: achieved power; leave m = NA instead to solve for clusters per arm

How to Report Power Analysis

A complete power analysis section includes:

  1. The target effect size and its justification. Where did the number come from? Prior studies? Policy relevance? An MDE argument?
  2. Key parameters. Standard deviation of the outcome (ideally from pilot data or similar studies), ICC if clustered, significance level, desired power.
  3. The computed sample size (or MDE, if sample is fixed).
  4. Sensitivity. Show how the required sample changes under different assumptions about effect size, ICC, or attrition.

Example write-up:

We compute the minimum detectable effect for our main outcome (employment at 12 months) given our sample of 2,000 individuals randomized 1:1 to treatment and control. Based on administrative data, the control group employment rate is 55% with a standard deviation of 0.50. At $\alpha = 0.05$ and 80% power, our MDE is 6.3 percentage points (an 11% increase over the control mean). This is somewhat larger than the 5-percentage-point effects found by Card et al. (2018) in their meta-analysis of active labor market programs, so our study is powered for effects at the upper end of the policy-relevant range; detecting a 5-point effect at 80% power would require roughly 1,570 individuals per arm.


Common Mistakes

  1. Computing "observed power" after the fact. Post hoc power is a monotone function of the p-value and adds no information beyond the test result (Hoenig & Heisey, 2001). Power analysis belongs before data collection.
  2. Powering for published effect sizes at face value. The winner's curse means published estimates are exaggerated, so a study powered for them is underpowered for the true effect.
  3. Ignoring clustering and serial correlation. A large count of individuals can conceal a small effective sample.
  4. Ignoring attrition. Power for the sample you expect to finish with, not the one you start with.
  5. Treating Cohen's benchmarks as facts. They are rules of thumb, not substitutes for domain knowledge.


Concept Check


You are designing a cluster-randomized trial in 30 schools (15 treatment, 15 control), with 200 students per school. The outcome is a test score with ICC = 0.10. A colleague says: 'We have 6,000 students — that is a huge sample, so we must be well-powered.' What is wrong with this reasoning?


Paper Library

Foundational (5)

Bloom, H. S. (1995). Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.

Evaluation Review. DOI: 10.1177/0193841X9501900504

Bloom introduces the minimum detectable effect (MDE) framework, which reports the smallest effect size a study can reliably detect given its design and sample size. This approach is now the standard way to discuss statistical power in program evaluation and experimental economics.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.

Lawrence Erlbaum Associates. DOI: 10.4324/9780203771587

Cohen's foundational textbook introduces the concepts of effect size, statistical power, and sample size determination that became standard in the behavioral sciences. He provides power tables and conventions for small, medium, and large effect sizes that remain widely used across disciplines.

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.

Perspectives on Psychological Science. DOI: 10.1177/1745691614551642

Gelman and Carlin extend traditional power analysis by introducing Type S (sign) errors (the probability a significant estimate has the wrong sign) and Type M (magnitude) errors (the expected exaggeration ratio of significant estimates). These concepts provide a richer understanding of what happens in underpowered studies.

Hoenig, J. M., & Heisey, D. M. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.

The American Statistician. DOI: 10.1198/000313001300339897

Hoenig and Heisey demonstrate that post hoc (observed) power calculations are fundamentally flawed because they are a monotone function of the p-value and add no information beyond the test result itself. This paper is essential reading for understanding why power analysis must be conducted before data collection.

McKenzie, D. (2012). Beyond Baseline and Follow-Up: The Case for More T in Experiments.

Journal of Development Economics. DOI: 10.1016/j.jdeveco.2012.01.002

McKenzie shows that collecting multiple rounds of data can substantially increase statistical power in randomized experiments. He demonstrates that ANCOVA with baseline data and difference-in-differences with multiple time periods can substantially reduce the required sample size, which is particularly valuable in development economics.

Application (2)

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review.

Journal of Applied Psychology. DOI: 10.1037/0021-9010.90.1.94

Aguinis, Beaty, Boik, and Pierce review 30 years of moderator analysis in applied psychology and management, finding that most studies are severely underpowered to detect interaction effects. They provide guidelines for computing power for moderated regression.

Muralidharan, K., Niehaus, P., & Sukhtankar, S. (2016). Building State Capacity: Evidence from Biometric Smartcards in India.

American Economic Review. DOI: 10.1257/aer.20141346

Muralidharan, Niehaus, and Sukhtankar evaluate a large-scale randomized rollout of biometric smartcards for welfare payments in India, finding that the reform improved payment speed, predictability, and integrity. The paper includes detailed ex ante power calculations that demonstrate best practices for reporting minimum detectable effects in cluster-randomized designs.

Survey (3)

Card, D., Kluve, J., & Weber, A. (2018). What Works? A Meta Analysis of Recent Active Labor Market Program Evaluations.

Journal of the European Economic Association. DOI: 10.1093/jeea/jvx028

Card, Kluve, and Weber conduct a meta-analysis of over 200 active labor market program evaluations across multiple countries, classifying estimates by program type, participant group, and post-program time horizon. They find that average impacts are near zero in the short run but become more positive two to three years after program completion, with human capital programs showing the largest medium-term gains and public employment subsidies proving less effective. Policy researchers designing labor market interventions should consider program type and evaluation time horizon when interpreting treatment effect estimates.

Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.

Handbook of Development Economics. DOI: 10.1016/S1573-4471(07)04061-2

Duflo, Glennerster, and Kremer write a comprehensive practical guide to running randomized experiments in development economics. The chapter covers all stages from design to analysis, including power calculations, stratification, dealing with attrition, and estimating treatment effects with imperfect compliance. It has become required reading for anyone designing a field experiment.

Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, H. (2017). The Power of Bias in Economics Research.

Economic Journal. DOI: 10.1111/ecoj.12461

Ioannidis, Stanley, and Doucouliagos conduct a large-scale assessment of statistical power in economics research and find that the median power to detect typical effect sizes is only 18%. They document widespread underpowering and publication bias, highlighting the importance of ex ante power analysis.