MethodAtlas
Practice·Design Stage·10 min read

Power Analysis & Sample-Size Planning

How large a sample do you need to detect your effect? Power calculations prevent underpowered studies.


When to Use Power Analysis

Conduct a power analysis any time you are designing a study where you will collect new data — randomized experiments, surveys, field trials, or lab studies. It is also valuable when planning a quasi-experimental analysis and you want to know whether your sample is large enough to detect policy-relevant effects. Common settings include: randomized controlled trials (RCTs) with individual or cluster randomization, difference-in-differences designs with a known policy change, regression discontinuity designs near a cutoff, and grant proposals (where reviewers will expect a power calculation).


The Key Question Before Collecting Data

You are designing a study. You have a treatment, an outcome, and a plan. The question that will determine whether your study succeeds or fails — before a single observation is collected — is this one: how large does your sample need to be?

Too small, and you will not be able to detect the effect even if it exists. You will waste time, money, and possibly the goodwill of participants. Too large, and you are spending resources that could have been used elsewhere. Power analysis is the tool that gets you to the right number.

And yet, a remarkable fraction of studies in the social sciences are underpowered. Ioannidis et al. (2017) surveyed the economics literature and found that the median statistical power across approximately 64,000 estimates from 159 empirical areas was just 18%, meaning most studies had less than a one-in-five chance of detecting the effects they claimed to be studying.

Low statistical power is not a minor technical problem. Underpowered studies produce noisy estimates that are uninformative. Worse, the significant results that do emerge from underpowered literatures tend to be exaggerated — a phenomenon known as the "winner's curse" or Type M error. If your field is full of underpowered studies, the published estimates systematically overstate the truth.
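The exaggeration mechanism is easy to see by simulation. A minimal R sketch (the true effect and sample size are illustrative, chosen to give roughly 17% power):

```r
# Simulating the winner's curse: in an underpowered design, the
# estimates that clear the significance bar systematically overstate
# the true effect.
set.seed(42)
true_effect <- 0.2
n <- 50  # per group; roughly 17% power for d = 0.2

est <- replicate(10000, {
  y1 <- rnorm(n, true_effect)
  y0 <- rnorm(n, 0)
  t  <- t.test(y1, y0)
  c(diff = unname(t$estimate[1] - t$estimate[2]), p = t$p.value)
})

sig <- est["diff", est["p", ] < 0.05]  # keep only "significant" estimates
mean(sig)                       # well above the true 0.2
mean(abs(sig)) / true_effect    # Type M exaggeration ratio
```

Conditioning on significance filters out the estimates near the truth and keeps the lucky draws, which is exactly the Type M error Gelman and Carlin (2014) describe.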


The Four Ingredients

Every power calculation involves four quantities. Fix any three, and the fourth is determined:

  1. Significance level ($\alpha$): The probability of a Type I error (false positive). Conventionally 0.05.
  2. Power ($1 - \beta$): The probability of detecting a true effect, i.e., rejecting the null when it is false. Conventionally 0.80 or 0.90.
  3. Effect size: How large the treatment effect is, usually expressed in standard deviations (Cohen's $d$) or in the natural units of the outcome.
  4. Sample size ($N$): The number of observations.

The fundamental trade-off: holding everything else fixed, detecting smaller effects requires larger samples (Cohen, 1988).
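Base R's power.t.test illustrates the trade-off: fix any three of the four quantities (here $\alpha$, power, and the effect size) and it solves for the fourth.

```r
# Solve for n per group given alpha, power, and effect size.
# With sd = 1, delta is the standardized effect (Cohen's d).
small  <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)
medium <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
small$n    # ~394 per group for d = 0.2
medium$n   # ~64 per group for d = 0.5

# Or fix n and solve for power instead:
power.t.test(n = 100, delta = 0.2, sd = 1, sig.level = 0.05)$power  # ~0.29
```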


The Formulas

Two-Sample Test (Simple RCT)

For a two-sample comparison of means with equal group sizes ($n$ per group, $N = 2n$ total):

Don't worry about the notation yet — here's what this means in words: The MDE formula follows directly from the rejection rule of a two-sided t-test. You need the true effect to shift the sampling distribution far enough from zero that the test rejects with the desired probability.

Under the null ($H_0: \mu_T - \mu_C = 0$), the test statistic $T = (\bar{Y}_T - \bar{Y}_C) / \text{SE}$ follows approximately a standard normal. You reject when $|T| > z_{1-\alpha/2}$.

Under the alternative ($H_A: \mu_T - \mu_C = \delta$), the test statistic has a non-central distribution centered at $\delta / \text{SE}$. Power is:

$$1 - \beta \approx \Phi\left(\frac{\delta}{\text{SE}} - z_{1-\alpha/2}\right)$$

Setting this to the desired power and solving:

$$\delta = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \text{SE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \sqrt{2/n}$$

This expression is the MDE. Solving for $n$:

$$n = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$

For $\alpha = 0.05$ and power $= 0.80$, the critical values are $z_{0.975} \approx 1.96$ and $z_{0.80} \approx 0.84$, giving $(z_{1-\alpha/2} + z_{1-\beta})^2 \approx 7.85$.

In terms of the standardized effect size $d = \delta / \sigma$:

$$n \approx \frac{16}{d^2} \text{ per group}$$

To detect $d = 0.2$, you need about 400 per group. For $d = 0.5$, about 64 per group.

The minimum detectable effect (MDE):

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{2}{n}}$$

The required sample size per group:

$$n = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\text{MDE}^2}$$

The MDE is often more useful than the required sample size, because it answers the practical question directly: "Given the sample I can afford, what is the smallest effect I can reliably detect?"
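As a sketch, the closed-form expressions above take only a few lines of R (normal approximation, so the numbers differ slightly from the t-based power.t.test):

```r
# MDE for a two-arm trial with n per group (normal approximation).
mde <- function(n, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma * sqrt(2 / n)
}

# Required n per group to detect an effect delta.
n_required <- function(delta, sigma = 1, alpha = 0.05, power = 0.80) {
  2 * sigma^2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2
}

mde(n = 500)     # ~0.18 SD: the smallest effect detectable with 500 per group
n_required(0.2)  # ~392 per group, matching the n = 16 / d^2 shortcut
```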


Power for Cluster-Randomized Designs

When treatment is assigned at the cluster level — at the level of schools, villages, or clinics — the effective sample size is much smaller than the total number of individuals. The key parameter is the intracluster correlation coefficient (ICC), denoted $\rho$: the fraction of total outcome variance that is between clusters rather than within clusters.

The design effect (DEFF) is:

$$\text{DEFF} = 1 + (m - 1)\rho$$

where $m$ is the average cluster size. The MDE for a cluster-randomized trial with $J$ clusters per arm is:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{2}{J}} \cdot \sqrt{\rho + \frac{1-\rho}{m}}$$
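A quick numeric sketch of how clustering erodes power, using the formulas above (the ICC and cluster size are illustrative):

```r
# Design effect and effective sample size.
deff <- function(m, icc) 1 + (m - 1) * icc

n_total <- 6000; m <- 200; icc <- 0.10
deff(m, icc)             # 20.9: each observation carries ~1/21 of its nominal information
n_total / deff(m, icc)   # effective N ~ 287, despite 6,000 individuals

# Cluster MDE with J clusters per arm, following the formula above.
mde_cluster <- function(J, m, icc, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma *
    sqrt(2 / J) * sqrt(icc + (1 - icc) / m)
}
mde_cluster(J = 15, m = 200, icc = 0.10)  # ~0.33 SD
```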

Power for Difference-in-Differences

Power analysis for DiD designs requires additional considerations:

  1. Serial correlation. Outcomes measured over time within the same unit are correlated, which affects the effective sample size. Bertrand et al. (2004) showed that ignoring serial correlation leads to dramatically inflated rejection rates.

  2. Number of pre- and post-periods. More pre-treatment periods improve precision by pinning down the counterfactual trend. McKenzie (2012) shows that analysis of covariance (ANCOVA) specifications exploiting baseline data are generally more powerful than simple DiD, especially when baseline autocorrelation is high.

  3. Fraction treated. In DiD the fraction of treated units is often fixed by the policy. Power depends on this fraction — the optimal split is 50/50, and power drops as the split becomes more unequal.

The effective MDE for a DiD with $J$ groups, $T$ time periods, fraction treated $p$, and ICC $\rho$:

$$\text{MDE}_{\text{DiD}} \approx (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sigma \cdot \sqrt{\frac{1}{p(1-p) \cdot J} \left[\rho + \frac{1 - \rho}{T}\right]}$$
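The same approach extends to the DiD expression (again a normal-approximation sketch; $J$, $T$, $p$, and $\rho$ as defined above):

```r
# MDE for a DiD with J groups, T periods, fraction p treated, and ICC rho.
mde_did <- function(J, T, p, icc, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma *
    sqrt((icc + (1 - icc) / T) / (p * (1 - p) * J))
}

mde_did(J = 50, T = 10, p = 0.5, icc = 0.3)  # balanced treatment split
mde_did(J = 50, T = 10, p = 0.1, icc = 0.3)  # unequal split: larger MDE, same J and T
```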

Strategies for Improving Power on a Fixed Budget

If your budget is fixed and your sample is limited, several strategies can improve power without adding observations:

  1. Stratified (blocked) randomization on strong predictors of the outcome. Blocking ensures balance and reduces residual variance.
  2. ANCOVA with baseline covariates. Controlling for the pre-treatment value of the outcome dramatically reduces variance. McKenzie (2012) shows that ANCOVA is generally more powerful than a simple difference in means or a DiD.
  3. Multiple post-treatment measurements. Averaging over multiple follow-up rounds reduces noise.
  4. Optimal allocation. If treatment costs differ from control costs, unequal allocation (e.g., 2:1 treatment-to-control) can improve efficiency.
  5. Reduce attrition. Every lost observation reduces power. Invest in tracking and retention. When differential attrition is unavoidable, Lee bounds can provide valid inference under a monotonicity assumption.
  6. Choose the right test statistic. A studentized statistic from a regression with covariates can be substantially more powerful than a raw difference in means. When conventional asymptotics are unreliable (e.g., few clusters), randomization inference provides a valid alternative that can be more powerful with the right test statistic.
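The ANCOVA gain in point 2 is easy to quantify in the one-baseline, one-follow-up case McKenzie (2012) analyzes: controlling for the baseline outcome scales the residual variance by roughly $(1 - r^2)$, where $r$ is the baseline–follow-up autocorrelation, so the MDE shrinks by a factor of $\sqrt{1 - r^2}$. A sketch (the value of $r$ is an assumption):

```r
# MDE with and without a baseline covariate (normal approximation).
mde_post_only <- function(n, sigma = 1, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) * sigma * sqrt(2 / n)
}
mde_ancova <- function(n, r, ...) mde_post_only(n, ...) * sqrt(1 - r^2)

mde_post_only(500)        # ~0.18 SD with 500 per group
mde_ancova(500, r = 0.7)  # ~0.13 SD: same sample, noticeably smaller MDE
```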

How to Choose the Effect Size

Choosing the effect size is the hardest part, and there is no formula for it:

  1. Prior studies. What have previous papers found for similar interventions? Beware — published effects are systematically larger than true effects.
  2. Pilot data. If you have run a small pilot, use it to estimate the effect size and variance. But pilot estimates are noisy, so treat them as rough guides.
  3. Policy relevance. What is the smallest effect that would matter for policy? If a job training program must increase earnings by at least $500/year to justify its cost, power for that threshold.
  4. Cohen's conventions. Small ($d = 0.2$), medium ($d = 0.5$), large ($d = 0.8$). These benchmarks are widely used but often misapplied. They are rules of thumb, not scientific facts.

Interactive: Sample Size Explorer

Try setting the effect size to 0.1 (a "small" effect in many social science settings). Notice how much sample you need. Now set it to 0.5 and watch the required sample plummet. Then set the ICC to 0.10 and the cluster size to 50 — the required sample jumps dramatically.


How to Do It: Code

Basic Power Analysis

# --- Step 1: Load the pwr package ---
# pwr provides analytic power calculations for common test types
library(pwr)

# --- Step 2: Compute required sample size per group ---
# Given: effect size d = 0.2, alpha = 0.05, power = 0.80
pwr.t.test(
  d = 0.20,          # standardized effect size (Cohen's d)
  sig.level = 0.05,  # Type I error rate
  power = 0.80,      # desired probability of detecting a true effect
  type = "two.sample",
  alternative = "two.sided"
)
# Output: n = required observations per group (treatment and control)

# --- Step 3: Compute MDE given a fixed sample size ---
# Useful when your budget constrains N; answers "what can I detect?"
pwr.t.test(
  n = 500,           # per group (fixed by budget)
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample"
)
# Output: d = minimum detectable effect in SD units

Simulation-Based and Cluster Power

# --- Step 1: Simulation-based power with DeclareDesign ---
# DeclareDesign lets you simulate any research design and diagnose power
library(DeclareDesign)

# Declare the full design: model, estimand, assignment, estimator
design <- declare_model(
  N = 500,                                  # sample size
  U = rnorm(N),                             # individual-level noise
  potential_outcomes(Y ~ 0.3 * Z + U)       # true ATE = 0.3
) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +       # target estimand
  declare_assignment(Z = complete_ra(N, m = 250)) +  # 1:1 randomization
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")          # difference-in-means

# diagnose_design runs sims simulations and reports power, bias, RMSE
diagnosis <- diagnose_design(design, sims = 500)
diagnosis  # check the "Power" column for your test

# --- Step 2: Cluster-RCT power using clusterPower ---
# Analytic power for two-arm cluster-randomized trials.
# crtpwr.2mean solves for whichever argument is left as NA.
library(clusterPower)
crtpwr.2mean(
  alpha = 0.05,
  power = NA,     # left NA: the function solves for power
  m = 20,         # clusters per arm
  n = 50,         # individuals per cluster
  d = 0.30,       # expected raw difference in means
  icc = 0.05,     # intracluster correlation (between-cluster variance share)
  varw = 1        # within-cluster variance
)
# Output: achieved power; leave m = NA instead to solve for clusters per arm

How to Report Power Analysis

A complete power analysis section includes:

  1. The target effect size and its justification. Where did the number come from? Prior studies? Policy relevance? An MDE argument?
  2. Key parameters. Standard deviation of the outcome (ideally from pilot data or similar studies), ICC if clustered, significance level, desired power.
  3. The computed sample size (or MDE, if sample is fixed).
  4. Sensitivity. Show how the required sample changes under different assumptions about effect size, ICC, or attrition.

Example write-up:

We compute the minimum detectable effect for our main outcome (employment at 12 months) given our sample of 2,000 individuals randomized 1:1 to treatment and control. Based on administrative data, the control group employment rate is 55% with a standard deviation of 0.50. At $\alpha = 0.05$ and 80% power, our MDE is 6.3 percentage points (an 11% increase over the control mean). This is somewhat larger than the 5-percentage-point effects found by Card et al. (2018) in their meta-analysis of active labor market programs, so our study is powered for effects at the upper end of the policy-relevant range; detecting a 5-point effect at 80% power would require roughly 1,570 individuals per arm.


Common Mistakes

  1. Computing "observed power" after the fact. Post hoc power is a monotone function of the p-value and adds no information beyond the test result (Hoenig & Heisey, 2001). Power analysis belongs before data collection.
  2. Powering for published effect sizes at face value. The winner's curse means published estimates are exaggerated, so a study powered for them is underpowered for the true effect.
  3. Ignoring clustering and serial correlation. A large count of individuals can conceal a small effective sample.
  4. Ignoring attrition. Power for the sample you expect to finish with, not the one you start with.
  5. Treating Cohen's benchmarks as facts. They are rules of thumb, not substitutes for domain knowledge.


Concept Check


You are designing a cluster-randomized trial in 30 schools (15 treatment, 15 control), with 200 students per school. The outcome is a test score with ICC = 0.10. A colleague says: 'We have 6,000 students — that is a huge sample, so we must be well-powered.' What is wrong with this reasoning?


Paper Library

Foundational (5)

Bloom, H. S. (1995). Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.

Evaluation Review. DOI: 10.1177/0193841X9501900504

Bloom introduces the minimum detectable effect (MDE) framework, which reports the smallest effect size a study can reliably detect given its design and sample size. This approach is now the standard way to discuss statistical power in program evaluation and experimental economics.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.

Lawrence Erlbaum Associates. DOI: 10.4324/9780203771587

Cohen's foundational textbook introduces the concepts of effect size, statistical power, and sample size determination that became standard in the behavioral sciences. He provides power tables and conventions for small, medium, and large effect sizes that remain widely used across disciplines.

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.

Perspectives on Psychological Science. DOI: 10.1177/1745691614551642

Gelman and Carlin extend traditional power analysis by introducing Type S (sign) errors (the probability a significant estimate has the wrong sign) and Type M (magnitude) errors (the expected exaggeration ratio of significant estimates). These concepts provide a richer understanding of what happens in underpowered studies.

Hoenig, J. M., & Heisey, D. M. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.

The American Statistician. DOI: 10.1198/000313001300339897

Hoenig and Heisey demonstrate that post hoc (observed) power calculations are fundamentally flawed because they are a monotone function of the p-value and add no information beyond the test result itself. This paper is essential reading for understanding why power analysis must be conducted before data collection.

McKenzie, D. (2012). Beyond Baseline and Follow-Up: The Case for More T in Experiments.

Journal of Development Economics. DOI: 10.1016/j.jdeveco.2012.01.002

McKenzie shows that collecting multiple rounds of data can substantially increase statistical power in randomized experiments. He demonstrates that ANCOVA with baseline data and difference-in-differences with multiple time periods can substantially reduce the required sample size, which is particularly valuable in development economics.

Application (2)

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review.

Journal of Applied Psychology. DOI: 10.1037/0021-9010.90.1.94

Aguinis, Beaty, Boik, and Pierce review 30 years of moderator analysis in applied psychology and management, finding that most studies are severely underpowered to detect interaction effects. They provide guidelines for computing power for moderated regression.

Muralidharan, K., Niehaus, P., & Sukhtankar, S. (2016). Building State Capacity: Evidence from Biometric Smartcards in India.

American Economic Review. DOI: 10.1257/aer.20141346

Muralidharan, Niehaus, and Sukhtankar evaluate a large-scale randomized rollout of biometric smartcards for welfare payments in India, finding that the reform improved payment speed, predictability, and integrity. The paper includes detailed ex ante power calculations that demonstrate best practices for reporting minimum detectable effects in cluster-randomized designs.

Survey (3)

Card, D., Kluve, J., & Weber, A. (2018). What Works? A Meta Analysis of Recent Active Labor Market Program Evaluations.

Journal of the European Economic Association. DOI: 10.1093/jeea/jvx028

Card, Kluve, and Weber conduct a meta-analysis of over 200 active labor market program evaluations across multiple countries, classifying estimates by program type, participant group, and post-program time horizon. They find that average impacts are near zero in the short run but become more positive two to three years after program completion, with human capital programs showing the largest medium-term gains and public employment subsidies proving less effective. Policy researchers designing labor market interventions should consider program type and evaluation time horizon when interpreting treatment effect estimates.

Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.

Handbook of Development Economics. DOI: 10.1016/S1573-4471(07)04061-2

Duflo, Glennerster, and Kremer write a comprehensive practical guide to running randomized experiments in development economics. The chapter covers all stages from design to analysis, including power calculations, stratification, dealing with attrition, and estimating treatment effects with imperfect compliance. It has become required reading for anyone designing a field experiment.

Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, H. (2017). The Power of Bias in Economics Research.

Economic Journal. DOI: 10.1111/ecoj.12461

Ioannidis, Stanley, and Doucouliagos conduct a large-scale assessment of statistical power in economics research and find that the median power to detect typical effect sizes is only 18%. They document widespread underpowering and publication bias, highlighting the importance of ex ante power analysis.