Estimation Stage

Randomization Inference

When conventional asymptotics fail — few clusters, unusual randomization — Fisher's exact approach provides valid inference.

Applies To: Experimental Design, Difference-in-Differences (Canonical 2×2)
The Idea That Predates Everything

Before regression. Before t-tests. Before the entire apparatus of parametric inference that fills your econometrics textbook. There was a simpler, more elegant idea.

In 1935, Ronald Fisher proposed the following procedure: if you want to know whether a treatment had an effect, shuffle the treatment labels, recompute your test statistic, and see how unusual your original result is compared to the shuffled versions. No assumptions about normality. No asymptotics. No functional form. Just the physical act of randomization, turned into an inference engine.

This procedure is randomization inference, and it is experiencing a renaissance in applied economics. Young (2019) applied randomization tests to treatment effects estimated across 53 experimental papers from journals of the American Economic Association. In the average paper, he found 13–22% fewer significant individual treatment effects under randomization inference than authors had reported using conventional methods, and 33–49% fewer significant results in joint tests of the multiple treatment effects appearing together in published tables.

If that evidence does not make you take randomization inference seriously, nothing will.

Why It Matters

Under the sharp null and a known assignment mechanism, randomization inference provides p-values that are exact in finite samples, whatever the sample size or outcome distribution. That makes it a key robustness tool for any study where standard inference is questionable.

When Conventional Asymptotics Fail

Standard inference — computing a t-statistic and comparing it to a normal or t-distribution — relies on asymptotics. The central limit theorem guarantees that, with enough observations, the sampling distribution of your estimator is approximately normal. But "enough" is carrying substantial weight in that sentence.

Conventional p-values rely on asymptotic approximations that can fail badly with small samples, few clusters, or non-standard test statistics. When the approximations break down, reported significance levels are wrong — sometimes dramatically so. The situations where conventional inference goes badly wrong fall into four categories:

Small Samples

With 30 or 50 observations, the central limit theorem (CLT) approximation may be poor, especially if the outcome distribution is skewed. Your t-statistic might not actually follow a t-distribution, and your p-value might be too small.

Few Clusters

This failure mode is the most consequential one for applied economists. If you have data on students in 20 schools, and treatment is assigned at the school level, you effectively have 20 independent units for inference, regardless of how many students you have. With 20 clusters, cluster-robust standard errors can be severely biased downward, leading to over-rejection of the null (Cameron et al., 2008).

Non-Standard Test Statistics

What if you want to test a null hypothesis about a quantile, a ratio, or an interaction? The asymptotic distribution of these statistics may not be normal, and deriving the correct distribution analytically may be difficult. Randomization inference handles any test statistic.

Discrete Outcomes

With binary or count outcomes and small samples, the discreteness of the test statistic means the normal approximation is poor. Randomization inference naturally handles discrete distributions.

How It Works: Fisher's Sharp Null

Step 1: State the Null Hypothesis

The sharp null hypothesis is:

$$H_0: Y_i(1) = Y_i(0) \quad \text{for all } i$$

The sharp null says the treatment has zero effect on every single unit. Not on average — on every individual. The sharp null is stronger than the usual null of "no average effect," but it is what makes randomization inference work.

Under this null, every person's observed outcome is the same regardless of whether they were treated or not. That invariance means we know what would have happened under any treatment assignment — because nothing would have changed.
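A minimal sketch in base R of why the sharp null makes every potential outcome known. The vectors `y`, `d`, and `d_alt` are made-up toy data, not values from any study.

```r
# Hypothetical toy data: observed outcomes and the actual assignment
y <- c(3.1, 2.4, 5.0, 4.2, 1.9, 3.7)
d <- c(1, 0, 1, 0, 0, 1)

# In general, only one potential outcome per unit is observed
y1_observed <- ifelse(d == 1, y, NA)  # Y_i(1), seen only for treated units
y0_observed <- ifelse(d == 0, y, NA)  # Y_i(0), seen only for control units

# Under the sharp null Y_i(1) = Y_i(0), the missing entries equal the observed outcome
y1_null <- y
y0_null <- y

# Outcomes that a different assignment would have revealed under the null
d_alt <- c(0, 1, 0, 1, 1, 0)
y_alt <- ifelse(d_alt == 1, y1_null, y0_null)
identical(y_alt, y)  # TRUE: outcomes do not depend on the assignment
```

Because `y_alt` equals `y` for any choice of `d_alt`, the test statistic can be recomputed under every alternative assignment without observing new data.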

Step 2: Choose a Test Statistic

You can use any test statistic you like. Common choices:

Difference in means: $T = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$
t-statistic from a regression
Kolmogorov-Smirnov statistic (for distributional differences)
Rank-based statistics (Wilcoxon, etc.)
Any function of the data

This flexibility is a major advantage. There is no requirement that the statistic have a known analytical distribution.
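As an illustration, each of the statistics listed above can be written as a plain R function of the outcomes and the assignment vector. The function names and the toy data are hypothetical; this is a sketch in base R, not a reference implementation.

```r
# Each statistic takes outcomes y and a 0/1 assignment vector d
diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Welch t-statistic (a studentized difference in means)
t_stat <- function(y, d) unname(t.test(y[d == 1], y[d == 0])$statistic)

# Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs
ks_stat <- function(y, d) {
  f1 <- ecdf(y[d == 1])
  f0 <- ecdf(y[d == 0])
  max(abs(f1(y) - f0(y)))
}

# Difference in mean ranks (a Wilcoxon-type statistic)
rank_stat <- function(y, d) diff_means(rank(y), d)

# Hypothetical toy data
y <- c(3.1, 2.4, 5.0, 4.2, 1.9, 3.7)
d <- c(1, 0, 1, 0, 0, 1)
stats <- list(diff_means = diff_means, t = t_stat, ks = ks_stat, rank = rank_stat)
sapply(stats, function(f) f(y, d))
```

Any of these functions can be plugged into the same permutation loop, which is what makes the choice of statistic a design decision rather than a distributional one.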

Step 3: Compute the Test Statistic on the Actual Data

Run your analysis as you normally would. Record the test statistic: $T_{\text{obs}}$.

Step 4: Construct the Randomization Distribution

Now, re-randomize. Under the sharp null, every unit's outcome is fixed. The only thing that changes is the assignment vector. So:

Generate a new treatment assignment consistent with your experimental design (e.g., if you assigned 50 of 100 to treatment, randomly choose a different 50)
Recompute $T$ using this new assignment but the same outcomes
Repeat many times (e.g., 10,000 permutations — or enumerate all $\binom{100}{50}$ assignments if feasible)

The resulting distribution of $T$ values is the randomization distribution — the distribution of your test statistic under the null, given the exact experimental design.
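The re-randomization loop can be sketched in a few lines of base R. The data are simulated and the names (`y`, `d`, `t_perm`) are illustrative; the sketch assumes complete randomization with exactly 50 of 100 units treated.

```r
set.seed(1)  # for reproducibility of the illustration

# Hypothetical data: 100 units, 50 treated, outcomes unrelated to treatment
n <- 100
y <- rnorm(n)
d <- sample(rep(c(1, 0), each = n / 2))  # complete randomization

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# sample(d) reshuffles the labels, so every draw keeps 50 treated units
# while the outcomes y stay fixed
B <- 10000
t_perm <- replicate(B, diff_means(y, sample(d)))

t_obs <- diff_means(y, d)
quantile(t_perm, c(0.025, 0.5, 0.975))  # spread of T under the sharp null
```

Shuffling `d` is a valid way to re-randomize only because the design here is complete randomization; blocked or clustered designs need a different shuffling step.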

Step 5: Compute the p-Value

The randomization p-value is the fraction of randomization distribution values that are at least as extreme as $T_{\text{obs}}$:

$$p = \frac{\#\{|T_{\text{perm}}| \geq |T_{\text{obs}}|\} + 1}{B + 1}$$

where $B$ is the number of permutations. (The +1 in numerator and denominator ensures the p-value is never exactly zero and includes the observed statistic in the reference set.)
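The formula translates directly into R. The helper name `ri_pvalue` and the nine permuted values are hypothetical, chosen so the arithmetic is easy to check by hand.

```r
# Two-sided randomization p-value with the +1 adjustment
ri_pvalue <- function(t_obs, t_perm) {
  (sum(abs(t_perm) >= abs(t_obs)) + 1) / (length(t_perm) + 1)
}

# Hypothetical permuted statistics (B = 9)
t_perm <- c(-0.8, 0.3, 1.2, -0.1, 0.5, -1.5, 0.9, 0.2, -0.4)

ri_pvalue(t_obs = 1.0, t_perm = t_perm)  # (2 + 1) / (9 + 1) = 0.3
ri_pvalue(t_obs = 5.0, t_perm = t_perm)  # (0 + 1) / (9 + 1) = 0.1, never zero
```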

In words: under the sharp null, each unit's unobserved potential outcome equals its observed outcome, so all potential outcomes are known. The randomization distribution captures exactly how much variation in the test statistic arises from the random assignment alone, not from any treatment effect.

Under $H_0: Y_i(1) = Y_i(0)$ for all $i$, the observed outcome $Y_i$ is the same regardless of treatment. So for any permutation of the treatment vector $\mathbf{D}$, we can compute the test statistic using the observed outcomes and the permuted assignment.

The key insight: the original treatment assignment $\mathbf{D}_{\text{actual}}$ is just one draw from the set of all possible assignments. If the null is true, there is nothing special about the actual assignment — it is exchangeable with any other. The p-value measures how extreme the actual test statistic is relative to this reference set.

This argument requires only that treatment was randomly assigned. No distributional assumptions. No asymptotics. No functional form. The validity comes entirely from the design.

Formally, let $\Omega$ be the set of all feasible treatment assignments under the experimental design, with each assignment equally likely. Then:

$$p = \frac{1}{|\Omega|} \sum_{\mathbf{d} \in \Omega} \mathbf{1}(|T(\mathbf{d}, \mathbf{Y})| \geq |T(\mathbf{D}_{\text{actual}}, \mathbf{Y})|)$$

This fraction is an exact probability statement — no approximation involved.
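For a small experiment the set $\Omega$ can be enumerated outright. The sketch below uses made-up data for 8 units with 4 treated, so $|\Omega| = \binom{8}{4} = 70$; the variable names are illustrative.

```r
# Hypothetical small experiment: 8 units, 4 treated
y <- c(5.2, 3.9, 6.1, 4.4, 3.0, 2.8, 4.9, 3.3)
d_actual <- c(1, 0, 1, 0, 0, 0, 1, 1)

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Omega: every way to choose 4 treated units out of 8 (one column per assignment)
treated_sets <- combn(length(y), sum(d_actual))
ncol(treated_sets)  # choose(8, 4) = 70

# Test statistic under each feasible assignment, outcomes held fixed
t_all <- apply(treated_sets, 2, function(idx) {
  d <- as.numeric(seq_along(y) %in% idx)
  diff_means(y, d)
})

# Exact p-value: share of assignments at least as extreme as the actual one
t_obs <- diff_means(y, d_actual)
p_exact <- mean(abs(t_all) >= abs(t_obs) - 1e-12)  # tolerance guards against rounding in ties
p_exact
```

Because the actual assignment is itself one element of $\Omega$, the exact p-value can never be zero, which is the property the +1 adjustment mimics when permutations are sampled rather than enumerated.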

Respecting the Design: Stratified and Cluster Randomization

A critical requirement of randomization inference is that the permutations must mirror the actual experimental design. If you randomized within strata, you typically need to permute within strata. If you randomized at the cluster level, you typically need to permute clusters.

Stratified Randomization

If treatment was assigned within blocks (e.g., 5 of 10 students treated in each classroom), permutations must shuffle treatment labels within each classroom. Cross-classroom permutations would not reflect the actual design and would invalidate the test.
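A minimal base R sketch of within-block permutation, assuming a hypothetical design with 4 classrooms of 10 students and 5 treated per classroom. The helper `permute_within` is an illustrative name, not a library function.

```r
set.seed(2)

# Hypothetical blocked design: 4 classrooms of 10 students, 5 treated in each
classroom <- rep(1:4, each = 10)

# Shuffle treatment labels separately inside each block
permute_within <- function(d, block) ave(d, block, FUN = sample)

# Actual assignment: 5 ones and 5 zeros per classroom, shuffled within classroom
d <- permute_within(rep(rep(c(1, 0), each = 5), times = 4), classroom)

# A design-respecting permutation keeps 5 treated in every classroom
d_ok <- permute_within(d, classroom)
tapply(d_ok, classroom, sum)

# A naive shuffle across classrooms only preserves the overall total of 20
d_naive <- sample(d)
tapply(d_naive, classroom, sum)
```

The naive shuffle can produce classrooms with more or fewer than 5 treated students, which are assignments the actual design could never have generated.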

Cluster Randomization

If entire clusters (schools, villages, firms) are assigned to treatment, the unit of permutation is the cluster. You shuffle which clusters are treated, keeping all individuals within a cluster in the same treatment status. With 10 treated and 10 control clusters, you have $\binom{20}{10} = 184{,}756$ possible assignments — enough for exact enumeration if you are patient, or easily approximated with 10,000 random draws.
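The same idea at the cluster level, sketched in base R under the assumption of 20 hypothetical schools with 30 students each and 10 schools treated. The simulated outcomes include a school-level shock so that students within a school are correlated.

```r
set.seed(3)

# Hypothetical cluster design: 20 schools, 30 students each, 10 schools treated
school <- rep(1:20, each = 30)
d <- as.numeric(school %in% sample(1:20, 10))
y <- rnorm(length(school)) + rnorm(20)[school]  # student noise plus a school-level shock

choose(20, 10)  # 184756 feasible assignments

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Permute at the school level: draw 10 schools, then map back to students
permute_clusters <- function(cluster, n_treated) {
  ids <- unique(cluster)
  as.numeric(cluster %in% sample(ids, n_treated))
}

B <- 5000
t_perm <- replicate(B, diff_means(y, permute_clusters(school, 10)))
t_obs <- diff_means(y, d)
p_value <- (sum(abs(t_perm) >= abs(t_obs)) + 1) / (B + 1)
p_value
```

Every permuted assignment keeps all students of a school in the same arm, so the reference distribution reflects the 20 effective units rather than the 600 students.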

Randomization Inference vs. the Bootstrap

Readers occasionally conflate randomization inference (RI) with the bootstrap. Both are resampling methods, but they answer different questions and rely on different assumptions.

| | Randomization inference | Bootstrap |
| --- | --- | --- |
| What is resampled | The treatment assignment vector $D$, holding observed outcomes fixed | The observed data $(D, Y, X)$, treating the sample as an estimate of the population |
| Null hypothesis | Sharp null: $Y_i(1) = Y_i(0)$ for every unit $i$ | Weak null: $E[Y_i(1) - Y_i(0)] = 0$ (average effect zero) |
| Justification | The known assignment mechanism (the experiment's design) | Asymptotic theory: the sample distribution approximates the population distribution |
| Best for | Small samples, few clusters, exact inference under randomized designs | Inference about means / regression coefficients in moderate-to-large samples |
| What it does not assume | Distributional form of $Y$, asymptotic normality, large samples | Knowledge of the assignment mechanism |
| Canonical references | (Fisher, 1935), (Rosenbaum, 2002), (Imbens & Rubin, 2015), (Athey & Imbens, 2017) | (Efron, 1979), (Hall, 1992) |

The practical implication: in a small randomized experiment, RI delivers an exact p-value that does not rely on asymptotic approximations — the resampling enumerates (or samples from) the exact distribution of the test statistic under the design. The bootstrap, in contrast, approximates the sampling distribution of an estimator under the assumption that the empirical distribution converges to the population distribution. With small samples or heavy-tailed outcomes, the bootstrap approximation can be poor; RI sidesteps the approximation entirely.

A complementary observation: when the assignment mechanism is unknown — as in observational data with selection-on-observables — RI is generally not available, and the bootstrap (or analytic asymptotics) remains the default inference tool.
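The difference in what gets resampled can be made concrete with a short base R sketch. The data are simulated, the function names are hypothetical, and the bootstrap shown is one simple variant (resampling units with replacement within each arm); other bootstrap schemes exist.

```r
set.seed(4)

# Hypothetical small experiment with skewed outcomes
n <- 20
d <- sample(rep(c(1, 0), each = n / 2))
y <- rexp(n) + 0.5 * d

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Randomization inference: outcomes fixed, assignment re-drawn from the design
ri_draw <- function() diff_means(y, sample(d))

# Bootstrap: resample units with replacement within each arm,
# so each resampled unit keeps its own treatment status
boot_draw <- function() {
  idx <- c(sample(which(d == 1), replace = TRUE),
           sample(which(d == 0), replace = TRUE))
  diff_means(y[idx], d[idx])
}

t_ri   <- replicate(2000, ri_draw())
t_boot <- replicate(2000, boot_draw())

# RI draws center near zero (the sharp null); bootstrap draws center near the estimate
c(ri_center = mean(t_ri), boot_center = mean(t_boot), estimate = diff_means(y, d))
```

The RI draws describe the statistic under the null given the design, while the bootstrap draws describe the sampling variability of the estimate itself.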

The Sharp Null vs. the Weak Null

A subtlety that matters: randomization inference tests the sharp null ($Y_i(1) = Y_i(0)$ for all $i$), not the weak null ($E[Y_i(1) - Y_i(0)] = 0$). These nulls are different hypotheses.

The sharp null says the treatment has zero effect on every unit. The weak null says the average effect is zero, but allows the treatment to help some individuals and hurt others.

Randomization inference can reject the sharp null even when the average treatment effect (ATE) is exactly zero, if the treatment has heterogeneous effects. Suppose a program helps half the population by +2 and hurts the other half by -2. The ATE is zero, but the distribution of outcomes differs between treatment and control groups, and a test statistic sensitive to distributional differences (like the Kolmogorov-Smirnov statistic) could detect this difference.
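The +2 / -2 example can be simulated in base R. Everything here is hypothetical (the sample size, the noise distribution, the helper names), and the exact p-values depend on the random seed; the sketch is only meant to show that the choice of statistic determines which departures from the sharp null are detectable.

```r
set.seed(5)

# Hypothetical population: effect is +2 for half the units and -2 for the other half
n <- 400
y0 <- rnorm(n)
tau <- rep(c(2, -2), length.out = n)  # individual effects; they average to zero
d <- sample(rep(c(1, 0), each = n / 2))
y <- y0 + tau * d                     # observed outcomes

diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])
ks_stat <- function(y, d) {
  f1 <- ecdf(y[d == 1])
  f0 <- ecdf(y[d == 0])
  max(abs(f1(y) - f0(y)))
}

# Randomization p-value for an arbitrary statistic, using the +1 adjustment
ri_pvalue <- function(stat, B = 1000) {
  t_obs <- stat(y, d)
  t_perm <- replicate(B, stat(y, sample(d)))
  (sum(abs(t_perm) >= abs(t_obs)) + 1) / (B + 1)
}

p_dm <- ri_pvalue(diff_means)
p_ks <- ri_pvalue(ks_stat)
c(diff_in_means = p_dm, ks = p_ks)
```

In a setup like this, the Kolmogorov-Smirnov statistic picks up the much wider spread of treated outcomes, whereas the difference in means has little to detect because the average effect is zero.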

This sensitivity to heterogeneity is a feature, not a bug — but be aware of what you are testing. Note that when testing many hypotheses via randomization inference (RI) — for example, across multiple outcomes or subgroups — multiple testing corrections still apply.

When to Use Randomization Inference

| Setting | Use RI? | Why |
| --- | --- | --- |
| RCT with many units | Optional but good practice | Provides finite-sample validity as a complement to standard inference |
| RCT with few clusters (< 40) | Yes, strongly recommended | Cluster-robust SEs can be badly biased with few clusters |
| Natural experiment with small sample | Yes | Asymptotic approximation may be poor |
| DiD with few treated groups | Yes | Permute the treatment timing or treatment group |
| Any study where you want robustness | Yes | RI can validate or challenge your conventional p-values |
| Large-N observational study with many controls | Typically not needed | Asymptotics work well; RI adds little |

How to Do It: Code

```r
# Requires: ri2
library(ri2)

# df is assumed to hold one row per unit with columns `outcome` and `treatment`

# Declare the experimental design
# Here: 100 units, 50 assigned to treatment, complete randomization
declaration <- declare_ra(N = 100, m = 50)

# Conduct randomization inference
ri_result <- conduct_ri(
  formula = outcome ~ treatment,
  assignment = "treatment",  # name of the assignment column (conduct_ri defaults to "Z")
  declaration = declaration,
  sharp_hypothesis = 0,      # test sharp null of zero effect
  data = df,
  sims = 5000                # number of permutations
)

# View the p-value
summary(ri_result)

# Plot the randomization distribution
plot(ri_result)
```

Cluster-Level Permutation

```r
# Requires: ri2
library(ri2)

# Cluster-randomized design: 20 clusters, 10 treated
# The number of units is taken from the length of the clusters vector
declaration <- declare_ra(
  clusters = df$cluster_id,
  m = 10  # number of treated clusters
)

ri_result <- conduct_ri(
  formula = outcome ~ treatment,
  assignment = "treatment",  # name of the assignment column (conduct_ri defaults to "Z")
  declaration = declaration,
  sharp_hypothesis = 0,
  data = df,
  sims = 5000
)

summary(ri_result)
```

How to Report Randomization Inference

A typical reporting pattern:

We supplement conventional inference with randomization inference. Under the sharp null of no treatment effect for any unit, we permute the treatment assignment 5,000 times, respecting the original design (complete randomization within strata). The randomization-based p-value for the main treatment effect is 0.023, compared to the conventional p-value of 0.018. The close agreement between the two suggests that our results are not driven by distributional assumptions or finite-sample bias in standard errors.

When the two p-values diverge substantially, that divergence is informative — it typically means the asymptotic approximation is unreliable, and the randomization inference result is more reliable in such cases.

Common Mistakes

Pitfalls to avoid

Not respecting the experimental design when permuting. Match the permutation scheme to the randomization scheme (within strata for blocked designs; at the cluster level for cluster-randomized designs). Permuting at the wrong level invalidates the test.
Confusing the sharp null with the weak null. Randomization inference tests $Y_i(1) = Y_i(0)$ for all $i$. This null hypothesis is stronger than "no average effect." RI can reject even when the ATE is zero, if the treatment has heterogeneous effects (helping some and hurting others). This sensitivity is a feature, not a bug, but be aware of it.
Using too few permutations. With 100 permutations, your p-value has a resolution of 0.01 at best. Use at least 1,000, and preferably 5,000–10,000, for reliable p-values. For results near conventional thresholds (p ~ 0.05), more permutations give more precision.
Applying RI to observational data without justification. Randomization inference gets its power from the known assignment mechanism. In observational studies, the assignment mechanism is unknown, so you typically cannot construct the correct permutation distribution without additional assumptions (e.g., conditional independence). Consider pairing RI results with sensitivity analysis to assess how violations of design assumptions would affect your conclusions.
Forgetting the +1 adjustment. The p-value formula should include the observed statistic in the reference distribution: $p = (\#\{|T_{\text{perm}}| \geq |T_{\text{obs}}|\} + 1) / (B + 1)$. Without the +1, you can get p-values of exactly zero, which are never correct.
Choosing a test statistic that ignores covariates. A covariate-adjusted, studentized statistic (e.g., the t-statistic on treatment from a regression with controls) is often preferable to a raw difference in means. This choice can improve power without compromising validity, because the null distribution is still constructed by permutation.
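On the number of permutations: a simulated p-value is a binomial proportion, so its Monte Carlo standard error is approximately $\sqrt{p(1-p)/B}$. The short sketch below (the helper name `mc_se` is hypothetical) shows how that error shrinks with $B$ for a true p-value near 0.05.

```r
# Approximate Monte Carlo standard error of a simulated p-value near p
mc_se <- function(p, B) sqrt(p * (1 - p) / B)

B <- c(100, 1000, 5000, 10000)
data.frame(B = B, se_at_p05 = round(mc_se(0.05, B), 4))
# se_at_p05: 0.0218, 0.0069, 0.0031, 0.0022
```

With 100 permutations the simulation noise is about two percentage points, which is too coarse to tell 0.04 from 0.06; with 10,000 it is roughly a fifth of a percentage point.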

Concept Check

You run a cluster-randomized trial with 10 treatment clusters and 10 control clusters. Your conventional cluster-robust t-statistic gives p = 0.03. Why might you still want to conduct randomization inference?

Because 20 clusters is enough for asymptotic inference to be reliable.
Because with 20 clusters, cluster-robust standard errors tend to be downward-biased (leading to over-rejection), and RI provides an exact test that does not rely on asymptotics.
Because randomization inference always gives smaller p-values than conventional tests.
Because the treatment might have heterogeneous effects across clusters.

Paper Library

Foundational (8)

Athey, S., & Imbens, G. W. (2017). The Econometrics of Randomized Experiments. Handbook of Economic Field Experiments. DOI: 10.1016/bs.hefe.2016.10.003

Athey and Imbens provide a modern, rigorous treatment of the econometrics behind randomized experiments. They cover design, analysis, and inference issues such as stratification, clustering, and multiple hypothesis testing. It is an excellent reference for researchers running field experiments.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. Review of Economics and Statistics. DOI: 10.1162/rest.90.3.414

Cameron, Gelbach, and Miller address what happens when clustering is necessary but the number of clusters is small (fewer than 30-50). They propose the wild cluster bootstrap as a solution, which has become the standard approach when researchers have too few clusters for asymptotic cluster-robust standard errors to be reliable.

Cattaneo, M. D., Frandsen, B. R., & Titiunik, R. (2015). Randomization Inference in the Regression Discontinuity Design: An Application to Party Advantages in the U.S. Senate. Journal of Causal Inference. DOI: 10.1515/jci-2013-0010

Cattaneo, Frandsen, and Titiunik develop a randomization inference framework for regression discontinuity designs, exploiting the local randomization interpretation of close elections. They apply the method to estimate party advantages in U.S. Senate elections, demonstrating how Fisher-style permutation tests can provide finite-sample exact inference in RDD settings where asymptotic approximations may be unreliable.

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics. DOI: 10.1214/aos/1176344552

Efron introduces the bootstrap as a resampling-based alternative to the jackknife for estimating sampling distributions of statistics. The bootstrap approximates the unknown population distribution by repeated sampling with replacement from the observed data, and forms the foundation of modern resampling-based inference.

Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd.

Fisher's classic book lays the foundations of experimental design, introducing concepts like randomization, blocking, and factorial designs. The 'lady tasting tea' example from this book remains one of the most famous illustrations of hypothesis testing and the logic of controlled experiments.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. DOI: 10.1007/978-1-4612-4384-7

Hall provides the canonical theoretical treatment of the bootstrap, deriving its higher-order accuracy via Edgeworth expansions and characterizing when bootstrap approximations improve on first-order asymptotic theory. The reference text on bootstrap theory for inference about means and regression coefficients.

Harrison, G. W., & List, J. A. (2004). Field Experiments. Journal of Economic Literature. DOI: 10.1257/0022051043004577

Harrison and List provide an influential taxonomy of field experiments, distinguishing artefactual, framed, and natural field experiments from conventional lab experiments. The paper helps establish field experiments as a mainstream methodology in economics.

Heß, S. (2017). Randomization Inference with Stata: A Guide and Software. Stata Journal. DOI: 10.1177/1536867X1701700306

Heß develops the ritest Stata command and provides a practical guide to implementing randomization inference. The paper covers standard and clustered randomization designs and demonstrates how to conduct Fisher exact tests for a variety of experimental and quasi-experimental settings.

Application (1)

Young, A. (2019). Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results. Quarterly Journal of Economics. DOI: 10.1093/qje/qjy029

Young applies randomization inference to a large sample of experimental papers published in top economics journals and finds that many results that appear significant under conventional inference are insignificant under randomization tests. This paper demonstrates the practical importance of randomization inference for credible empirical research.

Survey (2)

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press. DOI: 10.1017/CBO9781139025751

Imbens and Rubin provide a comprehensive textbook grounding causal inference in the potential outcomes framework, with detailed treatment of matching, propensity scores, and subclassification. They provide rigorous foundations for selection-on-observables methods.

Rosenbaum, P. R. (2002). Observational Studies. Springer. DOI: 10.1007/978-1-4757-3692-2

Rosenbaum provides the standard textbook on observational study design, covering matching, sensitivity analysis, and design principles for drawing causal inferences from non-experimental data. His framework for sensitivity analysis (Rosenbaum bounds) is the standard tool for assessing how much unobserved confounding would be needed to overturn a matching-based finding.
