Randomization Inference
When conventional asymptotics fail — few clusters, unusual randomization — Fisher's exact approach provides valid inference.
The Idea That Predates Everything
Before regression. Before t-tests. Before the entire apparatus of parametric inference that fills your econometrics textbook. There was a simpler, more elegant idea.
In 1935, Ronald Fisher proposed the following procedure: if you want to know whether a treatment had an effect, shuffle the treatment labels, recompute your test statistic, and see how unusual your original result is compared to the shuffled versions. No assumptions about normality. No asymptotics. No functional form. Just the physical act of randomization, turned into an inference engine.
This procedure is the randomization test (also called a permutation test), and it is experiencing a renaissance in applied economics. Young (2019) applied randomization tests to 53 experimental papers from top economics journals and found that 13–22% fewer results were significant compared to conventional methods.
If that evidence does not make you take randomization inference seriously, nothing will.
Why It Matters
Conventional p-values rely on asymptotic approximations that can fail badly with small samples, few clusters, or non-standard test statistics. When these approximations break down, your reported significance levels are wrong — sometimes dramatically so. Randomization inference provides exact p-values that are valid regardless of sample size or distributional assumptions, making it an essential robustness tool for any study where standard inference is questionable.
When Conventional Asymptotics Fail
Standard inference — computing a t-statistic and comparing it to a normal or t-distribution — relies on asymptotics. The central limit theorem guarantees that, with enough observations, the sampling distribution of your estimator is approximately normal. But "enough" is carrying substantial weight in that sentence.
Here are the situations where conventional inference can go badly wrong:
Small Samples
With 30 or 50 observations, the CLT approximation may be poor, especially if the outcome distribution is skewed. Your t-statistic might not actually follow a t-distribution, and your p-value might be too small.
Few Clusters
This limitation is the most consequential one for applied economists. If you have data on students in 20 schools, and treatment is assigned at the school level, you effectively have 20 observations for inference — regardless of how many students you have. With 20 clusters, cluster-robust standard errors can be severely biased downward, leading to over-rejection of the null (Cameron et al., 2008).
Non-Standard Test Statistics
What if you want to test a null hypothesis about a quantile, a ratio, or an interaction? The asymptotic distribution of these statistics may not be normal, and deriving the correct distribution analytically may be difficult. Randomization inference handles any test statistic.
Discrete Outcomes
With binary or count outcomes and small samples, the discreteness of the test statistic means the normal approximation is poor. Randomization inference naturally handles discrete distributions.
How It Works: Fisher's Sharp Null
Step 1: State the Null Hypothesis
The sharp null hypothesis is:

$$H_0: Y_i(1) = Y_i(0) \quad \text{for all } i$$
This says the treatment has zero effect on every single unit. Not on average — on every individual. This null hypothesis is stronger than the usual null of "no average effect," but it is what makes randomization inference work.
Under this null, every person's observed outcome is the same regardless of whether they were treated or not. That invariance means we know what would have happened under any treatment assignment — because nothing would have changed.
Step 2: Choose a Test Statistic
You can use any test statistic you like. Common choices:
- Difference in means: $T = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$
- t-statistic from a regression
- Kolmogorov-Smirnov statistic (for distributional differences)
- Rank-based statistics (Wilcoxon, etc.)
- Any function of the data
This flexibility is a major advantage. There is no requirement that the statistic have a known analytical distribution.
Step 3: Compute the Test Statistic on the Actual Data
Run your analysis as you normally would. Record the test statistic: $T_{\text{obs}}$.
Step 4: Construct the Randomization Distribution
Now, re-randomize. Under the sharp null, every unit's outcome is fixed. The only thing that changes is the assignment vector. So:
- Generate a new treatment assignment consistent with your experimental design (e.g., if you assigned 50 of 100 to treatment, randomly choose a different 50)
- Recompute the test statistic $T$ using this new assignment but the same outcomes
- Repeat many times (e.g., 10,000 permutations — or enumerate all assignments if feasible)
The resulting distribution of $T$ values is the randomization distribution — the distribution of your test statistic under the null, given the exact experimental design.
Step 5: Compute the p-Value
The randomization p-value is the fraction of randomization distribution values that are at least as extreme as $T_{\text{obs}}$:

$$p = \frac{1 + \#\{k : |T_k| \ge |T_{\text{obs}}|\}}{1 + K}$$

where $K$ is the number of permutations. (The +1 in numerator and denominator ensures the p-value is never exactly zero and includes the observed statistic in the reference set.)
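Steps 1 through 5 can be condensed into a few lines of code. The sketch below is a hypothetical Python illustration (the chapter's own examples use R's ri2 package): a Monte Carlo randomization p-value for the difference in means, with the +1 correction.

```python
import random

def permutation_p_value(y, d, num_perms=10_000, seed=0):
    """Monte Carlo randomization p-value for the difference in means.

    y: observed outcomes (held fixed under the sharp null)
    d: 0/1 treatment indicators (the actual assignment)
    """
    rng = random.Random(seed)

    def diff_in_means(assign):
        treated = [yi for yi, di in zip(y, assign) if di == 1]
        control = [yi for yi, di in zip(y, assign) if di == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    t_obs = diff_in_means(d)

    # Re-randomize: shuffle the assignment vector, keep outcomes fixed.
    # Shuffling preserves the number treated (complete randomization).
    extreme = 0
    assign = list(d)
    for _ in range(num_perms):
        rng.shuffle(assign)
        if abs(diff_in_means(assign)) >= abs(t_obs):
            extreme += 1

    # +1 correction: the observed assignment is part of the reference set
    return (1 + extreme) / (1 + num_perms)
```

With a strong separation between groups, the observed statistic lands in the far tail and the p-value is small; with no effect, it is roughly uniform on (0, 1].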
Don't worry about the notation yet — here's what this means in words: Under the sharp null, all potential outcomes are observed. The randomization distribution captures exactly how much variation in the test statistic arises from the random assignment alone — not from any treatment effect.
Under $H_0: Y_i(1) = Y_i(0)$ for all $i$, the observed outcome $Y_i$ is the same regardless of treatment. So for any permutation $\tilde{D}$ of the treatment vector $D$, we can compute the test statistic $T(\tilde{D}, Y)$ using the observed outcomes and the permuted assignment.
The key insight: the original treatment assignment is just one draw from the set of all possible assignments. If the null is true, there is nothing special about the actual assignment — it is exchangeable with any other. The p-value measures how extreme the actual test statistic is relative to this reference set.
This argument requires only that treatment was randomly assigned. No distributional assumptions. No asymptotics. No functional form. The validity comes entirely from the design.
Formally, let $\Omega$ be the set of all feasible treatment assignments under the experimental design, with each assignment equally likely. Then:

$$p = \frac{1}{|\Omega|} \sum_{\tilde{D} \in \Omega} \mathbb{1}\left\{ |T(\tilde{D}, Y)| \ge |T_{\text{obs}}| \right\}$$

This fraction is an exact probability statement — no approximation involved.
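When the design is small enough, the full set of feasible assignments can be enumerated rather than sampled, giving the exact p-value with no Monte Carlo error. A hypothetical Python sketch for complete randomization (not from the original, which uses R):

```python
from itertools import combinations

def exact_p_value(y, d):
    """Exact randomization p-value, enumerating every assignment
    with the same number of treated units (complete randomization)."""
    n = len(y)
    m = sum(d)  # number treated, fixed by the design

    def diff_in_means(treated_idx):
        treated = [y[i] for i in treated_idx]
        control = [y[i] for i in range(n) if i not in treated_idx]
        return sum(treated) / len(treated) - sum(control) / len(control)

    t_obs = diff_in_means({i for i in range(n) if d[i] == 1})

    # Enumerate the entire reference set of feasible assignments
    assignments = list(combinations(range(n), m))
    extreme = sum(
        1 for a in assignments if abs(diff_in_means(set(a))) >= abs(t_obs)
    )
    # No +1 correction needed: the observed assignment is already included
    return extreme / len(assignments)
```

With 8 units and 4 treated there are only C(8,4) = 70 assignments, so enumeration is instant; the count grows combinatorially, which is why larger designs fall back to random draws.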
Respecting the Design: Stratified and Cluster Randomization
A critical requirement of randomization inference is that the permutations must mirror the actual experimental design. If you randomized within strata, you must permute within strata. If you randomized at the cluster level, you must permute clusters.
Stratified Randomization
If treatment was assigned within blocks (e.g., 5 of 10 students treated in each classroom), permutations must shuffle treatment labels within each classroom. Cross-classroom permutations would not reflect the actual design and would invalidate the test.
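A design-respecting permutation for this case shuffles labels block by block. The helper below is a hypothetical Python illustration (not from the original): it preserves the number treated within each stratum.

```python
import random

def permute_within_strata(d, strata, rng):
    """Shuffle treatment labels separately within each stratum,
    preserving the number treated per stratum."""
    new_d = list(d)
    by_stratum = {}
    for i, s in enumerate(strata):
        by_stratum.setdefault(s, []).append(i)
    for idx in by_stratum.values():
        labels = [d[i] for i in idx]
        rng.shuffle(labels)  # shuffle only within this stratum
        for i, lab in zip(idx, labels):
            new_d[i] = lab
    return new_d
```

Feeding these permutations into the p-value loop, instead of unrestricted shuffles, is what keeps the test valid under stratified randomization.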
Cluster Randomization
If entire clusters (schools, villages, firms) are assigned to treatment, the unit of permutation is the cluster. You shuffle which clusters are treated, keeping all individuals within a cluster in the same treatment status. With 10 treated and 10 control clusters, you have $\binom{20}{10} = 184{,}756$ possible assignments — enough for exact enumeration if you are patient, or easily approximated with 10,000 random draws.
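The cluster-level analogue can be sketched the same way (again a hypothetical Python illustration): draw which clusters are treated, then give every unit its cluster's status.

```python
import random

def permute_clusters(clusters, n_treated_clusters, rng):
    """Reassign treatment at the cluster level: sample which clusters
    are treated, then assign every unit its cluster's status."""
    unique = sorted(set(clusters))
    treated = set(rng.sample(unique, n_treated_clusters))
    return [1 if c in treated else 0 for c in clusters]
```

Note that the resulting indicator vector is constant within each cluster, exactly mirroring a cluster-randomized design.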
The Sharp Null vs. the Weak Null
A subtlety that matters: randomization inference tests the sharp null ($Y_i(1) = Y_i(0)$ for all $i$), not the weak null ($E[Y_i(1) - Y_i(0)] = 0$). These nulls are different hypotheses.
The sharp null says the treatment has zero effect on every unit. The weak null says the average effect is zero, but allows the treatment to help some individuals and hurt others.
Randomization inference can reject the sharp null even when the ATE is exactly zero, if the treatment has heterogeneous effects. Suppose a program helps half the population by +2 and hurts the other half by -2. The ATE is zero, but the distribution of outcomes differs between treatment and control groups, and a test statistic sensitive to distributional differences (like the Kolmogorov-Smirnov statistic) could detect this difference.
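This example can be checked directly. The hypothetical Python sketch below constructs a treatment that adds +2 to half the units and -2 to the other half: the mean difference is exactly zero, yet the Kolmogorov-Smirnov statistic is large.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    points = sorted(set(a) | set(b))

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Heterogeneous effects with an exactly zero ATE:
# the treatment helps half the units by +2 and hurts half by -2.
control = [0.0] * 200
treated = [2.0] * 100 + [-2.0] * 100

mean_diff = sum(treated) / len(treated) - sum(control) / len(control)  # 0.0
ks = ks_statistic(treated, control)  # 0.5: the distributions clearly differ
```

A difference-in-means test statistic would see nothing here, while a KS-based randomization test would reject the sharp null.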
This sensitivity to heterogeneity is a feature, not a bug — but be aware of what you are testing. Note that when testing many hypotheses via RI — for example, across multiple outcomes or subgroups — multiple testing corrections still apply.
Interactive: Permuting Treatment and Seeing the Null Distribution
Randomization Inference Simulator
Watch randomization inference in action. We generate data with a true effect (or not), compute the observed difference in means, then repeatedly shuffle the treatment labels to build the null distribution. The p-value is the fraction of shuffled statistics more extreme than the observed one. Set the true effect to zero and watch the observed statistic land in the middle of the distribution. Increase it and watch it migrate to the tail.
When to Use Randomization Inference
| Setting | Use RI? | Why |
|---|---|---|
| RCT with many units | Optional but good practice | Provides finite-sample validity as a complement to standard inference |
| RCT with few clusters (< 40) | Yes, strongly recommended | Cluster-robust SEs can be badly biased with few clusters |
| Natural experiment with small sample | Yes | Asymptotic approximation may be poor |
| DiD with few treated groups | Yes | Permute the treatment timing or treatment group |
| Any study where you want robustness | Yes | RI can validate or challenge your conventional p-values |
| Large-N observational study with many controls | Typically not needed | Asymptotics work well; RI adds little |
How to Do It: Code
```r
library(ri2)

# Declare the experimental design
# Here: 100 units, 50 assigned to treatment, complete randomization
declaration <- declare_ra(N = 100, m = 50)

# Conduct randomization inference
ri_result <- conduct_ri(
  formula = outcome ~ treatment,
  declaration = declaration,
  sharp_hypothesis = 0,  # test sharp null of zero effect
  data = df,
  sims = 5000  # number of permutations
)

# View the p-value
summary(ri_result)

# Plot the randomization distribution
plot(ri_result)
```

Cluster-Level Permutation
```r
library(ri2)

# Cluster-randomized design: 20 clusters, 10 treated
declaration <- declare_ra(
  N = nrow(df),
  clusters = df$cluster_id,
  m = 10  # number of treated clusters
)

ri_result <- conduct_ri(
  formula = outcome ~ treatment,
  declaration = declaration,
  sharp_hypothesis = 0,
  data = df,
  sims = 5000
)

summary(ri_result)
```

How to Report Randomization Inference
A typical reporting pattern:
We supplement conventional inference with randomization inference. Under the sharp null of no treatment effect for any unit, we permute the treatment assignment 5,000 times, respecting the original design (complete randomization within strata). The randomization-based p-value for the main treatment effect is 0.023, compared to the conventional p-value of 0.018. The close agreement between the two suggests that our results are not driven by distributional assumptions or finite-sample bias in standard errors.
When the two p-values diverge substantially, that divergence is informative — it typically means the asymptotic approximation is breaking down, in which case the randomization-based p-value is the more trustworthy of the two.
Common Mistakes
- Permuting without respecting the design: if randomization was stratified or at the cluster level, the permutations must be too. Unrestricted shuffling invalidates the test.
- Reading a rejection as evidence against the weak null of zero average effect: randomization inference tests the sharp null of zero effect for every unit.
- Using too few permutations: the Monte Carlo p-value is itself noisy, so use several thousand draws — or enumerate all assignments when feasible.
Concept Check
You run a cluster-randomized trial with 10 treatment clusters and 10 control clusters. Your conventional cluster-robust t-statistic gives p = 0.03. Why might you still want to conduct randomization inference?