Clustering and Few-Cluster Inference
When to cluster standard errors, at what level, and what to do when you have few clusters.
The Clustering Problem
You run a regression of student test scores on a school-level policy. You have 500,000 students in 50 states. Your standard errors look tight. Your p-value is 0.001. You feel confident.
You should not.
The policy varies at the state level. Students within the same state share the same policy, the same funding formula, the same regulatory environment. Their residuals are correlated — not because of anything you did wrong, but because the world is structured in groups. Ignoring that correlation makes your standard errors too small, your t-statistics too large, and your confidence intervals too narrow.
Within-cluster correlation is the clustering problem, and getting it wrong is one of the most common inferential errors in applied research.
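A quick simulation makes the danger concrete. The data below are synthetic — a state-level "policy" with no true effect plus a shared state shock — so the exact numbers are illustrative:

```r
# Synthetic example: a state-level "policy" with NO true effect, plus a shared
# state-level shock in the residuals. Naive OLS standard errors ignore the
# shock and reject the (true) null far more often than the nominal 5%.
set.seed(42)
G <- 20       # states
n_g <- 50     # students per state
n_sims <- 500

reject <- replicate(n_sims, {
  cluster <- rep(seq_len(G), each = n_g)
  treat_g <- sample(rep(0:1, each = G / 2))   # half the states adopt the policy
  treat <- treat_g[cluster]
  y <- rnorm(G)[cluster] + rnorm(G * n_g)     # state shock + noise; no policy effect
  summary(lm(y ~ treat))$coefficients["treat", "Pr(>|t|)"] < 0.05
})
mean(reject)  # rejection rate of a nominal 5% test: far above 0.05
```

With a shared state shock this large, the naive test rejects the true null most of the time rather than 5% of the time.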
The Moulton Factor: Why Naive SEs Fail
The intuition is simple. If observations within clusters are correlated, they contain less independent information than their raw count suggests. Your effective sample size is not the number of observations — it is closer to the number of clusters.
The Moulton factor quantifies exactly how much naive standard errors understate true uncertainty. For a cluster-level regressor, the standard-error inflation factor is approximately:

$$\sqrt{1 + (\bar{n} - 1)\,\rho}$$

where $\bar{n}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient — the fraction of total residual variance that is between-cluster rather than within-cluster.

The numbers are startling. Even a modest intraclass correlation of $\rho = 0.05$ with an average cluster size of $\bar{n} = 50$ yields:

$$\sqrt{1 + 49 \times 0.05} = \sqrt{3.45} \approx 1.86$$

True standard errors are 86% larger than the naive ones. A t-statistic of 2.5 — seemingly significant at the 1% level — becomes 1.34 after correction. Not significant at any conventional level.
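The arithmetic is easy to verify in R, using the values from the example ($\rho = 0.05$, $\bar{n} = 50$):

```r
# Moulton inflation factor for a cluster-level regressor:
#   sqrt(1 + (n_bar - 1) * rho)
rho <- 0.05    # intraclass correlation
n_bar <- 50    # average cluster size

moulton <- sqrt(1 + (n_bar - 1) * rho)
moulton        # about 1.86: true SEs are ~86% larger than naive SEs
2.5 / moulton  # a naive t-statistic of 2.5 shrinks to about 1.34
```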
(Bertrand et al., 2004)

Why Does Correlation Arise?
Within-cluster correlation of residuals arises from several sources:
- **Shared treatments.** If a policy is implemented at the state level, all individuals in the state experience the same treatment shock. Any state-level unobservable that affects outcomes creates residual correlation.
- **Shared environments.** Students in the same school share teachers, resources, and peer effects. Workers in the same firm share management quality and organizational culture.
- **Sampling design.** If you sample clusters (schools, villages) and then sample individuals within clusters, the sampling itself induces correlation.
- **Serial correlation.** In panel data, the same unit observed over time has correlated residuals across periods. Serial correlation is clustering over time within units.
When to Cluster: The Abadie et al. Framework
Not all regressions require clustered standard errors. The decision of whether and how to cluster is not a mechanical one — it depends on the research design, the source of variation, and the sampling process.
Abadie, Athey, Imbens, and Wooldridge (2023) provide a principled framework. Their key insight is that clustering addresses two distinct problems, and you should cluster when either applies:
Reason 1: Clustered Treatment Assignment
If treatment is assigned at the cluster level (e.g., a policy enacted at the state level), you must cluster at least at that level. The assignment mechanism induces correlation in treatment status within clusters, which in turn creates correlation in residuals.
Rule: Cluster at the level of treatment assignment or higher.
Reason 2: Clustered Sampling
If you sample clusters from a population of clusters (e.g., you sample 50 schools from 5,000 schools in the country), you should cluster at the sampling level to account for the fact that your clusters are a random draw.
Rule: If clusters are sampled, cluster at the sampling level.
When Not to Cluster
If treatment varies at the individual level and you observe the full population (or a simple random sample) of clusters, clustering may not be necessary. In this case, heteroscedasticity-robust standard errors suffice.
| Setting | Cluster? | Level |
|---|---|---|
| State-level policy, individual data | Yes | State |
| School-level randomization, student outcomes | Yes | School |
| Individual-level randomization, no clustering in sampling | Not required | — (use robust SEs) |
| Firm-level panel, firm FE | Yes | Firm (at minimum) |
| DiD with state-level treatment variation | Yes | State |
| Multi-site RCT, randomized within site | Depends | Site (if sites are sampled from a larger population) |
Cluster-Robust Standard Errors
Cluster-robust standard errors are computed via a sandwich estimator that allows for arbitrary within-cluster correlation. The key formula for the variance of the OLS estimator is:

$$\widehat{V}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}$$

where $g = 1, \dots, G$ indexes clusters, $X_g$ is the matrix of regressors for cluster $g$, and $\hat{u}_g$ is the vector of residuals for cluster $g$.
CR0 vs. CR1 vs. CR2
Different variants apply different finite-sample corrections:
**CR0 (no correction):** The formula above with raw residuals. Stata's cluster() option applies this sandwich with a degrees-of-freedom adjustment of $\frac{G}{G-1} \cdot \frac{N-1}{N-K}$ by default, making its output effectively CR1.

**CR1 (degrees-of-freedom correction):** Multiplies the variance by $\frac{G}{G-1} \cdot \frac{N-1}{N-K}$, analogous to the HC1 correction for heteroscedasticity-robust standard errors. CR1 is the most commonly used variant in practice.

**CR2 (bias-corrected):** Applies a cluster-level leverage correction analogous to HC2. The CR2 estimator uses adjusted residuals $A_g \hat{u}_g$ in place of $\hat{u}_g$:

$$\widehat{V}_{CR2}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' A_g \hat{u}_g \hat{u}_g' A_g X_g \right) (X'X)^{-1}$$

where $A_g = (I_{n_g} - H_{gg})^{-1/2}$ is a correction matrix based on the cluster-level hat matrix $H_{gg} = X_g (X'X)^{-1} X_g'$. CR2 is less biased than CR1 with few clusters and is particularly recommended when cluster sizes are unbalanced.
(Pustejovsky & Tipton, 2018)

Degrees of Freedom
With clustered standard errors, the relevant degrees of freedom for inference is approximately $G - 1$ (the number of clusters minus one), not $N - K$ (observations minus parameters). This matters enormously when $G$ is small. A t-statistic of 2.1 is significant at 5% with 100 degrees of freedom but not with 10 degrees of freedom (critical value 2.23).
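You can see the difference directly from the t-distribution quantiles:

```r
# Two-sided 5% critical values from the t-distribution:
qt(0.975, df = 100)  # about 1.98 -- a t-statistic of 2.1 clears this bar
qt(0.975, df = 10)   # about 2.23 -- the same t-statistic of 2.1 does not
```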
Code: Cluster-Robust Standard Errors
```r
library(fixest)
library(clubSandwich)

# --- CR1 (default in fixest) ---
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)
summary(model)  # reports CR1 clustered SEs

# --- CR2 (bias-corrected, via clubSandwich) ---
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state_fe),
               data = df)
cr2_vcov <- vcovCR(model_lm, cluster = df$state_id, type = "CR2")
coef_test(model_lm, vcov = cr2_vcov, test = "Satterthwaite")

# --- Two-way clustering (state and year) ---
model_2way <- feols(outcome ~ treatment + x1 + x2 | state_fe + year_fe,
                    data = df,
                    cluster = ~state_id + year_id)
summary(model_2way)
```

The Few-Cluster Problem
Cluster-robust standard errors have an asymptotic justification: they are consistent as the number of clusters $G \to \infty$. But what happens when $G$ is small?
The answer: things break down. And "small" starts sooner than you might think.
What Counts as "Few"?
There is no sharp cutoff, but the empirical evidence is clear:
| Number of Clusters ($G$) | Reliability of CR1 SEs |
|---|---|
| $G \geq 50$ | Generally reliable |
| $30 \leq G < 50$ | Moderate concern; CR2 + Satterthwaite df helps |
| $15 \leq G < 30$ | Substantial concern; wild cluster bootstrap recommended |
| $5 \leq G < 15$ | Serious concern; wild cluster bootstrap or randomization inference essential |
| $G < 5$ | CR1 SEs are unreliable; use bootstrap or RI |
Why CR1 Fails with Few Clusters
The cluster-robust variance estimator involves a sum over cluster-level "scores" ($X_g' \hat{u}_g$). With few clusters, this sum has only a few independent terms, so the estimate of the variance is itself highly variable. The result is that:
- **Standard errors are biased downward.** With few clusters, CR1 standard errors tend to be too small, leading to over-rejection of the null.
- **The t-distribution approximation is poor.** Even with the degrees-of-freedom correction, the actual sampling distribution of the t-statistic is not well approximated by a t-distribution when $G$ is small.
- **Coverage of confidence intervals is too low.** A 95% confidence interval based on CR1 may have actual coverage of 80% or worse with 10 clusters.
Cameron, Gelbach, and Miller (2008) showed through simulation that with a small number of clusters, the actual rejection rate of a 5% test using CR1 standard errors can be 15–25% — three to five times the nominal rate.
Wild Cluster Bootstrap
The wild cluster bootstrap is the workhorse solution to the few-cluster problem. Proposed by Cameron, Gelbach, and Miller (2008), it provides more reliable inference than conventional cluster-robust standard errors when the number of clusters is small.
How It Works
The wild cluster bootstrap operates as follows:
1. **Estimate the model under the null hypothesis** $H_0: \beta = 0$. Obtain the restricted residuals $\tilde{u}_i$ and restricted fitted values $\tilde{y}_i$.

2. **Generate bootstrap samples.** For each bootstrap iteration $b = 1, \dots, B$:
   - Draw a random weight $w_g^b \in \{-1, +1\}$, each with probability $1/2$, for each cluster $g$ (Rademacher weights). All observations within cluster $g$ receive the same weight.
   - Construct the bootstrap outcome: $y_i^{*b} = \tilde{y}_i + w_{g(i)}^b \, \tilde{u}_i$, where $\tilde{y}_i$ is the restricted predicted value.
   - Re-estimate the model on $y^{*b}$ and compute the t-statistic $t^{*b}$.

3. **Compute the bootstrap p-value.** The p-value is the fraction of bootstrap t-statistics that exceed the observed t-statistic in absolute value:

$$p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left\{ |t^{*b}| \geq |t| \right\}$$
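The steps above can be sketched in base R for the one-regressor case. This is illustrative only — the hypothetical `cr1_t` helper computes a CR1 t-statistic by hand, and `fwildclusterboot::boottest` should be used in practice:

```r
# Base-R sketch of the wild cluster bootstrap for the one-regressor model
# y ~ treat, testing H0: beta_treat = 0 (so the restricted model is
# intercept-only). Illustrative; use fwildclusterboot::boottest in practice.

cr1_t <- function(y, treat, cluster) {
  X <- cbind(1, treat)
  fit <- lm.fit(X, y)
  u <- fit$residuals
  XtXinv <- solve(crossprod(X))
  meat <- matrix(0, 2, 2)
  for (g in unique(cluster)) {  # sum of cluster-level score outer products
    sg <- crossprod(X[cluster == g, , drop = FALSE], u[cluster == g])
    meat <- meat + tcrossprod(sg)
  }
  G <- length(unique(cluster)); N <- length(y); K <- 2
  V <- XtXinv %*% meat %*% XtXinv * (G / (G - 1)) * ((N - 1) / (N - K))
  fit$coefficients[2] / sqrt(V[2, 2])  # CR1 t-statistic on treat
}

wild_cluster_boot_p <- function(y, treat, cluster, B = 999) {
  t_obs <- cr1_t(y, treat, cluster)
  y0 <- mean(y)                 # restricted fit under H0 (intercept only)
  u0 <- y - y0                  # restricted residuals
  gid <- match(cluster, unique(cluster))
  G <- length(unique(cluster))
  t_boot <- replicate(B, {
    w <- sample(c(-1, 1), G, replace = TRUE)  # one Rademacher weight per cluster
    cr1_t(y0 + w[gid] * u0, treat, cluster)   # re-estimate on bootstrap outcome
  })
  mean(abs(t_boot) >= abs(t_obs))             # bootstrap p-value
}
```

Because each weight is drawn at the cluster level, every bootstrap sample preserves the within-cluster correlation of the residuals.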
Why It Works Better
The wild cluster bootstrap resamples at the cluster level (the weight $w_g$ is constant within each cluster), preserving the within-cluster correlation structure. By imposing the null hypothesis and computing t-statistics (rather than raw coefficients), it provides an asymptotic refinement — the bootstrap distribution approximates the true finite-sample distribution of the test statistic more accurately than the asymptotic normal (or t) approximation.
Webb Weights for Very Few Clusters
With very few clusters ($G \lesssim 12$), even Rademacher weights ($w_g \in \{-1, +1\}$) produce too few distinct bootstrap datasets — only $2^G$ possible sign patterns. Webb (2023) proposed a six-point distribution ($\pm\sqrt{3/2}$, $\pm 1$, $\pm\sqrt{1/2}$, each with probability $1/6$) that provides more variation and better coverage with very small $G$.
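The six Webb weight values are easy to inspect; like Rademacher weights, they have mean zero and unit variance:

```r
# Webb's six-point weight distribution: each of the six values is drawn with
# probability 1/6 (Webb, 2023).
webb <- c(-sqrt(3/2), -1, -sqrt(1/2), sqrt(1/2), 1, sqrt(3/2))

mean(webb)    # effectively 0: mean zero, like Rademacher weights
mean(webb^2)  # 1 (up to floating point): unit variance, like Rademacher weights

# With G = 6 clusters, Rademacher weights allow only 2^6 = 64 distinct
# bootstrap samples; Webb weights allow 6^6 = 46,656.
```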
Code: Wild Cluster Bootstrap
```r
library(fwildclusterboot)
library(fixest)

# Step 1: Estimate the model with fixest
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)

# Step 2: Wild cluster bootstrap
boot_result <- boottest(
  model,
  param = "treatment",   # parameter to test
  clustid = ~state_id,   # cluster variable
  B = 9999,              # number of bootstrap iterations
  type = "rademacher"    # weight type ("rademacher", "webb")
)

# View results
summary(boot_result)
# Reports: bootstrap p-value and confidence interval

# For very few clusters, use Webb weights
boot_webb <- boottest(
  model,
  param = "treatment",
  clustid = ~state_id,
  B = 9999,
  type = "webb"
)
summary(boot_webb)
```

Alternative Solutions for Few Clusters
The wild cluster bootstrap is the most commonly used approach, but other solutions exist for specific settings:
Randomization Inference
When the treatment assignment mechanism is known — as in experiments or well-understood natural experiments — randomization inference provides exact p-values that do not depend on the number of clusters. Permute the treatment assignment across clusters and compute the test statistic for each permutation.
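A minimal base-R sketch, assuming a hypothetical cluster-level setup with one mean outcome `mean_y` and one 0/1 treatment indicator `treat` per cluster (with sampled rather than fully enumerated permutations, the p-value is approximate rather than exact):

```r
# Cluster-level randomization inference: permute which clusters are treated
# and recompute the difference in cluster-mean outcomes. `mean_y` and `treat`
# are hypothetical cluster-level vectors (one entry per cluster).
ri_pvalue <- function(mean_y, treat, n_perm = 2000) {
  obs <- mean(mean_y[treat == 1]) - mean(mean_y[treat == 0])
  perm <- replicate(n_perm, {
    t_perm <- sample(treat)  # re-randomize treatment across clusters
    mean(mean_y[t_perm == 1]) - mean(mean_y[t_perm == 0])
  })
  mean(abs(perm) >= abs(obs))  # two-sided p-value
}
```

Enumerating all assignments instead of sampling them recovers the exact p-value when the number of clusters is small enough.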
Effective Degrees of Freedom (Satterthwaite/BM)
Rather than using $G - 1$ degrees of freedom, the Bell–McCaffrey (BM) correction estimates the effective degrees of freedom from the data via a Satterthwaite approximation, which can be much smaller than $G - 1$ when clusters are unbalanced. Combined with CR2 standard errors, this approach (CR2 + BM) performs well even with moderate numbers of clusters.
(Pustejovsky & Tipton, 2018)

Aggregation to the Cluster Level
A simple and sometimes underappreciated approach: if your treatment varies only at the cluster level, collapse the data to cluster-level means and run OLS on the collapsed data:

$$\bar{y}_g = \alpha + \beta D_g + \bar{\varepsilon}_g$$

where $\bar{y}_g$ is the cluster-level mean outcome and $D_g$ is the cluster-level treatment. With $G$ observations, you use standard (robust) standard errors. This approach is valid when cluster sizes are equal and treatment does not vary within clusters. The drawback is that you lose the ability to control for individual-level covariates (which can improve precision) and that it is less efficient with unequal cluster sizes.
Code: Alternative Approaches
```r
# --- CR2 + Satterthwaite degrees of freedom ---
library(clubSandwich)

model <- lm(outcome ~ treatment + x1 + x2 + factor(state_id),
            data = df)
cr2_test <- coef_test(model,
                      vcov = vcovCR(model,
                                    cluster = df$state_id,
                                    type = "CR2"),
                      test = "Satterthwaite")
print(cr2_test)

# --- Aggregation to cluster level ---
df_collapsed <- df |>
  dplyr::group_by(state_id) |>
  dplyr::summarise(
    outcome_mean = mean(outcome),
    treatment = dplyr::first(treatment),
    x1_mean = mean(x1),
    x2_mean = mean(x2)
  )

model_collapsed <- lm(outcome_mean ~ treatment + x1_mean + x2_mean,
                      data = df_collapsed)

# Use HC2 robust SEs on collapsed data
library(sandwich)
library(lmtest)
coeftest(model_collapsed, vcov = vcovHC(model_collapsed, type = "HC2"))
```

Two-Way Clustering
Sometimes residuals are correlated along two dimensions simultaneously. In panel data, residuals may be correlated within firms (across time) and within years (across firms). Two-way clustering accounts for both sources of dependence.
The two-way cluster-robust variance estimator uses the Cameron–Gelbach–Miller (2011) formula:

$$\widehat{V}_{\text{two-way}} = \widehat{V}_{1} + \widehat{V}_{2} - \widehat{V}_{1 \cap 2}$$

where $\widehat{V}_1$ clusters by the first dimension (e.g., firm), $\widehat{V}_2$ clusters by the second (e.g., year), and $\widehat{V}_{1 \cap 2}$ clusters by the intersection of both dimensions. Subtracting the intersection term avoids double-counting the dependence captured by both one-way estimators.
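A base-R sketch of this formula, with no finite-sample corrections (illustrative only; `fixest` handles the details in practice):

```r
# Two-way clustering a la Cameron-Gelbach-Miller: V = V1 + V2 - V_intersection.
# No CR1-style small-sample corrections are applied here.
oneway_V <- function(X, u, id) {
  bread <- solve(crossprod(X))
  meat <- matrix(0, ncol(X), ncol(X))
  for (g in unique(id)) {  # sum of cluster-score outer products
    s <- crossprod(X[id == g, , drop = FALSE], u[id == g])
    meat <- meat + tcrossprod(s)
  }
  bread %*% meat %*% bread
}

twoway_V <- function(X, u, id1, id2) {
  oneway_V(X, u, id1) + oneway_V(X, u, id2) -
    oneway_V(X, u, interaction(id1, id2, drop = TRUE))
}
```

With one observation per intersection cell (e.g., one row per firm-year), the subtracted term reduces to the heteroscedasticity-robust (HC0) matrix.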
Failure Demo: Naive SEs with Clustered Treatment

With standard errors clustered at the state level (50 states): SE = 0.45, t = 2.22, p = 0.031 — marginally significant. 95% CI: [0.10, 1.90]. Honest uncertainty about the effect.
Concept Checks
You have data on 500,000 students in 40 school districts. A district-level policy was implemented in 20 of the 40 districts. You estimate the policy effect using student-level data and cluster standard errors at the school level (there are 2,000 schools). A colleague says you should cluster at the district level instead. Who is right, and why?
You run a wild cluster bootstrap and obtain a p-value of 0.087, while your conventional cluster-robust p-value is 0.032. Both test the same null hypothesis. Which should you trust more, and what does the discrepancy suggest?
Practical Decision Tree
When you sit down to estimate standard errors, work through these questions:
1. Does the treatment vary at a higher level than the unit of observation?
- Yes: You must cluster. Go to question 2.
- No: Heteroscedasticity-robust SEs are typically sufficient. Consider clustering if there is a sampling design that induces correlation.
2. At what level should you cluster?
- Cluster at the level of treatment variation, or higher if there is a reason (e.g., sampling design).
- When in doubt, cluster at a broader level. It is conservative but valid.
3. How many clusters do you have?
- $G \geq 50$: CR1 standard errors with $t_{G-1}$ critical values are generally fine.
- $30 \leq G < 50$: Use CR2 + Satterthwaite degrees of freedom. Report wild cluster bootstrap as robustness.
- $15 \leq G < 30$: Wild cluster bootstrap is strongly recommended. Report both CR1 and bootstrap results.
- $G < 15$: Bootstrap with Webb weights. Consider aggregation to cluster level or randomization inference.
4. Is there correlation along a second dimension?
- If residuals are correlated within both clusters (e.g., firms) and time periods (e.g., years), consider two-way clustering. Remember: the effective cluster count is the minimum of the two dimensions.
How to Report
A strong reporting practice includes:
> Standard errors are clustered at the state level ($G = 42$), the level at which the policy varies. With 42 clusters, we supplement conventional cluster-robust inference with wild cluster bootstrap p-values (9,999 replications, Rademacher weights) following Cameron, Gelbach, and Miller (2008). The bootstrap p-value for the main treatment effect is 0.041, compared to the conventional clustered p-value of 0.028, confirming that our results are robust to few-cluster concerns.
Key elements to include:
- The level of clustering and why
- The number of clusters ($G$)
- Wild cluster bootstrap results if $G < 50$
- Both conventional and bootstrap p-values for transparency
Paper Library
Foundational (3)
Pustejovsky, J. E., & Tipton, E. (2018). Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.
Develops the CR2 bias-reduced cluster-robust variance estimator for fixed effects models with few clusters. The CR2 correction improves coverage relative to the standard CR1 estimator when the number of clusters is small.
Webb, M. D. (2023). Reworking Wild Bootstrap-Based Inference for Clustered Errors.
Introduces the Webb six-point distribution as an alternative to Rademacher weights for the wild cluster bootstrap. The Webb weights improve finite-sample performance when the number of clusters is very small.
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust Inference with Multiway Clustering.
Extends cluster-robust variance estimation to settings with two-way (or multi-way) clustering. The variance estimator adds the two one-way cluster-robust variance matrices and subtracts the heteroscedasticity-robust matrix.