MethodAtlas
Estimation Stage

Clustering and Few-Cluster Inference

When to cluster standard errors, at what level, and what to do when you have few clusters.

The Clustering Problem

You run a regression of student test scores on a school-level policy. You have 500,000 students in 50 states. Your standard errors look tight. Your p-value is 0.001. You feel confident.

You should not.

The policy varies at the state level. Students within the same state share the same policy, the same funding formula, the same regulatory environment. Their residuals are correlated — not because of anything you did wrong, but because the world is structured in groups. Ignoring that correlation makes your standard errors too small, your t-statistics too large, and your confidence intervals too narrow.

Within-cluster correlation is the clustering problem, and getting it wrong is one of the most common inferential errors in applied research.


The Moulton Factor: Why Naive SEs Fail

The intuition is simple. If observations within clusters are correlated, they contain less independent information than their raw count suggests. Your effective sample size is not the number of observations — it is closer to the number of clusters.

The Moulton factor quantifies exactly how much naive standard errors understate true uncertainty. For a cluster-level regressor, the inflation factor is approximately:

\text{Moulton Factor} = \sqrt{1 + (m - 1)\rho}

where m is the average cluster size and ρ is the intraclass correlation — the fraction of total residual variance that is between-cluster rather than within-cluster.

The numbers are startling. Even a modest intraclass correlation of ρ = 0.05 with an average cluster size of m = 50 yields:

\text{Moulton Factor} = \sqrt{1 + 49 \times 0.05} = \sqrt{3.45} \approx 1.86

Your naive standard errors understate the truth by a factor of 1.86. A t-statistic of 2.5 — seemingly significant at the 5% level — shrinks to 1.34 after correction. Not significant at any conventional level.

(Bertrand et al., 2004)
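As a back-of-the-envelope check, the inflation factor is a one-liner. A minimal Python sketch (the function name is ours, purely illustrative):

```python
import math

def moulton_factor(avg_cluster_size: float, icc: float) -> float:
    """Approximate SE inflation factor for a cluster-level regressor."""
    return math.sqrt(1 + (avg_cluster_size - 1) * icc)

# The example from the text: rho = 0.05, average cluster size m = 50.
factor = moulton_factor(50, 0.05)    # sqrt(1 + 49 * 0.05) = sqrt(3.45)
corrected_t = 2.5 / factor           # deflate the naive t-statistic

print(f"Moulton factor: {factor:.2f}")
print(f"Corrected t:    {corrected_t:.2f}")
```

Any reported t-statistic on a cluster-level regressor can be deflated this way as a quick plausibility check before running a properly clustered specification.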

Why Does Correlation Arise?

Within-cluster correlation of residuals arises from several sources:

  1. Shared treatments. If a policy is implemented at the state level, all individuals in the state experience the same treatment shock. Any state-level unobservable that affects outcomes creates residual correlation.

  2. Shared environments. Students in the same school share teachers, resources, and peer effects. Workers in the same firm share management quality and organizational culture.

  3. Sampling design. If you sample clusters (schools, villages) and then sample individuals within clusters, the sampling itself induces correlation.

  4. Serial correlation. In panel data, the same unit observed over time has correlated residuals across periods. Serial correlation is clustering over time within units.


When to Cluster: The Abadie et al. Framework

Not all regressions require clustered standard errors. The decision of whether and how to cluster is not a mechanical one — it depends on the research design, the source of variation, and the sampling process.

(Abadie et al., 2023)

Abadie, Athey, Imbens, and Wooldridge (2023) provide a principled framework. Their key insight is that clustering addresses two distinct problems, and you should cluster when either applies:

Reason 1: Clustered Treatment Assignment

If treatment is assigned at the cluster level (e.g., a policy enacted at the state level), you must cluster at least at that level. The assignment mechanism induces correlation in treatment status within clusters, which in turn creates correlation in residuals.

Rule: Cluster at the level of treatment assignment or higher.

Reason 2: Clustered Sampling

If you sample clusters from a population of clusters (e.g., you sample 50 schools from 5,000 schools in the country), you should cluster at the sampling level to account for the fact that your clusters are a random draw.

Rule: If clusters are sampled, cluster at the sampling level.

When Not to Cluster

If treatment varies at the individual level and you observe the full population (or a simple random sample) of clusters, clustering may not be necessary. In this case, heteroscedasticity-robust standard errors suffice.

| Setting | Cluster? | Level |
| --- | --- | --- |
| State-level policy, individual data | Yes | State |
| School-level randomization, student outcomes | Yes | School |
| Individual-level randomization, no clustering in sampling | Not required | — (use robust SEs) |
| Firm-level panel, firm FE | Yes | Firm (at minimum) |
| DiD with state-level treatment variation | Yes | State |
| Multi-site RCT, randomized within site | Depends | Site (if sites are sampled from a larger population) |
(Cameron & Miller, 2015)

Cluster-Robust Standard Errors

Cluster-robust standard errors are computed via a sandwich estimator that allows for arbitrary within-cluster correlation. The key formula for the variance of the OLS estimator is:

\widehat{V}_{CR} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}

where g = 1, …, G indexes clusters, X_g is the matrix of regressors for cluster g, and û_g is the vector of residuals for cluster g.
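To make the sandwich concrete, here is a minimal Python sketch for the special case of one regressor and no intercept, where every matrix in the formula collapses to a scalar. It is an illustration of the formula, not a substitute for a proper implementation; the function name and toy data are ours.

```python
def cluster_se(x, y, cluster_ids):
    """OLS slope (one regressor, no intercept) with CR1 cluster-robust SE.

    With a single regressor, (X'X) is the scalar sum(x_i^2) and the
    'meat' is the sum over clusters of squared within-cluster score sums.
    """
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]

    # Cluster-level scores: s_g = sum_{i in g} x_i * u_i
    scores = {}
    for xi, ui, g in zip(x, resid, cluster_ids):
        scores[g] = scores.get(g, 0.0) + xi * ui

    G, n, k = len(scores), len(x), 1
    meat = sum(s * s for s in scores.values())
    v_cr0 = meat / (sxx * sxx)               # CR0: no finite-sample correction
    c = (G / (G - 1)) * ((n - 1) / (n - k))  # CR1 adjustment factor
    return beta, (c * v_cr0) ** 0.5

# Toy data: two clusters of two observations each.
beta, se = cluster_se([1, 1, 2, 2], [1, 2, 3, 5], [1, 1, 2, 2])
print(beta, se)
```

The same bread-meat-bread structure carries over to the full matrix case; only the linear algebra gets heavier.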

CR0 vs. CR1 vs. CR2

Different variants apply different finite-sample corrections:

CR0 (no correction): The formula above with no finite-sample adjustment. Note that Stata's cluster() option is not CR0 by default: it applies the degrees-of-freedom adjustment G/(G − 1) · (N − 1)/(N − K), making it effectively CR1.

CR1 (degrees-of-freedom correction): Multiplies the variance by G/(G − 1), analogous to the HC1 correction for heteroscedasticity-robust standard errors. CR1 is the most commonly used variant in practice.

CR2 (bias-corrected): Applies a cluster-level leverage correction analogous to HC2. The CR2 estimator uses:

\widehat{V}_{CR2} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' A_g \hat{u}_g \hat{u}_g' A_g' X_g \right) (X'X)^{-1}

where A_g is a correction matrix based on the cluster-level hat matrix. CR2 is less biased than CR1 with few clusters and is particularly recommended when cluster sizes are unbalanced.

(Pustejovsky & Tipton, 2018)

Degrees of Freedom

With clustered standard errors, the relevant degrees of freedom for inference is approximately G − 1 (the number of clusters minus one), not N − K (observations minus parameters). This matters enormously when G is small. A t-statistic of 2.1 is significant at 5% with 100 degrees of freedom but not with 10 degrees of freedom (critical value ≈ 2.23).


Code: Cluster-Robust Standard Errors

library(fixest)
library(clubSandwich)

# --- CR1 (default in fixest) ---
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)
summary(model)  # reports CR1 clustered SEs

# --- CR2 (bias-corrected, via clubSandwich) ---
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state_fe),
               data = df)
cr2_vcov <- vcovCR(model_lm, cluster = df$state_id, type = "CR2")
coef_test(model_lm, vcov = cr2_vcov, test = "Satterthwaite")

# --- Two-way clustering (state and year) ---
model_2way <- feols(outcome ~ treatment + x1 + x2 | state_fe + year_fe,
                    data = df,
                    cluster = ~state_id + year_id)
summary(model_2way)

Requires: fixest, clubSandwich

The Few-Cluster Problem

Cluster-robust standard errors have an asymptotic justification: they are consistent as the number of clusters G → ∞. But what happens when G is small?

The answer: things break down. And "small" starts sooner than you might think.

What Counts as "Few"?

There is no sharp cutoff, but the empirical evidence is clear:

| Number of clusters (G) | Reliability of CR1 SEs |
| --- | --- |
| G ≥ 50 | Generally reliable |
| 30 ≤ G < 50 | Moderate concern; CR2 + Satterthwaite df helps |
| 20 ≤ G < 30 | Substantial concern; wild cluster bootstrap recommended |
| 10 ≤ G < 20 | Serious concern; wild cluster bootstrap or randomization inference essential |
| G < 10 | CR1 SEs are unreliable; use bootstrap or RI |
(Cameron et al., 2008)

Why CR1 Fails with Few Clusters

The cluster-robust variance estimator involves a sum over G cluster-level "scores" (the terms X_g' û_g). With few clusters, this sum has high variance — the estimate of the variance is itself highly variable. The result is that:

  1. Standard errors are biased downward. With few clusters, CR1 standard errors tend to be too small, leading to over-rejection of the null.

  2. The t-distribution approximation is poor. Even with the G − 1 degrees-of-freedom correction, the actual sampling distribution of the t-statistic is not well approximated by a t-distribution when G is small.

  3. Coverage of confidence intervals is too low. A 95% confidence interval based on CR1 may have actual coverage of 80% or worse with 10 clusters.

Cameron, Gelbach, and Miller (2008) showed through simulation that with G = 10 clusters, the actual rejection rate of a 5% test using CR1 standard errors can be 15–25% — three to five times the nominal rate.
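The underlying pathology is easy to reproduce by simulation. The sketch below (pure Python; parameter choices are illustrative) generates data with a true effect of zero, a strong cluster component, and a cluster-level treatment, then compares a naive i.i.d. t-test against a valid test on the 10 cluster means. The naive test rejects far more often than the nominal 5%.

```python
import random
import statistics

random.seed(42)

G, m, reps = 10, 30, 200   # clusters, cluster size, Monte Carlo replications
z_crit = 1.96              # naive two-sided 5% critical value (normal)
t_crit_8df = 2.306         # two-sided 5% critical value, t with 8 df

reject_naive = 0
reject_agg = 0
for _ in range(reps):
    # True effect is zero; each cluster gets a shared random effect.
    effects = [random.gauss(0, 1) for _ in range(G)]
    treated_y = [effects[g] + random.gauss(0, 1) for g in range(5) for _ in range(m)]
    control_y = [effects[g] + random.gauss(0, 1) for g in range(5, G) for _ in range(m)]

    diff = statistics.mean(treated_y) - statistics.mean(control_y)

    # Naive SE: pretend all 300 observations are independent.
    s2 = statistics.variance(treated_y + control_y)
    se_naive = (s2 * (1 / len(treated_y) + 1 / len(control_y))) ** 0.5
    if abs(diff / se_naive) > z_crit:
        reject_naive += 1

    # Aggregate to cluster means: a valid 10-observation comparison.
    mt = [statistics.mean(treated_y[i * m:(i + 1) * m]) for i in range(5)]
    mc = [statistics.mean(control_y[i * m:(i + 1) * m]) for i in range(5)]
    se_agg = (statistics.variance(mt) / 5 + statistics.variance(mc) / 5) ** 0.5
    if abs((statistics.mean(mt) - statistics.mean(mc)) / se_agg) > t_crit_8df:
        reject_agg += 1

print(f"Naive (iid) rejection rate:  {reject_naive / reps:.2f}")  # far above 0.05
print(f"Cluster-mean rejection rate: {reject_agg / reps:.2f}")    # near nominal
```

Effective information here lives at the cluster level: the cluster-mean comparison with t(8) critical values holds its size, while the naive test treats 300 correlated observations as independent and over-rejects wildly.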


Wild Cluster Bootstrap

The wild cluster bootstrap is the workhorse solution to the few-cluster problem. Proposed by Cameron, Gelbach, and Miller (2008), it provides more reliable inference than conventional cluster-robust standard errors when the number of clusters is small.

(Cameron et al., 2008)

How It Works

The wild cluster bootstrap operates as follows:

  1. Estimate the model under the null hypothesis H₀: β₁ = 0. Obtain the restricted residuals ũ_i.

  2. Generate bootstrap samples. For each bootstrap iteration b = 1, …, B:

    • Draw a random weight w_g^(b) ∈ {−1, +1} for each cluster g (Rademacher weights). All observations within cluster g receive the same weight.
    • Construct the bootstrap outcome: y_i^(b) = ŷ_i^R + w_g(i)^(b) · ũ_i, where ŷ_i^R is the restricted predicted value.
    • Re-estimate the model on (y_i^(b), X_i) and compute the t-statistic t^(b).
  3. Compute the bootstrap p-value. The p-value is the fraction of bootstrap t-statistics that exceed the observed t-statistic in absolute value:

p_{WCB} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{ |t^{(b)}| \geq |t_{\text{obs}}| \}
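The three steps above are short enough to implement directly. Here is a self-contained Python sketch for a one-regressor model (function names are ours, purely illustrative); with few clusters we can enumerate all 2^G Rademacher sign vectors instead of sampling B of them.

```python
from itertools import product

def slope_and_t(x, y, cluster_ids):
    """OLS slope (with intercept, via demeaning) and its CR1 cluster-robust t-stat."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    xt = [xi - xbar for xi in x]                 # demeaned regressor (FWL)
    sxx = sum(v * v for v in xt)
    beta = sum(v * (yi - ybar) for v, yi in zip(xt, y)) / sxx
    resid = [yi - ybar - beta * v for v, yi in zip(xt, y)]
    scores = {}
    for v, u, g in zip(xt, resid, cluster_ids):
        scores[g] = scores.get(g, 0.0) + v * u
    G = len(scores)
    c = (G / (G - 1)) * ((n - 1) / (n - 2))      # CR1, k = 2 (intercept + slope)
    se = (c * sum(s * s for s in scores.values())) ** 0.5 / sxx
    return beta, beta / se

def wild_cluster_p(x, y, cluster_ids):
    """Wild cluster bootstrap p-value for H0: slope = 0, full enumeration."""
    n = len(y)
    ybar = sum(y) / n                            # restricted fit under H0
    ru = [yi - ybar for yi in y]                 # restricted residuals
    clusters = sorted(set(cluster_ids))
    _, t_obs = slope_and_t(x, y, cluster_ids)

    extreme = total = 0
    for signs in product((-1.0, 1.0), repeat=len(clusters)):
        w = dict(zip(clusters, signs))           # one Rademacher weight per cluster
        y_b = [ybar + w[g] * u for g, u in zip(cluster_ids, ru)]
        _, t_b = slope_and_t(x, y_b, cluster_ids)
        total += 1
        if abs(t_b) >= abs(t_obs):
            extreme += 1
    return extreme / total

# Toy data: 6 clusters of 3 observations, cluster-level treatment.
clusters = [g for g in range(6) for _ in range(3)]
x = [0.0] * 9 + [1.0] * 9
y = [0.5, 1.2, 0.8, -0.3, 0.1, -0.5, 0.9, 0.4, 0.7,
     1.8, 2.1, 1.5, 0.6, 1.0, 0.9, 2.2, 1.7, 2.0]
p = wild_cluster_p(x, y, clusters)
print(f"Wild cluster bootstrap p-value: {p:.3f}")
```

With 6 clusters there are only 2^6 = 64 sign patterns, so the p-value is a multiple of 1/64 — a preview of why Rademacher weights run out of room when G is very small.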

Why It Works Better

The wild cluster bootstrap resamples at the cluster level (the weight is constant within each cluster), preserving the within-cluster correlation structure. By imposing the null hypothesis and computing t-statistics (rather than raw coefficients), it provides an asymptotic refinement — the bootstrap distribution approximates the true finite-sample distribution of the test statistic more accurately than the t(G − 1) approximation.

Webb Weights for Very Few Clusters

With very few clusters (G < 10), even Rademacher weights (±1) produce too few distinct bootstrap datasets: G clusters yield only 2^G distinct sign patterns (512 for G = 9). Webb (2023) proposed a six-point distribution {−√(3/2), −√(2/2), −√(1/2), √(1/2), √(2/2), √(3/2)} that provides more variation and better coverage with very small G.

(Webb, 2023)
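A sketch of the Webb distribution itself (Python; names are ours, purely illustrative). Each of the six points has probability 1/6, and the distribution matches the Rademacher weights' first two moments: mean 0, variance 1.

```python
import math
import random

# The Webb six-point distribution: +/- sqrt(1/2), +/- sqrt(2/2), +/- sqrt(3/2).
WEBB_WEIGHTS = [s * math.sqrt(k / 2) for s in (-1, 1) for k in (1, 2, 3)]

def draw_webb_weights(clusters, rng=random):
    """One weight per cluster, each drawn uniformly from the six Webb points."""
    return {g: rng.choice(WEBB_WEIGHTS) for g in clusters}

# Example draw with a seeded generator (cluster labels are hypothetical).
rng = random.Random(0)
weights = draw_webb_weights(["CA", "TX", "NY"], rng)
print(weights)
```

Swapping these weights in for the ±1 draws of the Rademacher bootstrap gives 6^G rather than 2^G possible bootstrap datasets.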

Code: Wild Cluster Bootstrap

library(fwildclusterboot)
library(fixest)

# Step 1: Estimate the model with fixest
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)

# Step 2: Wild cluster bootstrap
boot_result <- boottest(
  model,
  param = "treatment",     # parameter to test
  clustid = ~state_id,     # cluster variable
  B = 9999,                # number of bootstrap iterations
  type = "rademacher"      # weight type ("rademacher", "webb")
)

# View results
summary(boot_result)
# Reports: bootstrap p-value and confidence interval

# For very few clusters, use Webb weights
boot_webb <- boottest(
  model,
  param = "treatment",
  clustid = ~state_id,
  B = 9999,
  type = "webb"
)
summary(boot_webb)

Requires: fwildclusterboot, fixest

Alternative Solutions for Few Clusters

The wild cluster bootstrap is the most commonly used approach, but other solutions exist for specific settings:

Randomization Inference

When the treatment assignment mechanism is known — as in experiments or well-understood natural experiments — randomization inference provides exact p-values that do not depend on the number of clusters. Permute the treatment assignment across clusters and compute the test statistic for each permutation.
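For a cluster-level treatment, the permutation step is a few lines. A Python sketch (illustrative names, assuming the test statistic is a simple difference in treated vs. control cluster means):

```python
from itertools import combinations

def ri_pvalue(cluster_means, treated_ids):
    """Exact randomization-inference p-value for a cluster-level treatment.

    cluster_means: dict mapping cluster id -> mean outcome
    treated_ids:   set of cluster ids actually treated
    """
    ids = sorted(cluster_means)
    k = len(treated_ids)

    def diff(treated):
        t = [cluster_means[g] for g in ids if g in treated]
        c = [cluster_means[g] for g in ids if g not in treated]
        return sum(t) / len(t) - sum(c) / len(c)

    t_obs = abs(diff(treated_ids))
    # Enumerate every possible assignment of k treated clusters out of G.
    perms = [abs(diff(set(a))) for a in combinations(ids, k)]
    return sum(1 for d in perms if d >= t_obs) / len(perms)

# Toy example: 6 clusters, 3 treated. The observed assignment is always among
# the permutations, so the smallest attainable p-value is 1/C(6,3) = 1/20.
means = {1: 2.1, 2: 1.8, 3: 2.4, 4: 0.9, 5: 1.1, 6: 0.7}
p = ri_pvalue(means, {1, 2, 3})
print(f"RI p-value: {p:.3f}")  # 2/20 = 0.100
```

The p-value is exact under the sharp null of no effect for any unit, and its resolution is limited only by the number of possible assignments, not by G-asymptotics.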

Effective Degrees of Freedom (Satterthwaite/BM)

Rather than using G − 1 degrees of freedom, the Bell-McCaffrey (BM) correction estimates the effective degrees of freedom from the data, which can be much smaller than G − 1 when clusters are unbalanced. Combined with CR2 standard errors, this approach (CR2 + BM) performs well even with moderate numbers of clusters.

(Pustejovsky & Tipton, 2018)

Aggregation to the Cluster Level

A simple and sometimes underappreciated approach: if your treatment varies only at the cluster level, collapse the data to cluster-level means and run OLS on the collapsed data. With G observations, you use standard (robust) standard errors. This approach is valid when cluster sizes are equal and treatment does not vary within clusters.

\bar{Y}_g = \alpha + \beta \cdot D_g + \bar{X}_g' \gamma + \epsilon_g

where Ȳ_g is the cluster-level mean outcome. The drawbacks are that you lose the ability to control for individual-level covariates (which can improve precision) and that the approach is less efficient with unequal cluster sizes.


Code: Alternative Approaches

# --- CR2 + Satterthwaite degrees of freedom ---
library(clubSandwich)

model <- lm(outcome ~ treatment + x1 + x2 + factor(state_id),
            data = df)
cr2_test <- coef_test(model,
                      vcov = vcovCR(model,
                                    cluster = df$state_id,
                                    type = "CR2"),
                      test = "Satterthwaite")
print(cr2_test)

# --- Aggregation to cluster level ---
library(dplyr)

df_collapsed <- df |>
  group_by(state_id) |>
  summarise(
    outcome_mean = mean(outcome),
    treatment = first(treatment),
    x1_mean = mean(x1),
    x2_mean = mean(x2)
  )

model_collapsed <- lm(outcome_mean ~ treatment + x1_mean + x2_mean,
                      data = df_collapsed)
# Use HC2 robust SEs on collapsed data
library(sandwich)
library(lmtest)
coeftest(model_collapsed, vcov = vcovHC(model_collapsed, type = "HC2"))

Two-Way Clustering

Sometimes residuals are correlated along two dimensions simultaneously. In panel data, residuals may be correlated within firms (across time) and within years (across firms). Two-way clustering accounts for both sources of dependence.

The two-way cluster-robust variance estimator uses the Cameron-Gelbach-Miller (2011) formula:

\widehat{V}_{2\text{-way}} = \widehat{V}_{\text{cluster}_1} + \widehat{V}_{\text{cluster}_2} - \widehat{V}_{\text{intersection}}

where V_cluster1 clusters by the first dimension, V_cluster2 clusters by the second, and V_intersection clusters by the intersection of both dimensions.
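A Python sketch of the formula, reusing the scalar one-regressor, no-intercept case from above (CR0 throughout, with no small-sample corrections; function names and toy data are ours):

```python
def cluster_variance(x, resid, ids):
    """One-way CR0 variance of the slope in a one-regressor, no-intercept model."""
    sxx = sum(xi * xi for xi in x)
    scores = {}
    for xi, ui, g in zip(x, resid, ids):
        scores[g] = scores.get(g, 0.0) + xi * ui
    return sum(s * s for s in scores.values()) / (sxx * sxx)

def twoway_variance(x, y, ids_a, ids_b):
    """Cameron-Gelbach-Miller two-way formula: V_a + V_b - V_ab (CR0)."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    v_a = cluster_variance(x, resid, ids_a)
    v_b = cluster_variance(x, resid, ids_b)
    # Intersection clusters are the (a, b) pairs; in a firm-year panel with
    # one observation per cell, this term reduces to heteroscedasticity-robust.
    v_ab = cluster_variance(x, resid, list(zip(ids_a, ids_b)))
    return v_a + v_b - v_ab

# Toy panel: 2 firms x 2 years, one observation per firm-year cell.
v = twoway_variance([1, 1, 2, 2], [1, 2, 3, 5],
                    ids_a=[1, 1, 2, 2],   # firm
                    ids_b=[1, 2, 1, 2])   # year
print(v)
```

One caution worth keeping in mind: because the formula subtracts a matrix, the two-way estimate is not guaranteed to be positive in finite samples.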


Failure Demo

Naive SEs with Clustered Treatment

With state-clustered standard errors (50 states): SE = 0.45, t = 2.22, p = 0.031 — marginally significant. 95% CI: [0.10, 1.90]. Honest uncertainty about the effect.


Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the effect of a state-level minimum wage increase on teen employment. They have individual-level data from 50 states over 10 years (2010-2019). They run:

reghdfe employment min_wage_increase age education, absorb(state year) vce(cluster county)

They report: "The minimum wage increase reduces teen employment by 1.2 percentage points (SE = 0.3, p < 0.001). We cluster standard errors at the county level to account for geographic correlation in employment patterns."

Select all errors you can find:


Concept Checks

Concept Check

You have data on 500,000 students in 40 school districts. A district-level policy was implemented in 20 of the 40 districts. You estimate the policy effect using student-level data and cluster standard errors at the school level (there are 2,000 schools). A colleague says you should cluster at the district level instead. Who is right, and why?

Concept Check

You run a wild cluster bootstrap and obtain a p-value of 0.087, while your conventional cluster-robust p-value is 0.032. Both test the same null hypothesis. Which should you trust more, and what does the discrepancy suggest?


Practical Decision Tree

When you sit down to estimate standard errors, work through these questions:

1. Does the treatment vary at a higher level than the unit of observation?

  • Yes: You must cluster. Go to question 2.
  • No: Heteroscedasticity-robust SEs are typically sufficient. Consider clustering if there is a sampling design that induces correlation.

2. At what level should you cluster?

  • Cluster at the level of treatment variation, or higher if there is a reason (e.g., sampling design).
  • When in doubt, cluster at a broader level. It is conservative but valid.

3. How many clusters do you have?

  • G ≥ 50: CR1 standard errors with t(G − 1) critical values are generally fine.
  • 30 ≤ G < 50: Use CR2 + Satterthwaite degrees of freedom. Report wild cluster bootstrap as robustness.
  • G < 30: Wild cluster bootstrap is strongly recommended. Report both CR1 and bootstrap results.
  • G < 10: Bootstrap with Webb weights. Consider aggregation to cluster level or randomization inference.

4. Is there correlation along a second dimension?

  • If residuals are correlated within both clusters (e.g., firms) and time periods (e.g., years), consider two-way clustering. Remember: the effective cluster count is the minimum of the two dimensions.



How to Report

A strong reporting practice includes:

Standard errors are clustered at the state level (G = 42), the level at which the policy varies. With 42 clusters, we supplement conventional cluster-robust inference with wild cluster bootstrap p-values (9,999 replications, Rademacher weights) following Cameron, Gelbach, and Miller (2008). The bootstrap p-value for the main treatment effect is 0.041, compared to the conventional clustered p-value of 0.028, confirming that our results are robust to few-cluster concerns.

Key elements to include:

  • The level of clustering and why
  • The number of clusters (G)
  • Wild cluster bootstrap results if G < 50
  • Both conventional and bootstrap p-values for transparency

Paper Library

Foundational (3)

Pustejovsky, J. E., & Tipton, E. (2018). Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.

Journal of Business & Economic Statistics. DOI: 10.1080/07350015.2016.1247004

Develops the CR2 bias-reduced cluster-robust variance estimator for fixed effects models with few clusters. The CR2 correction improves coverage relative to the standard CR1 estimator when the number of clusters is small.

Webb, M. D. (2023). Reworking Wild Bootstrap-Based Inference for Clustered Errors.

Canadian Journal of Economics. DOI: 10.1111/caje.12666

Introduces the Webb six-point distribution as an alternative to Rademacher weights for the wild cluster bootstrap. The Webb weights improve finite-sample performance when the number of clusters is very small.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust Inference with Multiway Clustering.

Journal of Business & Economic Statistics. DOI: 10.1198/jbes.2010.07136

Extends cluster-robust variance estimation to settings with two-way (or multi-way) clustering. The variance estimator adds the two one-way cluster-robust variance matrices and subtracts the heteroscedasticity-robust matrix.