MethodAtlas
Estimation Stage

Clustering and Few-Cluster Inference

When to cluster standard errors, at what level, and what to do when you have few clusters.

The Clustering Problem

You run a regression of student test scores on a school-level policy. You have 500,000 students in 50 states. Your standard errors look tight. Your p-value is 0.001. You feel confident.

You should not.

The policy varies at the state level. Students within the same state share the same policy, the same funding formula, the same regulatory environment. Their residuals are correlated — not because of anything you did wrong, but because the world is structured in groups. Ignoring that correlation makes your standard errors too small, your t-statistics too large, and your confidence intervals too narrow.

Within-cluster correlation is the clustering problem, and getting it wrong is one of the most common inferential errors in applied research.


The Moulton Factor: Why Naive SEs Fail

The intuition is simple. If observations within clusters are correlated, they contain less independent information than their raw count suggests. Your effective sample size is not the number of observations — it is closer to the number of clusters.

The Moulton factor quantifies exactly how much naive standard errors understate true uncertainty. For a cluster-level regressor, the inflation factor is approximately:

\text{Moulton Factor} = \sqrt{1 + (m - 1)\rho}

where m is the average cluster size and ρ is the intraclass correlation — the fraction of total residual variance that is between-cluster rather than within-cluster.

The numbers are startling. Even a modest intraclass correlation of ρ = 0.05 with an average cluster size of m = 50 yields:

\text{Moulton Factor} = \sqrt{1 + 49 \times 0.05} = \sqrt{3.45} \approx 1.86

Your naive standard errors understate the truth by a factor of 1.86. A t-statistic of 2.5 — seemingly significant at the 5% level — shrinks to 1.34 after correction. Not significant at any conventional level.

(Bertrand et al., 2004)
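As a back-of-the-envelope check, the inflation factor is a one-liner. A minimal Python sketch (the function name is ours, purely illustrative):

```python
import math

def moulton_factor(avg_cluster_size: float, icc: float) -> float:
    """Approximate SE inflation factor for a cluster-level regressor."""
    return math.sqrt(1 + (avg_cluster_size - 1) * icc)

# The example from the text: rho = 0.05, average cluster size m = 50.
factor = moulton_factor(50, 0.05)    # sqrt(1 + 49 * 0.05) = sqrt(3.45)
corrected_t = 2.5 / factor           # deflate the naive t-statistic

print(f"Moulton factor: {factor:.2f}")
print(f"Corrected t:    {corrected_t:.2f}")
```

Any reported t-statistic on a cluster-level regressor can be deflated this way as a quick plausibility check before running a properly clustered specification.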

Why Does Correlation Arise?

Within-cluster correlation of residuals arises from several sources:

  1. Shared treatments. If a policy is implemented at the state level, all individuals in the state experience the same treatment shock. Any state-level unobservable that affects outcomes creates residual correlation.

  2. Shared environments. Students in the same school share teachers, resources, and peer effects. Workers in the same firm share management quality and organizational culture.

  3. Sampling design. If you sample clusters (schools, villages) and then sample individuals within clusters, the sampling itself induces correlation.

  4. Serial correlation. In panel data, the same unit observed over time has correlated residuals across periods. Serial correlation is clustering over time within units.


When to Cluster: The Abadie et al. Framework

Not all regressions require clustered standard errors. The decision of whether and how to cluster is not a mechanical one — it depends on the research design, the source of variation, and the sampling process.

(Abadie et al., 2023)

Abadie, Athey, Imbens, and Wooldridge (2023) provide a principled framework. Their key insight is that clustering addresses two distinct problems, and you should cluster when either applies:

Reason 1: Clustered Treatment Assignment

If treatment is assigned at the cluster level (e.g., a policy enacted at the state level), you must cluster at least at that level. The assignment mechanism induces correlation in treatment status within clusters, which in turn creates correlation in residuals.

Rule: Cluster at the level of treatment assignment or higher.

Reason 2: Clustered Sampling

If you sample clusters from a population of clusters (e.g., you sample 50 schools from 5,000 schools in the country), you should cluster at the sampling level to account for the fact that your clusters are a random draw.

Rule: If clusters are sampled, cluster at the sampling level.

When Not to Cluster

If treatment varies at the individual level and you observe the full population (or a simple random sample) of clusters, clustering may not be necessary. In this case, heteroscedasticity-robust standard errors suffice.

| Setting | Cluster? | Level |
| --- | --- | --- |
| State-level policy, individual data | Yes | State |
| School-level randomization, student outcomes | Yes | School |
| Individual-level randomization, no clustering in sampling | Not required | — (use robust SEs) |
| Firm-level panel, firm FE | Yes | Firm (at minimum) |
| DiD with state-level treatment variation | Yes | State |
| Multi-site RCT, randomized within site | Depends | Site (if sites are sampled from a larger population) |
(Cameron & Miller, 2015)

Cluster-Robust Standard Errors

Cluster-robust standard errors are computed via a sandwich estimator that allows for arbitrary within-cluster correlation. The key formula for the variance of the OLS estimator is:

\widehat{V}_{CR} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}

where g = 1, …, G indexes clusters, X_g is the matrix of regressors for cluster g, and û_g is the vector of residuals for cluster g.
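To make the sandwich concrete, here is a minimal Python sketch for the special case of one regressor and no intercept, where every matrix in the formula collapses to a scalar. It is an illustration of the formula, not a substitute for a proper implementation; the function name and toy data are ours.

```python
def cluster_se(x, y, cluster_ids):
    """OLS slope (one regressor, no intercept) with CR1 cluster-robust SE.

    With a single regressor, (X'X) is the scalar sum(x_i^2) and the
    'meat' is the sum over clusters of squared within-cluster score sums.
    """
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]

    # Cluster-level scores: s_g = sum_{i in g} x_i * u_i
    scores = {}
    for xi, ui, g in zip(x, resid, cluster_ids):
        scores[g] = scores.get(g, 0.0) + xi * ui

    G, n, k = len(scores), len(x), 1
    meat = sum(s * s for s in scores.values())
    v_cr0 = meat / (sxx * sxx)               # CR0: no finite-sample correction
    c = (G / (G - 1)) * ((n - 1) / (n - k))  # CR1 adjustment factor
    return beta, (c * v_cr0) ** 0.5

# Toy data: two clusters of two observations each.
beta, se = cluster_se([1, 1, 2, 2], [1, 2, 3, 5], [1, 1, 2, 2])
print(beta, se)
```

The same bread-meat-bread structure carries over to the full matrix case; only the linear algebra gets heavier.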

CR0 vs. CR1 vs. CR2

Different variants apply different finite-sample corrections:

CR0 (no correction): The formula above with no finite-sample adjustment. Note that Stata's cluster() option is not CR0 by default: it applies the degrees-of-freedom adjustment G/(G − 1) · (N − 1)/(N − K), making it effectively CR1.

CR1 (degrees-of-freedom correction): Multiplies the variance by G/(G − 1), analogous to the HC1 correction for heteroscedasticity-robust standard errors. CR1 is the most commonly used variant in practice.

CR2 (bias-corrected): Applies a cluster-level leverage correction analogous to HC2. The CR2 estimator uses:

\widehat{V}_{CR2} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' A_g \hat{u}_g \hat{u}_g' A_g' X_g \right) (X'X)^{-1}

where A_g is a correction matrix based on the cluster-level hat matrix. CR2 is less biased than CR1 with few clusters and is particularly recommended when cluster sizes are unbalanced.

(Pustejovsky & Tipton, 2018)

Degrees of Freedom

With clustered standard errors, the relevant degrees of freedom for inference is approximately G − 1 (the number of clusters minus one), not N − K (observations minus parameters). This matters enormously when G is small. A t-statistic of 2.1 is significant at 5% with 100 degrees of freedom but not with 10 degrees of freedom (critical value ≈ 2.23).


Code: Cluster-Robust Standard Errors

library(fixest)
library(clubSandwich)

# --- CR1 (default in fixest) ---
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)
summary(model)  # reports CR1 clustered SEs

# --- CR2 (bias-corrected, via clubSandwich) ---
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state_fe),
               data = df)
cr2_vcov <- vcovCR(model_lm, cluster = df$state_id, type = "CR2")
coef_test(model_lm, vcov = cr2_vcov, test = "Satterthwaite")

# --- Two-way clustering (state and year) ---
model_2way <- feols(outcome ~ treatment + x1 + x2 | state_fe + year_fe,
                    data = df,
                    cluster = ~state_id + year_id)
summary(model_2way)

Requires: fixest, clubSandwich

The Few-Cluster Problem

Cluster-robust standard errors have an asymptotic justification: they are consistent as the number of clusters G → ∞. But what happens when G is small?

The answer: things break down. And "small" starts sooner than you might think.

What Counts as "Few"?

There is no sharp cutoff, but the empirical evidence is clear:

| Number of clusters (G) | Reliability of CR1 SEs |
| --- | --- |
| G ≥ 50 | Generally reliable |
| 30 ≤ G < 50 | Moderate concern; CR2 + Satterthwaite df helps |
| 20 ≤ G < 30 | Substantial concern; wild cluster bootstrap recommended |
| 10 ≤ G < 20 | Serious concern; wild cluster bootstrap or randomization inference essential |
| G < 10 | CR1 SEs are unreliable; use bootstrap or RI |
(Cameron et al., 2008)

Why CR1 Fails with Few Clusters

The cluster-robust variance estimator involves a sum over G cluster-level "scores" (the terms X_g' û_g). With few clusters, this sum has high variance — the estimate of the variance is itself highly variable. The result is that:

  1. Standard errors are biased downward. With few clusters, CR1 standard errors tend to be too small, leading to over-rejection of the null.

  2. The t-distribution approximation is poor. Even with the G − 1 degrees-of-freedom correction, the actual sampling distribution of the t-statistic is not well approximated by a t-distribution when G is small.

  3. Coverage of confidence intervals is too low. A 95% confidence interval based on CR1 may have actual coverage of 80% or worse with 10 clusters.

Cameron, Gelbach, and Miller (2008) showed through simulation that with G = 10 clusters, the actual rejection rate of a 5% test using CR1 standard errors can be 15–25% — three to five times the nominal rate.
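The underlying pathology is easy to reproduce by simulation. The sketch below (pure Python; parameter choices are illustrative) generates data with a true effect of zero, a strong cluster component, and a cluster-level treatment, then compares a naive i.i.d. t-test against a valid test on the 10 cluster means. The naive test rejects far more often than the nominal 5%.

```python
import random
import statistics

random.seed(42)

G, m, reps = 10, 30, 200   # clusters, cluster size, Monte Carlo replications
z_crit = 1.96              # naive two-sided 5% critical value (normal)
t_crit_8df = 2.306         # two-sided 5% critical value, t with 8 df

reject_naive = 0
reject_agg = 0
for _ in range(reps):
    # True effect is zero; each cluster gets a shared random effect.
    effects = [random.gauss(0, 1) for _ in range(G)]
    treated_y = [effects[g] + random.gauss(0, 1) for g in range(5) for _ in range(m)]
    control_y = [effects[g] + random.gauss(0, 1) for g in range(5, G) for _ in range(m)]

    diff = statistics.mean(treated_y) - statistics.mean(control_y)

    # Naive SE: pretend all 300 observations are independent.
    s2 = statistics.variance(treated_y + control_y)
    se_naive = (s2 * (1 / len(treated_y) + 1 / len(control_y))) ** 0.5
    if abs(diff / se_naive) > z_crit:
        reject_naive += 1

    # Aggregate to cluster means: a valid 10-observation comparison.
    mt = [statistics.mean(treated_y[i * m:(i + 1) * m]) for i in range(5)]
    mc = [statistics.mean(control_y[i * m:(i + 1) * m]) for i in range(5)]
    se_agg = (statistics.variance(mt) / 5 + statistics.variance(mc) / 5) ** 0.5
    if abs((statistics.mean(mt) - statistics.mean(mc)) / se_agg) > t_crit_8df:
        reject_agg += 1

print(f"Naive (iid) rejection rate:  {reject_naive / reps:.2f}")  # far above 0.05
print(f"Cluster-mean rejection rate: {reject_agg / reps:.2f}")    # near nominal
```

Effective information here lives at the cluster level: the cluster-mean comparison with t(8) critical values holds its size, while the naive test treats 300 correlated observations as independent and over-rejects wildly.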


Wild Cluster Bootstrap

The wild cluster bootstrap is the workhorse solution to the few-cluster problem. Proposed by Cameron, Gelbach, and Miller (2008), it provides more reliable inference than conventional cluster-robust standard errors when the number of clusters is small.

(Cameron et al., 2008)

How It Works

The wild cluster bootstrap operates as follows:

  1. Estimate the model under the null hypothesis H₀: β₁ = 0. Obtain the restricted residuals ũ_i.

  2. Generate bootstrap samples. For each bootstrap iteration b = 1, …, B:

    • Draw a random weight w_g^(b) ∈ {−1, +1} for each cluster g (Rademacher weights). All observations within cluster g receive the same weight.
    • Construct the bootstrap outcome: y_i^(b) = ŷ_i^R + w_g(i)^(b) · ũ_i, where ŷ_i^R is the restricted predicted value.
    • Re-estimate the model on (y_i^(b), X_i) and compute the t-statistic t^(b).
  3. Compute the bootstrap p-value. The p-value is the fraction of bootstrap t-statistics that exceed the observed t-statistic in absolute value:

p_{WCB} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{ |t^{(b)}| \geq |t_{\text{obs}}| \}
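The three steps above are short enough to implement directly. Here is a self-contained Python sketch for a one-regressor model (function names are ours, purely illustrative); with few clusters we can enumerate all 2^G Rademacher sign vectors instead of sampling B of them.

```python
from itertools import product

def slope_and_t(x, y, cluster_ids):
    """OLS slope (with intercept, via demeaning) and its CR1 cluster-robust t-stat."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    xt = [xi - xbar for xi in x]                 # demeaned regressor (FWL)
    sxx = sum(v * v for v in xt)
    beta = sum(v * (yi - ybar) for v, yi in zip(xt, y)) / sxx
    resid = [yi - ybar - beta * v for v, yi in zip(xt, y)]
    scores = {}
    for v, u, g in zip(xt, resid, cluster_ids):
        scores[g] = scores.get(g, 0.0) + v * u
    G = len(scores)
    c = (G / (G - 1)) * ((n - 1) / (n - 2))      # CR1, k = 2 (intercept + slope)
    se = (c * sum(s * s for s in scores.values())) ** 0.5 / sxx
    return beta, beta / se

def wild_cluster_p(x, y, cluster_ids):
    """Wild cluster bootstrap p-value for H0: slope = 0, full enumeration."""
    n = len(y)
    ybar = sum(y) / n                            # restricted fit under H0
    ru = [yi - ybar for yi in y]                 # restricted residuals
    clusters = sorted(set(cluster_ids))
    _, t_obs = slope_and_t(x, y, cluster_ids)

    extreme = total = 0
    for signs in product((-1.0, 1.0), repeat=len(clusters)):
        w = dict(zip(clusters, signs))           # one Rademacher weight per cluster
        y_b = [ybar + w[g] * u for g, u in zip(cluster_ids, ru)]
        _, t_b = slope_and_t(x, y_b, cluster_ids)
        total += 1
        if abs(t_b) >= abs(t_obs):
            extreme += 1
    return extreme / total

# Toy data: 6 clusters of 3 observations, cluster-level treatment.
clusters = [g for g in range(6) for _ in range(3)]
x = [0.0] * 9 + [1.0] * 9
y = [0.5, 1.2, 0.8, -0.3, 0.1, -0.5, 0.9, 0.4, 0.7,
     1.8, 2.1, 1.5, 0.6, 1.0, 0.9, 2.2, 1.7, 2.0]
p = wild_cluster_p(x, y, clusters)
print(f"Wild cluster bootstrap p-value: {p:.3f}")
```

With 6 clusters there are only 2^6 = 64 sign patterns, so the p-value is a multiple of 1/64 — a preview of why Rademacher weights run out of room when G is very small.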

Why It Works Better

The wild cluster bootstrap resamples at the cluster level (the weight is constant within each cluster), preserving the within-cluster correlation structure. By imposing the null hypothesis and computing t-statistics (rather than raw coefficients), it provides an asymptotic refinement — the bootstrap distribution approximates the true finite-sample distribution of the test statistic more accurately than the t(G − 1) approximation.

Webb Weights for Very Few Clusters

With very few clusters (G < 10), even Rademacher weights (±1) produce too few distinct bootstrap datasets: G clusters yield only 2^G distinct sign patterns (512 for G = 9). Webb (2023) proposed a six-point distribution {−√(3/2), −√(2/2), −√(1/2), √(1/2), √(2/2), √(3/2)} that provides more variation and better coverage with very small G.

(Webb, 2023)
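A sketch of the Webb distribution itself (Python; names are ours, purely illustrative). Each of the six points has probability 1/6, and the distribution matches the Rademacher weights' first two moments: mean 0, variance 1.

```python
import math
import random

# The Webb six-point distribution: +/- sqrt(1/2), +/- sqrt(2/2), +/- sqrt(3/2).
WEBB_WEIGHTS = [s * math.sqrt(k / 2) for s in (-1, 1) for k in (1, 2, 3)]

def draw_webb_weights(clusters, rng=random):
    """One weight per cluster, each drawn uniformly from the six Webb points."""
    return {g: rng.choice(WEBB_WEIGHTS) for g in clusters}

# Example draw with a seeded generator (cluster labels are hypothetical).
rng = random.Random(0)
weights = draw_webb_weights(["CA", "TX", "NY"], rng)
print(weights)
```

Swapping these weights in for the ±1 draws of the Rademacher bootstrap gives 6^G rather than 2^G possible bootstrap datasets.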

Code: Wild Cluster Bootstrap

library(fwildclusterboot)
library(fixest)

# Step 1: Estimate the model with fixest
model <- feols(outcome ~ treatment + x1 + x2 | state_fe,
               data = df,
               cluster = ~state_id)

# Step 2: Wild cluster bootstrap
boot_result <- boottest(
  model,
  param = "treatment",     # parameter to test
  clustid = ~state_id,     # cluster variable
  B = 9999,                # number of bootstrap iterations
  type = "rademacher"      # weight type ("rademacher", "webb")
)

# View results
summary(boot_result)
# Reports: bootstrap p-value and confidence interval

# For very few clusters, use Webb weights
boot_webb <- boottest(
  model,
  param = "treatment",
  clustid = ~state_id,
  B = 9999,
  type = "webb"
)
summary(boot_webb)

Requires: fwildclusterboot, fixest

Alternative Solutions for Few Clusters

The wild cluster bootstrap is the most commonly used approach, but other solutions exist for specific settings:

Randomization Inference

When the treatment assignment mechanism is known — as in experiments or well-understood natural experiments — randomization inference provides exact p-values that do not depend on the number of clusters. Permute the treatment assignment across clusters and compute the test statistic for each permutation.
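For a cluster-level treatment, the permutation step is a few lines. A Python sketch (illustrative names, assuming the test statistic is a simple difference in treated vs. control cluster means):

```python
from itertools import combinations

def ri_pvalue(cluster_means, treated_ids):
    """Exact randomization-inference p-value for a cluster-level treatment.

    cluster_means: dict mapping cluster id -> mean outcome
    treated_ids:   set of cluster ids actually treated
    """
    ids = sorted(cluster_means)
    k = len(treated_ids)

    def diff(treated):
        t = [cluster_means[g] for g in ids if g in treated]
        c = [cluster_means[g] for g in ids if g not in treated]
        return sum(t) / len(t) - sum(c) / len(c)

    t_obs = abs(diff(treated_ids))
    # Enumerate every possible assignment of k treated clusters out of G.
    perms = [abs(diff(set(a))) for a in combinations(ids, k)]
    return sum(1 for d in perms if d >= t_obs) / len(perms)

# Toy example: 6 clusters, 3 treated. The observed assignment is always among
# the permutations, so the smallest attainable p-value is 1/C(6,3) = 1/20.
means = {1: 2.1, 2: 1.8, 3: 2.4, 4: 0.9, 5: 1.1, 6: 0.7}
p = ri_pvalue(means, {1, 2, 3})
print(f"RI p-value: {p:.3f}")  # 2/20 = 0.100
```

The p-value is exact under the sharp null of no effect for any unit, and its resolution is limited only by the number of possible assignments, not by G-asymptotics.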

Effective Degrees of Freedom (Satterthwaite/BM)

Rather than using G − 1 degrees of freedom, the Bell-McCaffrey (BM) correction estimates the effective degrees of freedom from the data, which can be much smaller than G − 1 when clusters are unbalanced. Combined with CR2 standard errors, this approach (CR2 + BM) performs well even with moderate numbers of clusters.

(Pustejovsky & Tipton, 2018)

Aggregation to the Cluster Level

A simple and sometimes underappreciated approach: if your treatment varies only at the cluster level, collapse the data to cluster-level means and run OLS on the collapsed data. With G observations, you use standard (robust) standard errors. This approach is valid when cluster sizes are equal and treatment does not vary within clusters.

\bar{Y}_g = \alpha + \beta \cdot D_g + \bar{X}_g' \gamma + \epsilon_g

where Ȳ_g is the cluster-level mean outcome. The drawbacks are that you lose the ability to control for individual-level covariates (which can improve precision) and that the approach is less efficient with unequal cluster sizes.


Code: Alternative Approaches

# --- CR2 + Satterthwaite degrees of freedom ---
library(clubSandwich)

model <- lm(outcome ~ treatment + x1 + x2 + factor(state_id),
            data = df)
cr2_test <- coef_test(model,
                      vcov = vcovCR(model,
                                    cluster = df$state_id,
                                    type = "CR2"),
                      test = "Satterthwaite")
print(cr2_test)

# --- Aggregation to cluster level ---
library(dplyr)

df_collapsed <- df |>
  group_by(state_id) |>
  summarise(
    outcome_mean = mean(outcome),
    treatment = first(treatment),
    x1_mean = mean(x1),
    x2_mean = mean(x2)
  )

model_collapsed <- lm(outcome_mean ~ treatment + x1_mean + x2_mean,
                      data = df_collapsed)
# Use HC2 robust SEs on collapsed data
library(sandwich)
library(lmtest)
coeftest(model_collapsed, vcov = vcovHC(model_collapsed, type = "HC2"))

Two-Way Clustering

Sometimes residuals are correlated along two dimensions simultaneously. In panel data, residuals may be correlated within firms (across time) and within years (across firms). Two-way clustering accounts for both sources of dependence.

The two-way cluster-robust variance estimator uses the Cameron-Gelbach-Miller (2011) formula:

\widehat{V}_{2\text{-way}} = \widehat{V}_{\text{cluster}_1} + \widehat{V}_{\text{cluster}_2} - \widehat{V}_{\text{intersection}}

where V_cluster1 clusters by the first dimension, V_cluster2 clusters by the second, and V_intersection clusters by the intersection of both dimensions.
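A Python sketch of the formula, reusing the scalar one-regressor, no-intercept case from above (CR0 throughout, with no small-sample corrections; function names and toy data are ours):

```python
def cluster_variance(x, resid, ids):
    """One-way CR0 variance of the slope in a one-regressor, no-intercept model."""
    sxx = sum(xi * xi for xi in x)
    scores = {}
    for xi, ui, g in zip(x, resid, ids):
        scores[g] = scores.get(g, 0.0) + xi * ui
    return sum(s * s for s in scores.values()) / (sxx * sxx)

def twoway_variance(x, y, ids_a, ids_b):
    """Cameron-Gelbach-Miller two-way formula: V_a + V_b - V_ab (CR0)."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    v_a = cluster_variance(x, resid, ids_a)
    v_b = cluster_variance(x, resid, ids_b)
    # Intersection clusters are the (a, b) pairs; in a firm-year panel with
    # one observation per cell, this term reduces to heteroscedasticity-robust.
    v_ab = cluster_variance(x, resid, list(zip(ids_a, ids_b)))
    return v_a + v_b - v_ab

# Toy panel: 2 firms x 2 years, one observation per firm-year cell.
v = twoway_variance([1, 1, 2, 2], [1, 2, 3, 5],
                    ids_a=[1, 1, 2, 2],   # firm
                    ids_b=[1, 2, 1, 2])   # year
print(v)
```

One caution worth keeping in mind: because the formula subtracts a matrix, the two-way estimate is not guaranteed to be positive in finite samples.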


Failure Demo

Naive SEs with Clustered Treatment

With state-clustered standard errors (50 states): SE = 0.45, t = 2.22, p = 0.031 — marginally significant. 95% CI: [0.10, 1.90]. Honest uncertainty about the effect.


Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the effect of a state-level minimum wage increase on teen employment. They have individual-level data from 50 states over 10 years (2010-2019). They run:

reghdfe employment min_wage_increase age education, absorb(state year) vce(cluster county)

They report: "The minimum wage increase reduces teen employment by 1.2 percentage points (SE = 0.3, p < 0.001). We cluster standard errors at the county level to account for geographic correlation in employment patterns."

Select all errors you can find:


Concept Checks

Concept Check

You have data on 500,000 students in 40 school districts. A district-level policy was implemented in 20 of the 40 districts. You estimate the policy effect using student-level data and cluster standard errors at the school level (there are 2,000 schools). A colleague says you should cluster at the district level instead. Who is right, and why?

Concept Check

You run a wild cluster bootstrap and obtain a p-value of 0.087, while your conventional cluster-robust p-value is 0.032. Both test the same null hypothesis. Which should you trust more, and what does the discrepancy suggest?


Practical Decision Tree

When you sit down to estimate standard errors, work through these questions:

1. Does the treatment vary at a higher level than the unit of observation?

  • Yes: You must cluster. Go to question 2.
  • No: Heteroscedasticity-robust SEs are typically sufficient. Consider clustering if there is a sampling design that induces correlation.

2. At what level should you cluster?

  • Cluster at the level of treatment variation, or higher if there is a reason (e.g., sampling design).
  • When in doubt, cluster at a broader level. It is conservative but valid.

3. How many clusters do you have?

  • G ≥ 50: CR1 standard errors with t(G − 1) critical values are generally fine.
  • 30 ≤ G < 50: Use CR2 + Satterthwaite degrees of freedom. Report wild cluster bootstrap as robustness.
  • G < 30: Wild cluster bootstrap is strongly recommended. Report both CR1 and bootstrap results.
  • G < 10: Bootstrap with Webb weights. Consider aggregation to cluster level or randomization inference.

4. Is there correlation along a second dimension?

  • If residuals are correlated within both clusters (e.g., firms) and time periods (e.g., years), consider two-way clustering. Remember: the effective cluster count is the minimum of the two dimensions.



How to Report

A strong reporting practice includes:

Standard errors are clustered at the state level (G = 42), the level at which the policy varies. With 42 clusters, we supplement conventional cluster-robust inference with wild cluster bootstrap p-values (9,999 replications, Rademacher weights) following Cameron, Gelbach, and Miller (2008). The bootstrap p-value for the main treatment effect is 0.041, compared to the conventional clustered p-value of 0.028, confirming that our results are robust to few-cluster concerns.

Key elements to include:

  • The level of clustering and why
  • The number of clusters (G)
  • Wild cluster bootstrap results if G < 50
  • Both conventional and bootstrap p-values for transparency

Paper Library

Foundational (3)

Pustejovsky, J. E., & Tipton, E. (2018). Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.

Journal of Business & Economic Statistics. DOI: 10.1080/07350015.2016.1247004

Develops the CR2 bias-reduced cluster-robust variance estimator for fixed effects models with few clusters. The CR2 correction improves coverage relative to the standard CR1 estimator when the number of clusters is small.

Webb, M. D. (2023). Reworking Wild Bootstrap-Based Inference for Clustered Errors.

Canadian Journal of Economics. DOI: 10.1111/caje.12666

Introduces the Webb six-point distribution as an alternative to Rademacher weights for the wild cluster bootstrap. The Webb weights improve finite-sample performance when the number of clusters is very small.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust Inference with Multiway Clustering.

Journal of Business & Economic Statistics. DOI: 10.1198/jbes.2010.07136

Extends cluster-robust variance estimation to settings with two-way (or multi-way) clustering. The variance estimator adds the two one-way cluster-robust variance matrices and subtracts the heteroscedasticity-robust matrix.