Choosing Your Standard Errors
When to use heteroscedasticity-robust, clustered, two-way clustered, Conley spatial, or wild bootstrap standard errors. Decision tree with code for every SE type in R, Python, and Stata.
Why Standard Errors Matter
Your point estimate is only half the story. The standard error determines the width of your confidence interval, the magnitude of your t-statistic, and whether your result is statistically distinguishable from zero. Using the wrong standard errors can make a null result look significant or a real effect look insignificant.
Getting standard errors right is not a technicality -- it is a core part of credible inference. This guide walks through the major standard error choices you will face in applied work, explains when each is appropriate, and provides implementation code in R, Python, and Stata.
HC0--HC3: Heteroscedasticity-Robust Standard Errors
The Problem with Default Standard Errors
Classical OLS standard errors assume homoscedasticity: the variance of the error term is constant across all observations. In practice, this assumption almost never holds. When errors are heteroscedastic, classical standard errors are inconsistent -- they can be too large or too small, and you cannot predict which direction without knowing the form of heteroscedasticity.
Heteroscedasticity-robust standard errors (also called White standard errors or sandwich standard errors) provide consistent estimates of the variance without requiring any assumption about the error structure.
The HC Family
There are several variants of the heteroscedasticity-consistent (HC) estimator, and they differ in how they handle finite-sample adjustments:
| Variant | Formula Adjustment | Properties |
|---|---|---|
| HC0 | None (raw White estimator) | Consistent but downward-biased in finite samples |
| HC1 | Multiplies HC0 by n/(n-k) | Stata's default vce(robust); simple degrees-of-freedom correction |
| HC2 | Divides squared residuals by (1 - h_ii) | Unbiased under homoscedasticity; better for moderate samples |
| HC3 | Divides squared residuals by (1 - h_ii)^2 | Jackknife-like correction; best finite-sample performance; the default type in R's sandwich::vcovHC |
Here h_ii is the i-th diagonal element of the hat matrix H = X(X'X)^(-1)X', which measures the leverage of observation i.
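To make the table concrete, here is a from-scratch Python sketch (illustrative, not the guide's own code) that computes all four variants on simulated heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
# heteroscedastic errors: the variance grows with |x1|
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage h_ii

def hc_se(u2):
    """Sandwich SEs from (possibly reweighted) squared residuals u2."""
    meat = X.T @ (X * u2[:, None])
    return np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

se_hc0 = hc_se(u**2)                        # raw White estimator
se_hc1 = se_hc0 * np.sqrt(n / (n - k))      # degrees-of-freedom correction
se_hc2 = hc_se(u**2 / (1 - h))              # leverage correction
se_hc3 = hc_se(u**2 / (1 - h)**2)           # jackknife-like correction
```

Because every leverage h_ii is strictly positive, the variants are ordered: HC3 weights the high-leverage residuals most heavily and thus yields the largest standard errors.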
Which HC Variant to Use
- HC1 is a safe default and is the standard in economics (it matches Stata's `robust` option). Use this for comparability with most published work.
- HC3 has the best finite-sample properties and is recommended by Long and Ervin (2000) for samples with n < 250. In R, `fixest`'s robust option corresponds to HC1; for HC3, use `sandwich::vcovHC(..., type = "HC3")`.
- HC0 should generally be avoided -- its finite-sample bias is unnecessary to accept.
- HC2 is a middle ground; it is exactly unbiased under homoscedasticity.
For most applied work, the choice between HC1 and HC3 is second-order. The first-order decision is to use some robust standard error rather than the classical default.
library(fixest)
# Using fixest (recommended for panel/applied work)
model <- feols(outcome ~ treatment + x1 + x2, data = df,
vcov = "HC1")
summary(model)
# Using sandwich + lmtest (base R workflow)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2, data = df)
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC1"))
# HC3 for small samples: fixest's built-in robust VCOV is HC1,
# so use sandwich's vcovHC for HC3
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC3"))

Clustering Standard Errors
When to Cluster
Clustering accounts for within-group correlation in errors. You should cluster when errors are correlated within groups -- typically because units within a group share common shocks, or because treatment is assigned at the group level.
The key insight comes from Abadie et al. (2023), who formalize when clustering is necessary. Their framework identifies two distinct reasons to cluster:
- Sampling design: If you sample clusters (e.g., schools) and then observe all or many units within each cluster (e.g., students), you need to cluster to account for the fact that your observations are not independent draws from the population.
- Treatment assignment: If treatment is assigned at the cluster level (e.g., a state-level policy), you must cluster at that level to avoid overstating the effective sample size. Your effective number of independent observations is the number of clusters, not the number of individual units.
The practical rule: cluster at the level at which treatment varies or at which there are common shocks to the error term. When in doubt, cluster at the more aggregate level. Clustering at too fine a level leads to standard errors that are too small; clustering at too coarse a level produces conservative but valid inference.
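For intuition, the cluster-robust estimator replaces the observation-level "meat" of the sandwich with outer products of cluster-level score sums. A minimal numpy sketch on simulated data (illustrative names, not the guide's code) shows how much a within-cluster common shock inflates the standard errors relative to heteroscedasticity-robust SEs:

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g = 40, 25                          # 40 clusters of 25 units
n = G * n_g
cluster = np.repeat(np.arange(G), n_g)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# errors share a common within-cluster shock
y = X @ np.array([1.0, 0.5]) + rng.normal(size=G)[cluster] + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# meat: sum over clusters of outer products of cluster score sums X_g' u_g
meat = np.zeros((2, 2))
for g in range(G):
    m = cluster == g
    s = X[m].T @ u[m]
    meat += np.outer(s, s)

# finite-sample correction used by Stata's vce(cluster)
c = (G / (G - 1)) * ((n - 1) / (n - 2))
se_cluster = np.sqrt(np.diag(c * XtX_inv @ meat @ XtX_inv))

# heteroscedasticity-robust (HC0) SEs ignore the within-cluster correlation
se_hc0 = np.sqrt(np.diag(XtX_inv @ (X.T @ (X * u[:, None] ** 2)) @ XtX_inv))
```

With these simulated shocks, the clustered SE on the intercept is several times the HC0 SE: exactly the understatement you get by failing to cluster.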
At What Level to Cluster
| Scenario | Cluster Level | Reason |
|---|---|---|
| State-level policy, individual-level data | State | Treatment varies at state level |
| Classroom-randomized experiment | Classroom | Treatment assigned at classroom level |
| Firm-level panel (firm-year observations) | Firm | Serial correlation within firms over time |
| DiD with state-level treatment | State | Treatment varies at state level |
| IV with region-level instrument | Region | Instrument varies at region level |
library(fixest)
# Cluster at the state level
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
data = df,
cluster = ~state)
summary(model)
# Equivalent using sandwich (for lm objects)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm, cluster = df$state))

Worked example: you are estimating a difference-in-differences model where a state-level minimum wage increase is the treatment and the unit of observation is individual workers across 50 states and 10 years. Cluster at the state level -- treatment varies at the state level, so within-state observations do not carry independent information about the treatment effect.
Few-Cluster Inference
The Problem with Few Clusters
Standard cluster-robust standard errors rely on asymptotics in the number of clusters. When you have few clusters -- a common rule of thumb is fewer than 30 to 50 -- these asymptotics break down. The cluster-robust variance estimator is biased downward, and the t-distribution approximation is too liberal. You reject the null too often.
The few-cluster problem is not a hypothetical concern. Many credible research designs have few clusters: U.S. Census divisions (9), Canadian provinces (10), treatment arms in a clustered experiment (15 schools per arm), or countries in a cross-national study.
Solution 1: Wild Cluster Bootstrap
The wild cluster bootstrap, proposed by Cameron, Gelbach, and Miller (2008), resamples at the cluster level using Rademacher weights. It works well with as few as 6 clusters (though performance degrades below that). The key idea is to impose the null hypothesis when generating bootstrap samples, which improves finite-sample performance.
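The mechanics can be sketched in a few lines of Python (illustrative only; in practice use the packages below). With a single regressor and the null beta = 0, the restricted model is an intercept-only fit, and each bootstrap draw flips the sign of all residuals within a cluster:

```python
import numpy as np

rng = np.random.default_rng(2)
G, n_g = 10, 30
n = G * n_g
cluster = np.repeat(np.arange(G), n_g)
x = rng.normal(size=n) + rng.normal(size=G)[cluster]
y = rng.normal(size=G)[cluster] + rng.normal(size=n)   # true effect = 0
X = np.column_stack([np.ones(n), x])

def t_stat(y_):
    """Cluster-robust t-statistic on the slope coefficient."""
    b = np.linalg.solve(X.T @ X, X.T @ y_)
    u = y_ - X @ b
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = sum(np.outer(X[cluster == g].T @ u[cluster == g],
                        X[cluster == g].T @ u[cluster == g])
               for g in range(G))
    se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return b[1] / se[1]

# restricted residuals: impose the null (slope = 0), intercept-only fit
b0 = y.mean()
u0 = y - b0

t_obs = t_stat(y)
B = 999
t_boot = np.empty(B)
for i in range(B):
    w = rng.choice([-1.0, 1.0], size=G)[cluster]   # Rademacher draw per cluster
    t_boot[i] = t_stat(b0 + u0 * w)

p = np.mean(np.abs(t_boot) >= np.abs(t_obs))       # bootstrap p-value
```

Imposing the null when constructing the bootstrap samples (the "WCR" variant) is what gives the procedure its good performance with very few clusters.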
library(fwildclusterboot)
library(fixest)
# First, estimate the model
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
data = df, cluster = ~state)
# Wild cluster bootstrap p-value and CI
boot_result <- boottest(
model,
param = "treatment",
clustid = "state",
B = 9999, # number of bootstrap iterations
type = "rademacher" # Rademacher weights (default)
)
summary(boot_result)

Solution 2: CR2 Bias-Corrected Cluster Variance
The CR2 correction (also called the Bell-McCaffrey correction) adjusts the cluster-robust variance estimator to reduce its small-sample bias and uses Satterthwaite degrees of freedom for t-tests. CR2 is the clustered analog of HC2 for heteroscedasticity-robust SEs.
library(clubSandwich)
# Estimate base model
model <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
factor(year), data = df)
# CR2 standard errors with Satterthwaite df
cr2 <- coef_test(model, vcov = "CR2", cluster = df$state,
test = "Satterthwaite")
print(cr2)

Solution 3: Randomization Inference
When the number of clusters is very small (fewer than 10), even the wild cluster bootstrap may not perform well. Randomization inference provides an exact test that does not rely on asymptotic approximations. It is especially natural for experimental designs where the researcher controls treatment assignment.
The idea is simple: under the sharp null of no treatment effect, you know exactly what the outcome would have been under any treatment assignment. You enumerate (or sample from) all possible random assignments, compute the test statistic under each, and compare your observed statistic to this reference distribution.
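In the cluster-randomized case this takes only a few lines. A numpy sketch (hypothetical data: 8 schools, 4 treated) that enumerates all C(8, 4) = 70 assignments:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
G, n_g = 8, 200                          # 8 schools, 200 students each
school = np.repeat(np.arange(G), n_g)
treated = [0, 1, 2, 3]                   # observed assignment: 4 of 8 treated
D = np.isin(school, treated).astype(float)
y = 0.3 * D + rng.normal(size=G)[school] + rng.normal(size=G * n_g)

obs_diff = y[D == 1].mean() - y[D == 0].mean()

# under the sharp null, recompute the statistic for every assignment
ref = []
for combo in combinations(range(G), 4):
    d = np.isin(school, list(combo))
    ref.append(y[d].mean() - y[~d].mean())
ref = np.array(ref)

# exact p-value: share of assignments at least as extreme as observed
p_exact = np.mean(np.abs(ref) >= np.abs(obs_diff))
```

Because the observed assignment is one of the 70, the smallest attainable p-value is 1/70; this discreteness is the price of exactness with so few clusters.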
Worked example: you are running a cluster-randomized trial where 8 schools are randomly assigned to treatment (4 schools) and control (4 schools), with 200 students per school. You estimate the treatment effect and cluster standard errors at the school level, and a reviewer objects that your inference may be invalid. The reviewer is right: with only 8 clusters, the asymptotics behind cluster-robust standard errors break down. Use randomization inference, which is exact here because you control the assignment mechanism, and report the wild cluster bootstrap or CR2 as robustness checks.
Two-Way Clustering
When You Need It
Sometimes errors are correlated along two dimensions simultaneously. The classic example is firm-year panel data: errors are correlated within firms over time (serial correlation) and across firms within the same year (common year shocks). One-way clustering handles one dimension but not both.
Cameron et al. (2011) show how to construct standard errors that are robust to arbitrary correlation along two dimensions. The two-way cluster-robust variance estimator is

V_twoway = V_1 + V_2 - V_(1∩2),

where V_1 clusters on the first dimension, V_2 clusters on the second, and V_(1∩2) clusters on the intersection (which corrects for double-counting).
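The formula translates directly into code. A numpy sketch on a simulated firm-year panel (illustrative; no small-sample corrections applied):

```python
import numpy as np

rng = np.random.default_rng(4)
F, T = 60, 12                               # firms x years, one obs per cell
firm = np.repeat(np.arange(F), T)
year = np.tile(np.arange(T), F)
n = F * T
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (X @ np.array([1.0, 0.4])
     + rng.normal(size=F)[firm]             # firm shock (serial correlation)
     + rng.normal(size=T)[year]             # year shock (cross-sectional)
     + rng.normal(size=n))

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

def cluster_vcov(ids):
    """One-way cluster-robust VCOV for a given grouping."""
    meat = np.zeros((2, 2))
    for g in np.unique(ids):
        s = X[ids == g].T @ u[ids == g]
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

# V_twoway = V_firm + V_year - V_(firm ∩ year)
fy_id = firm * T + year                     # unique id per firm-year cell
V = cluster_vcov(firm) + cluster_vcov(year) - cluster_vcov(fy_id)
se_twoway = np.sqrt(np.diag(V))
```

With one observation per firm-year cell, the intersection term reduces to the plain White meat, which is why the subtraction exactly undoes the double-counting.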
When to Use Two-Way Clustering
| Scenario | Cluster Dimension 1 | Cluster Dimension 2 |
|---|---|---|
| Firm-year panel | Firm | Year |
| Student test scores (student-by-year) | Student | School-year |
| Trade data (country-pair-by-year) | Exporter | Importer |
| Worker-firm matched data | Worker | Firm |
Two-way clustering is appropriate when you believe there are common shocks along both dimensions that are not fully captured by fixed effects. Note that fixed effects remove the mean of the common shock, but they do not eliminate the correlation in residuals caused by the shock.
library(fixest)
# Two-way clustering on firm and year
model <- feols(outcome ~ treatment + x1 + x2 | firm + year,
data = df,
cluster = ~firm + year)
summary(model)
# Equivalent with sandwich
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(firm) +
factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm,
                    cluster = df[, c("firm", "year")]))

Spatial / Conley Standard Errors
When Clustering Is Not Enough
Clustering works well when you can define discrete groups within which errors are correlated. But what about geographically distributed data where correlation decays smoothly with distance? If you study county-level outcomes across a state, neighboring counties likely have correlated errors, but there is no natural "cluster" boundary.
Conley (1999) standard errors handle this by allowing errors to be correlated between any two observations within a specified distance (or within a specified number of time periods), with the correlation declining with distance. The key input is the distance cutoff: pairs of observations farther apart than this cutoff are assumed to have uncorrelated errors.
Choosing the Distance Cutoff
There is no single correct cutoff. Common approaches:
- Domain knowledge: Use the spatial extent of the phenomenon you are studying. If a policy affects commuting zones, use a cutoff that encompasses a commuting zone radius (~100 km).
- Robustness: Report results for multiple cutoffs (e.g., 50 km, 100 km, 200 km, 500 km). If conclusions change dramatically, spatial correlation is a first-order concern.
- Variogram-based: Estimate the spatial correlation structure from the residuals and choose a cutoff where the correlation becomes negligible.
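Mechanically, a Conley estimator is a sandwich whose meat down-weights residual cross-products by distance. A toy numpy sketch with made-up coordinates and a Bartlett (linear-taper) kernel, which declines to zero at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
coords = rng.uniform(0, 5, size=(n, 2))     # toy spatial coordinates
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# spatially correlated errors: exponential covariance in distance
dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
L = np.linalg.cholesky(np.exp(-dist / 0.5) + 1e-8 * np.eye(n))
y = X @ np.array([1.0, 0.3]) + L @ rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Bartlett weights: 1 at zero distance, 0 at and beyond the cutoff
cutoff = 1.0
W = np.clip(1 - dist / cutoff, 0.0, None)
meat = X.T @ ((W * np.outer(u, u)) @ X)     # sum_ij w_ij u_i u_j x_i x_j'
se_conley = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Varying `cutoff` here is the code analog of the robustness exercise above: pairs beyond the cutoff get zero weight, so a larger cutoff admits more spatial correlation into the variance.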
library(fixest)
# Conley SEs with 100km distance cutoff
# Requires latitude and longitude in the data
model <- feols(outcome ~ treatment + x1 + x2 | state,
data = df,
vcov = conley(cutoff = 100,
lat = "latitude",
lon = "longitude"))
summary(model)
# For robustness, try multiple cutoffs
for (d in c(50, 100, 200, 500)) {
m <- feols(outcome ~ treatment + x1 + x2 | state,
data = df,
vcov = conley(cutoff = d,
lat = "latitude",
lon = "longitude"))
cat(sprintf("Cutoff = %d km: SE = %.4f\n",
d, se(m)["treatment"]))
}

Decision Tree: Choosing Your Standard Errors
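The guide's decision logic can be condensed into a small lookup. A purely illustrative Python sketch, with thresholds taken from the rules of thumb in the sections above:

```python
def choose_se(treatment_level, n_clusters=None, spatial=False, two_dims=False):
    """Illustrative condensation of this guide's decision logic."""
    if spatial:
        return "Conley spatial SEs (report several distance cutoffs)"
    if two_dims:
        return "two-way cluster-robust SEs"
    if treatment_level == "individual":
        return "heteroscedasticity-robust SEs (HC1; HC3 in small samples)"
    # treatment or common shocks at the group level: cluster there
    if n_clusters is None or n_clusters >= 30:
        return "cluster-robust SEs at the level of treatment variation"
    if n_clusters < 10:
        return "randomization inference (or CR2 with Satterthwaite df)"
    return "wild cluster bootstrap (or CR2)"

print(choose_se("state", n_clusters=9))
```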
Common Pitfalls
Clustering at Too Fine a Level
The most common standard error mistake in applied work is clustering at too fine a level. If a state-level policy is your treatment and you cluster at the county or individual level, you are dramatically understating your standard errors because you are treating within-state variation as independent information about the treatment effect.
Clustering When It Is Not Needed
The Abadie et al. (2023) framework makes clear that clustering is not always necessary. If treatment is assigned at the individual level (e.g., an individual-level randomized experiment) and there is no sampling-based reason for correlation, heteroscedasticity-robust SEs are sufficient. Unnecessary clustering can reduce power without improving validity.
Ignoring Serial Correlation in Panels
In panel data with fixed effects, errors within a unit are typically serially correlated. Failing to cluster at the unit level (or to use Newey-West standard errors) leads to standard errors that are too small. Bertrand, Duflo, and Mullainathan (2004) showed that this problem is severe in DiD settings: without clustering, the false rejection rate can exceed 40% when the nominal level is 5%.
Summary Table
| SE Type | When to Use | Key Requirement | Packages |
|---|---|---|---|
| HC1 | Default for cross-sectional data | None beyond OLS assumptions | sandwich (R), statsmodels (Python), built-in (Stata) |
| HC3 | Small samples (n < 250) | None beyond OLS assumptions | sandwich (R), statsmodels (Python), vce(hc3) (Stata) |
| Cluster-robust | Within-group correlation or group-level treatment | 30+ clusters for reliable asymptotics | fixest (R), pyfixest (Python), reghdfe (Stata) |
| Wild cluster bootstrap | Few clusters (6--30) | Treatment varies at cluster level | fwildclusterboot (R), wildboottest (Python), boottest (Stata) |
| CR2 | Few clusters, bias correction needed | Linear model | clubSandwich (R) |
| Two-way clustering | Correlation along two dimensions | Many clusters in both dimensions | fixest (R), pyfixest (Python), reghdfe (Stata) |
| Conley spatial | Geographically correlated errors | Lat/lon coordinates, distance cutoff | fixest (R), acreg (Stata) |
| Randomization inference | Very few clusters (< 10), experiments | Known assignment mechanism | ritest (Stata), ri2 (R) |