MethodAtlas Guide

Choosing Your Standard Errors

When to use heteroscedasticity-robust, clustered, two-way clustered, Conley spatial, or wild bootstrap standard errors. Decision tree with code for every SE type in R, Python, and Stata.

Why Standard Errors Matter

Your point estimate is only half the story. The standard error determines the width of your confidence interval, the magnitude of your t-statistic, and whether your result is statistically distinguishable from zero. Using the wrong standard errors can make a null result look significant or a real effect look insignificant.

Getting standard errors right is not a technicality -- it is a core part of credible inference. This guide walks through the major standard error choices you will face in applied work, explains when each is appropriate, and provides implementation code in R, Python, and Stata.

HC0--HC3: Heteroscedasticity-Robust Standard Errors

The Problem with Default Standard Errors

Classical OLS standard errors assume homoscedasticity: the variance of the error term is constant across all observations. In practice, this assumption almost never holds. When errors are heteroscedastic, classical standard errors are inconsistent -- they can be too large or too small, and you cannot predict which direction without knowing the form of heteroscedasticity.

Heteroscedasticity-robust standard errors (also called White standard errors or sandwich standard errors) provide consistent estimates of the variance without requiring any assumption about the error structure.

The HC Family

There are several variants of the heteroscedasticity-consistent (HC) estimator, and they differ in how they handle finite-sample adjustments:

Variant | Adjustment | Properties
HC0 | None (raw White estimator) | Consistent but downward-biased in finite samples
HC1 | Multiply by n/(n - k) | Stata's default vce(robust); simple degrees-of-freedom correction
HC2 | Divide squared residuals by 1 - h_ii | Unbiased under homoscedasticity; better for moderate samples
HC3 | Divide squared residuals by (1 - h_ii)^2 | Jackknife-like correction; best finite-sample performance; the default type in R's sandwich::vcovHC

Here h_ii is the i-th diagonal element of the hat matrix, which measures the leverage of observation i.
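The formulas in the table are simple enough to verify by hand. Below is a from-scratch Python (numpy) sketch that builds all four variants from the hat-matrix diagonal; the data are simulated and every name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                   # intercept + regressor
y = 1 + 2 * x + rng.normal(size=n) * (1 + np.abs(x))   # heteroscedastic errors

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta                                       # OLS residuals
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)            # hat-matrix diagonal h_ii

def hc_se(weights):
    """Sandwich SEs with per-observation weights on squared residuals."""
    meat = X.T @ (X * (weights * u**2)[:, None])
    V = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(V))

se_hc0 = hc_se(np.ones(n))                 # raw White estimator
se_hc1 = se_hc0 * np.sqrt(n / (n - k))     # degrees-of-freedom correction
se_hc2 = hc_se(1 / (1 - h))                # leverage correction
se_hc3 = hc_se(1 / (1 - h)**2)             # jackknife-like correction
```

Because 1/(1 - h_ii) is at least 1, the HC2 and HC3 standard errors are never smaller than HC0 on the same fit.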

Which HC Variant to Use

  • HC1 is a safe default and is the standard in economics (it matches Stata's robust option). Use it for comparability with most published work.
  • HC3 has the best finite-sample properties and is recommended by Long and Ervin (2000) when n < 250. In R, fixest's "hetero" vcov corresponds to HC1; for HC3, use sandwich::vcovHC with type = "HC3".
  • HC0 should generally be avoided -- its finite-sample bias buys you nothing.
  • HC2 is a middle ground; it is exactly unbiased under homoscedasticity.

For most applied work, the choice between HC1 and HC3 is second-order. The first-order decision is to use some robust standard error rather than the classical default.

library(fixest)

# Using fixest (recommended for panel/applied work)
model <- feols(outcome ~ treatment + x1 + x2, data = df,
               vcov = "HC1")
summary(model)

# Using sandwich + lmtest (base R workflow)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2, data = df)
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC1"))

# HC3 for small samples (fixest has no built-in HC3; use sandwich)
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC3"))

Clustering Standard Errors

When to Cluster

Clustering accounts for within-group correlation in errors. You should cluster when errors are correlated within groups -- typically because units within a group share common shocks, or because treatment is assigned at the group level.

The key insight comes from Abadie et al. (2023), who formalize when clustering is necessary. Their framework identifies two distinct reasons to cluster:

  1. Sampling design: If you sample clusters (e.g., schools) and then observe all or many units within each cluster (e.g., students), you need to cluster to account for the fact that your observations are not independent draws from the population.

  2. Treatment assignment: If treatment is assigned at the cluster level (e.g., a state-level policy), you must cluster at that level to avoid overstating the effective sample size. Your effective number of independent observations is the number of clusters, not the number of individual units.

The practical rule: cluster at the level at which treatment varies or at which there are common shocks to the error term. When in doubt, cluster at the more aggregate level. Clustering at too fine a level produces standard errors that are too small; clustering at too coarse a level produces conservative but valid inference, provided enough clusters remain for the asymptotics.
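To see mechanically what clustering changes, here is a from-scratch Python (numpy) sketch of the cluster-robust variance estimator on simulated data with a common within-cluster shock. The small-sample factor mirrors Stata's convention; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
G, m = 40, 25                       # 40 clusters, 25 units each
n = G * m
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G)[cluster] + rng.normal(size=n)   # regressor with a cluster-level component
shock = rng.normal(size=G)[cluster]                     # common within-cluster error shock
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + shock + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
u = y - X @ (XtX_inv @ X.T @ y)     # OLS residuals

# Cluster-robust "meat": sum over clusters of (X_g' u_g)(X_g' u_g)'
meat = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ u[idx]
    meat += np.outer(s, s)

k = X.shape[1]
correction = (G / (G - 1)) * ((n - 1) / (n - k))   # Stata-style small-sample factor
se_cl = np.sqrt(np.diag(correction * XtX_inv @ meat @ XtX_inv))

# Heteroscedasticity-robust SE ignores the within-cluster correlation
meat_hc = X.T @ (X * (u**2)[:, None])
se_hc = np.sqrt(np.diag(XtX_inv @ meat_hc @ XtX_inv))
```

On data like these, se_cl is substantially larger than se_hc: the heteroscedasticity-robust estimator treats correlated within-cluster observations as independent pieces of information.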

At What Level to Cluster

Scenario | Cluster level | Reason
State-level policy, individual-level data | State | Treatment varies at the state level
Classroom-randomized experiment | Classroom | Treatment assigned at the classroom level
Firm-level panel (firm-year observations) | Firm | Serial correlation within firms over time
DiD with state-level treatment | State | Treatment varies at the state level
IV with region-level instrument | Region | Instrument varies at the region level

library(fixest)

# Cluster at the state level
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
               data = df,
               cluster = ~state)
summary(model)

# Equivalent using sandwich (for lm objects)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
                 factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm, cluster = df$state))

Concept Check

You are estimating a difference-in-differences model where a state-level minimum wage increase is the treatment, and your unit of observation is individual workers across 50 states and 10 years. At what level should you cluster your standard errors?

Few-Cluster Inference

The Problem with Few Clusters

Standard cluster-robust standard errors rely on asymptotics in the number of clusters. When you have few clusters -- a common rule of thumb is fewer than 30 to 50 -- these asymptotics break down. The cluster-robust variance estimator is biased downward, and the t-distribution approximation is too liberal. You reject the null too often.

The few-cluster problem is not a hypothetical concern. Many credible research designs have few clusters: U.S. Census divisions (9), Canadian provinces (10), treatment arms in a clustered experiment (15 schools per arm), or countries in a cross-national study.

Solution 1: Wild Cluster Bootstrap

The wild cluster bootstrap, proposed by Cameron, Gelbach, and Miller (2008), resamples at the cluster level using Rademacher weights. It works well with as few as 6 clusters (though performance degrades below that). The key idea is to impose the null hypothesis when generating bootstrap samples, which improves finite-sample performance.

library(fwildclusterboot)
library(fixest)

# First, estimate the model
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
               data = df, cluster = ~state)

# Wild cluster bootstrap p-value and CI
boot_result <- boottest(
  model,
  param = "treatment",
  clustid = "state",
  B = 9999,            # number of bootstrap iterations
  type = "rademacher"  # Rademacher weights (default)
)
summary(boot_result)
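Under the hood, the procedure is only a few lines. Here is a from-scratch Python (numpy) sketch on simulated data with a true null; the intercept-only restricted model is the simplest case, and all names are illustrative (real analyses should use the packages above):

```python
import numpy as np

rng = np.random.default_rng(2)
G, m = 8, 30                        # few clusters: 8 clusters of 30
n = G * m
cluster = np.repeat(np.arange(G), m)
treat = (cluster < 4).astype(float)                    # treatment varies by cluster
y = rng.normal(size=G)[cluster] + rng.normal(size=n)   # true null: no treatment effect

X = np.column_stack([np.ones(n), treat])

def tstat(yv):
    """OLS t-stat on treatment with a cluster-robust SE."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ yv
    u = yv - X @ b
    meat = np.zeros((2, 2))
    for g in range(G):
        s = X[cluster == g].T @ u[cluster == g]
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return b[1] / np.sqrt(V[1, 1])

t_obs = tstat(y)

# Impose the null when resampling: the restricted model drops the treatment
y0_fit = np.full(n, y.mean())       # intercept-only restricted fit
u0 = y - y0_fit

B = 999
t_boot = np.empty(B)
for b in range(B):
    w = rng.choice([-1.0, 1.0], size=G)[cluster]   # Rademacher sign flips, one per cluster
    t_boot[b] = tstat(y0_fit + w * u0)

p_value = np.mean(np.abs(t_boot) >= np.abs(t_obs))
```

The key step is that the Rademacher weight is drawn once per cluster and applied to every residual in that cluster, preserving the within-cluster correlation structure in each bootstrap sample.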

Solution 2: CR2 Bias-Corrected Cluster Variance

The CR2 correction (also called the Bell-McCaffrey correction) adjusts the cluster-robust variance estimator to reduce its small-sample bias and uses Satterthwaite degrees of freedom for t-tests. CR2 is the clustered analog of HC2 for heteroscedasticity-robust SEs.

library(clubSandwich)

# Estimate base model
model <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
              factor(year), data = df)

# CR2 standard errors with Satterthwaite df
cr2 <- coef_test(model, vcov = "CR2", cluster = df$state,
                 test = "Satterthwaite")
print(cr2)

Solution 3: Randomization Inference

When the number of clusters is very small (fewer than 10), even the wild cluster bootstrap may not perform well. Randomization inference provides an exact test that does not rely on asymptotic approximations. It is especially natural for experimental designs where the researcher controls treatment assignment.

The idea is simple: under the sharp null of no treatment effect, you know exactly what the outcome would have been under any treatment assignment. You enumerate (or sample from) all possible random assignments, compute the test statistic under each, and compare your observed statistic to this reference distribution.
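For a cluster-randomized design with few clusters, the whole procedure fits in a few lines. A from-scratch Python (numpy) sketch on simulated data — 8 clusters with 4 treated, so all C(8,4) = 70 assignments can be enumerated exactly; the difference in means as test statistic and all names are illustrative choices:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
G, m = 8, 200                       # 8 clusters (schools), 200 students each
cluster = np.repeat(np.arange(G), m)
treated_clusters = (0, 1, 2, 3)     # actual assignment: 4 of 8 schools treated
treat = np.isin(cluster, treated_clusters)
y = 0.5 * treat + rng.normal(size=G)[cluster] + rng.normal(size=G * m)

def diff_in_means(assignment):
    """Test statistic under a hypothetical cluster-level assignment."""
    t = np.isin(cluster, assignment)
    return y[t].mean() - y[~t].mean()

stat_obs = diff_in_means(treated_clusters)

# Under the sharp null the outcomes are fixed, so enumerate every
# possible assignment of 4 treated clusters out of 8
ref = np.array([diff_in_means(c) for c in itertools.combinations(range(G), 4)])
p_value = np.mean(np.abs(ref) >= np.abs(stat_obs))
```

Because the observed assignment is one of the 70 enumerated ones, the smallest attainable p-value here is 1/70; with more clusters the reference distribution is sampled rather than fully enumerated.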

Concept Check

You are running a cluster-randomized trial where 8 schools are randomly assigned to treatment (4 schools) and control (4 schools), with 200 students per school. You estimate the treatment effect and cluster standard errors at the school level. A reviewer says your inference may be invalid. What should you do?

Two-Way Clustering

When You Need It

Sometimes errors are correlated along two dimensions simultaneously. The classic example is firm-year panel data: errors are correlated within firms over time (serial correlation) and across firms within the same year (common year shocks). One-way clustering handles one dimension but not both.

Cameron et al. (2011) show how to construct standard errors that are robust to arbitrary correlation along two dimensions. The two-way cluster-robust variance estimator is:

V_two-way = V_cluster1 + V_cluster2 - V_cluster1∩cluster2

where V_cluster1 is the variance estimate clustering on the first dimension, V_cluster2 clusters on the second, and V_cluster1∩cluster2 clusters on the intersection of the two (which corrects for double-counting).
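The two-way formula can be checked directly. A from-scratch Python (numpy) sketch on a simulated firm-year panel (illustrative names; note that with one observation per firm-year cell, the intersection term reduces to the White estimator):

```python
import numpy as np

rng = np.random.default_rng(4)
F, T = 30, 20                       # 30 firms, 20 years
firm = np.repeat(np.arange(F), T)
year = np.tile(np.arange(T), F)
n = F * T
x = rng.normal(size=F)[firm] + rng.normal(size=T)[year] + rng.normal(size=n)
y = 2 * x + rng.normal(size=F)[firm] + rng.normal(size=T)[year] + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
u = y - X @ (XtX_inv @ X.T @ y)     # OLS residuals

def cluster_vcov(ids):
    """One-way cluster-robust vcov for a vector of group ids."""
    meat = np.zeros((2, 2))
    for g in np.unique(ids):
        s = X[ids == g].T @ u[ids == g]
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

# V_two-way = V_firm + V_year - V_intersection; here each firm-year
# cell has one observation, so the intersection term is HC0
inter = firm * T + year
V_twoway = cluster_vcov(firm) + cluster_vcov(year) - cluster_vcov(inter)
se_twoway = np.sqrt(np.diag(V_twoway))
```

In small samples the subtraction can occasionally produce a non-positive-definite variance matrix; practical implementations apply an eigenvalue fix, which is one reason to prefer a package over this sketch in real work.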

When to Use Two-Way Clustering

Scenario | Cluster dimension 1 | Cluster dimension 2
Firm-year panel | Firm | Year
Student test scores (student-by-year) | Student | School-year
Trade data (country-pair-by-year) | Exporter | Importer
Worker-firm matched data | Worker | Firm

Two-way clustering is appropriate when you believe there are common shocks along both dimensions that are not fully captured by fixed effects. Note that fixed effects remove the mean of the common shock, but they do not eliminate the correlation in residuals caused by the shock.

library(fixest)

# Two-way clustering on firm and year
model <- feols(outcome ~ treatment + x1 + x2 | firm + year,
               data = df,
               cluster = ~firm + year)
summary(model)

# Equivalent with sandwich
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(firm) +
                 factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm,
                                 cluster = df[, c("firm", "year")]))

Spatial / Conley Standard Errors

When Clustering Is Not Enough

Clustering works well when you can define discrete groups within which errors are correlated. But what about geographically distributed data where correlation decays smoothly with distance? If you study county-level outcomes across a state, neighboring counties likely have correlated errors, but there is no natural "cluster" boundary.

Conley (1999) standard errors handle this by allowing errors to be correlated between any two observations within a specified distance (or within a specified number of time periods), with the correlation declining with distance. The key input is the distance cutoff: pairs of observations farther apart than this cutoff are assumed to have uncorrelated errors.

Choosing the Distance Cutoff

There is no single correct cutoff. Common approaches:

  • Domain knowledge: Use the spatial extent of the phenomenon you are studying. If a policy affects commuting zones, use a cutoff that encompasses a commuting zone radius (~100 km).
  • Robustness: Report results for multiple cutoffs (e.g., 50 km, 100 km, 200 km, 500 km). If conclusions change dramatically, spatial correlation is a first-order concern.
  • Variogram-based: Estimate the spatial correlation structure from the residuals and choose a cutoff where the correlation becomes negligible.
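Before reaching for a package, it can help to see the estimator itself. A from-scratch Python (numpy) sketch using a linearly decaying (Bartlett-type) kernel inside the cutoff; the toy coordinates, the flat-earth distance approximation, and all names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
lat = rng.uniform(0.0, 5.0, size=n)    # toy coordinates (degrees)
lon = rng.uniform(0.0, 5.0, size=n)
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
u = y - X @ (XtX_inv @ X.T @ y)        # OLS residuals

# Pairwise distances in km (flat-earth approximation, fine at this scale)
km_per_deg = 111.0
dlat = lat[:, None] - lat[None, :]
dlon = np.cos(np.deg2rad(lat.mean())) * (lon[:, None] - lon[None, :])
dist = km_per_deg * np.sqrt(dlat**2 + dlon**2)

def conley_se(cutoff_km):
    """Sandwich SEs allowing correlation between pairs within the cutoff,
    downweighted linearly with distance (Bartlett-type kernel)."""
    K = np.clip(1.0 - dist / cutoff_km, 0.0, None)
    scores = X * u[:, None]            # s_i = x_i * u_i
    meat = scores.T @ K @ scores       # sum over pairs of K_ij s_i s_j'
    V = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(V))

se_by_cutoff = {c: conley_se(c) for c in (50, 100, 200)}
```

As the cutoff shrinks toward zero, the kernel keeps only the diagonal and the estimator collapses to HC0 -- a useful sanity check when debugging.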

library(fixest)

# Conley SEs with 100 km distance cutoff
# Requires latitude and longitude in the data
model <- feols(outcome ~ treatment + x1 + x2 | state,
               data = df,
               vcov = vcov_conley(lat = "latitude",
                                  lon = "longitude",
                                  cutoff = 100))
summary(model)

# For robustness, try multiple cutoffs
for (d in c(50, 100, 200, 500)) {
  m <- feols(outcome ~ treatment + x1 + x2 | state,
             data = df,
             vcov = vcov_conley(lat = "latitude",
                                lon = "longitude",
                                cutoff = d))
  cat(sprintf("Cutoff = %d km: SE = %.4f\n",
              d, se(m)["treatment"]))
}

Decision Tree: Choosing Your Standard Errors

  • Is treatment assigned at a group level, or are there common shocks within groups? If no: heteroscedasticity-robust SEs suffice -- HC1 by default, HC3 if n < 250.
  • If yes, how many clusters at the relevant level? 30 or more: cluster-robust SEs at the level of treatment variation. Roughly 6 to 30: wild cluster bootstrap or CR2. Fewer than 10 in an experimental design: randomization inference.
  • Correlation along two dimensions (e.g., firm and year)? Two-way clustering, provided both dimensions have many clusters.
  • Correlation that decays smoothly with distance rather than stopping at group boundaries? Conley spatial SEs, reported across several cutoffs.

Common Pitfalls

Clustering at Too Fine a Level

The most common standard error mistake in applied work is clustering at too fine a level. If a state-level policy is your treatment and you cluster at the county or individual level, you are dramatically understating your standard errors because you are treating within-state variation as independent information about the treatment effect.

Clustering When It Is Not Needed

The Abadie et al. (2023) framework makes clear that clustering is not always necessary. If treatment is assigned at the individual level (e.g., an individual-level randomized experiment) and there is no sampling-based reason for correlation, heteroscedasticity-robust SEs are sufficient. Unnecessary clustering can reduce power without improving validity.

Ignoring Serial Correlation in Panels

In panel data with fixed effects, errors within a unit are typically serially correlated. Failing to cluster at the unit level (or to use Newey-West standard errors) leads to standard errors that are too small. Bertrand, Duflo, and Mullainathan (2004) showed that this problem is severe in DiD settings: without clustering, the false rejection rate can exceed 40% when the nominal level is 5%.

Error Detective

Read the analysis below carefully and identify the errors.

A researcher conducts a cluster-randomized trial where 30 classrooms are randomly assigned to use a new curriculum (15 treatment, 15 control), with about 25 students per classroom. She estimates the treatment effect and reports:

reg test_score treatment female age, vce(robust)

Coefficient on treatment: 4.2 (robust SE = 0.9, p < 0.001). She writes: "Using heteroscedasticity-robust standard errors, we find that the new curriculum significantly increases test scores by 4.2 points (p < 0.001)."

What errors can you identify?

Summary Table

SE type | When to use | Key requirement | Packages
HC1 | Default for cross-sectional data | None beyond OLS assumptions | sandwich (R), statsmodels (Python), built-in (Stata)
HC3 | Small samples (n < 250) | None beyond OLS assumptions | sandwich (R), statsmodels (Python), vce(hc3) (Stata)
Cluster-robust | Within-group correlation or group-level treatment | 30+ clusters for reliable asymptotics | fixest (R), pyfixest (Python), reghdfe (Stata)
Wild cluster bootstrap | Few clusters (6-30) | Treatment varies at cluster level | fwildclusterboot (R), wildboottest (Python), boottest (Stata)
CR2 | Few clusters, bias correction needed | Linear model | clubSandwich (R)
Two-way clustering | Correlation along two dimensions | Many clusters in both dimensions | fixest (R), pyfixest (Python), reghdfe (Stata)
Conley spatial | Geographically correlated errors | Lat/lon coordinates, distance cutoff | fixest (R), acreg (Stata)
Randomization inference | Very few clusters (< 10), experiments | Known assignment mechanism | ritest (Stata), ri2 (R)