Choosing Your Standard Errors
When to use heteroscedasticity-robust, clustered, two-way clustered, Conley spatial, or wild bootstrap standard errors. Decision tree with code for every SE type in R, Python, and Stata.
Why Standard Errors Matter
Your point estimate is only half the story. The standard error determines the width of your confidence interval, the magnitude of your t-statistic, and whether your result is statistically distinguishable from zero. Using the wrong standard errors can make a null result look significant or a real effect look insignificant.
Getting standard errors right is not a technicality -- it is a core part of credible inference. This guide walks through the major standard error choices you will face in applied work, explains when each is appropriate, and provides implementation code in R, Python, and Stata.
HC0--HC3: Heteroscedasticity-Robust Standard Errors
The Problem with Default Standard Errors
Classical OLS standard errors assume homoscedasticity: the variance of the error term is constant across all observations. In practice, this assumption almost never holds. When errors are heteroscedastic, classical standard errors are inconsistent -- they can be too large or too small, and you cannot predict which direction without knowing the form of heteroscedasticity.
Heteroscedasticity-robust standard errors (also called White standard errors or sandwich standard errors) provide consistent estimates of the variance without requiring any assumption about the error structure.
The HC Family
There are several variants of the heteroscedasticity-consistent (HC) estimator, and they differ in how they handle finite-sample adjustments:
| Variant | Formula Adjustment | Properties |
|---|---|---|
| HC0 | None (raw White estimator) | Consistent but downward-biased in finite samples |
| HC1 | Multiplies HC0 by n/(n-k) | Stata's default vce(robust); simple degrees-of-freedom correction |
| HC2 | Divides squared residuals by (1 - h_ii) | Unbiased under homoscedasticity; better for moderate samples |
| HC3 | Divides squared residuals by (1 - h_ii)^2 | Jackknife-like correction; best finite-sample performance; the default type in R's sandwich::vcovHC |
Here h_ii is the i-th diagonal element of the hat matrix H = X(X'X)^(-1)X', which measures the leverage of observation i.
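To make the table concrete, here is a from-scratch Python sketch (illustrative, not the guide's own code) that computes all four variants on simulated heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
# heteroscedastic errors: the variance grows with |x1|
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage h_ii

def hc_se(u2):
    """Sandwich SEs from (possibly reweighted) squared residuals u2."""
    meat = X.T @ (X * u2[:, None])
    return np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

se_hc0 = hc_se(u**2)                        # raw White estimator
se_hc1 = se_hc0 * np.sqrt(n / (n - k))      # degrees-of-freedom correction
se_hc2 = hc_se(u**2 / (1 - h))              # leverage correction
se_hc3 = hc_se(u**2 / (1 - h)**2)           # jackknife-like correction
```

Because every leverage h_ii is strictly positive, the variants are ordered: HC3 weights the high-leverage residuals most heavily and thus yields the largest standard errors.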
Which HC Variant to Use
- HC1 is a safe default and is the standard in economics (it matches Stata's `robust` option). Use this for comparability with most published work.
- HC3 has the best finite-sample properties and is recommended by Long and Ervin (2000) for samples with n < 250. In R, `fixest`'s robust option corresponds to HC1; for HC3, use `sandwich::vcovHC(..., type = "HC3")`.
- HC0 should generally be avoided -- its finite-sample bias is unnecessary to accept.
- HC2 is a middle ground; it is exactly unbiased under homoscedasticity.
For most applied work, the choice between HC1 and HC3 is second-order. The first-order decision is to use some robust standard error rather than the classical default.
library(fixest)
# Using fixest (recommended for panel/applied work)
model <- feols(outcome ~ treatment + x1 + x2, data = df,
vcov = "HC1")
summary(model)
# Using sandwich + lmtest (base R workflow)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2, data = df)
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC1"))
# HC3 for small samples: fixest's built-in robust VCOV is HC1,
# so use sandwich's vcovHC for HC3
coeftest(model_lm, vcov = vcovHC(model_lm, type = "HC3"))

Clustering Standard Errors
When to Cluster
Clustering accounts for within-group correlation in errors. You should cluster when errors are correlated within groups -- typically because units within a group share common shocks, or because treatment is assigned at the group level.
The key insight comes from Abadie et al. (2023), who formalize when clustering is necessary. Their framework identifies two distinct reasons to cluster:
- Sampling design: If you sample clusters (e.g., schools) and then observe all or many units within each cluster (e.g., students), you need to cluster to account for the fact that your observations are not independent draws from the population.
- Treatment assignment: If treatment is assigned at the cluster level (e.g., a state-level policy), you must cluster at that level to avoid overstating the effective sample size. Your effective number of independent observations is the number of clusters, not the number of individual units.
The practical rule: cluster at the level at which treatment varies or at which there are common shocks to the error term. When in doubt, cluster at the more aggregate level. Clustering at too fine a level leads to standard errors that are too small; clustering at too coarse a level produces conservative but valid inference.
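For intuition, the cluster-robust estimator replaces the observation-level "meat" of the sandwich with outer products of cluster-level score sums. A minimal numpy sketch on simulated data (illustrative names, not the guide's code) shows how much a within-cluster common shock inflates the standard errors relative to heteroscedasticity-robust SEs:

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g = 40, 25                          # 40 clusters of 25 units
n = G * n_g
cluster = np.repeat(np.arange(G), n_g)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# errors share a common within-cluster shock
y = X @ np.array([1.0, 0.5]) + rng.normal(size=G)[cluster] + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# meat: sum over clusters of outer products of cluster score sums X_g' u_g
meat = np.zeros((2, 2))
for g in range(G):
    m = cluster == g
    s = X[m].T @ u[m]
    meat += np.outer(s, s)

# finite-sample correction used by Stata's vce(cluster)
c = (G / (G - 1)) * ((n - 1) / (n - 2))
se_cluster = np.sqrt(np.diag(c * XtX_inv @ meat @ XtX_inv))

# heteroscedasticity-robust (HC0) SEs ignore the within-cluster correlation
se_hc0 = np.sqrt(np.diag(XtX_inv @ (X.T @ (X * u[:, None] ** 2)) @ XtX_inv))
```

With these simulated shocks, the clustered SE on the intercept is several times the HC0 SE: exactly the understatement you get by failing to cluster.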
At What Level to Cluster
| Scenario | Cluster Level | Reason |
|---|---|---|
| State-level policy, individual-level data | State | Treatment varies at state level |
| Classroom-randomized experiment | Classroom | Treatment assigned at classroom level |
| Firm-level panel (firm-year observations) | Firm | Serial correlation within firms over time |
| DiD with state-level treatment | State | Treatment varies at state level |
| IV with region-level instrument | Region | Instrument varies at region level |
library(fixest)
# Cluster at the state level
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
data = df,
cluster = ~state)
summary(model)
# Equivalent using sandwich (for lm objects)
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm, cluster = df$state))

Worked example: you are estimating a difference-in-differences model where a state-level minimum wage increase is the treatment and the unit of observation is individual workers across 50 states and 10 years. Cluster at the state level -- treatment varies at the state level, so within-state observations do not carry independent information about the treatment effect.
Few-Cluster Inference
The Problem with Few Clusters
Standard cluster-robust standard errors rely on asymptotics in the number of clusters. When you have few clusters -- a common rule of thumb is fewer than 30 to 50 -- these asymptotics break down. The cluster-robust variance estimator is biased downward, and the t-distribution approximation is too liberal. You reject the null too often.
The few-cluster problem is not a hypothetical concern. Many credible research designs have few clusters: U.S. Census divisions (9), Canadian provinces (10), treatment arms in a clustered experiment (15 schools per arm), or countries in a cross-national study.
Solution 1: Wild Cluster Bootstrap
The wild cluster bootstrap, proposed by Cameron, Gelbach, and Miller (2008), resamples at the cluster level using Rademacher weights. It works well with as few as 6 clusters (though performance degrades below that). The key idea is to impose the null hypothesis when generating bootstrap samples, which improves finite-sample performance.
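The mechanics can be sketched in a few lines of Python (illustrative only; in practice use the packages below). With a single regressor and the null beta = 0, the restricted model is an intercept-only fit, and each bootstrap draw flips the sign of all residuals within a cluster:

```python
import numpy as np

rng = np.random.default_rng(2)
G, n_g = 10, 30
n = G * n_g
cluster = np.repeat(np.arange(G), n_g)
x = rng.normal(size=n) + rng.normal(size=G)[cluster]
y = rng.normal(size=G)[cluster] + rng.normal(size=n)   # true effect = 0
X = np.column_stack([np.ones(n), x])

def t_stat(y_):
    """Cluster-robust t-statistic on the slope coefficient."""
    b = np.linalg.solve(X.T @ X, X.T @ y_)
    u = y_ - X @ b
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = sum(np.outer(X[cluster == g].T @ u[cluster == g],
                        X[cluster == g].T @ u[cluster == g])
               for g in range(G))
    se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return b[1] / se[1]

# restricted residuals: impose the null (slope = 0), intercept-only fit
b0 = y.mean()
u0 = y - b0

t_obs = t_stat(y)
B = 999
t_boot = np.empty(B)
for i in range(B):
    w = rng.choice([-1.0, 1.0], size=G)[cluster]   # Rademacher draw per cluster
    t_boot[i] = t_stat(b0 + u0 * w)

p = np.mean(np.abs(t_boot) >= np.abs(t_obs))       # bootstrap p-value
```

Imposing the null when constructing the bootstrap samples (the "WCR" variant) is what gives the procedure its good performance with very few clusters.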
library(fwildclusterboot)
library(fixest)
# First, estimate the model
model <- feols(outcome ~ treatment + x1 + x2 | state + year,
data = df, cluster = ~state)
# Wild cluster bootstrap p-value and CI
boot_result <- boottest(
model,
param = "treatment",
clustid = "state",
B = 9999, # number of bootstrap iterations
type = "rademacher" # Rademacher weights (default)
)
summary(boot_result)

Solution 2: CR2 Bias-Corrected Cluster Variance
The CR2 correction (also called the Bell-McCaffrey correction) adjusts the cluster-robust variance estimator to reduce its small-sample bias and uses Satterthwaite degrees of freedom for t-tests. CR2 is the clustered analog of HC2 for heteroscedasticity-robust SEs.
library(clubSandwich)
# Estimate base model
model <- lm(outcome ~ treatment + x1 + x2 + factor(state) +
factor(year), data = df)
# CR2 standard errors with Satterthwaite df
cr2 <- coef_test(model, vcov = "CR2", cluster = df$state,
test = "Satterthwaite")
print(cr2)

Solution 3: Randomization Inference
When the number of clusters is very small (fewer than 10), even the wild cluster bootstrap may not perform well. Randomization inference provides an exact test that does not rely on asymptotic approximations. It is especially natural for experimental designs where the researcher controls treatment assignment.
The idea is simple: under the sharp null of no treatment effect, you know exactly what the outcome would have been under any treatment assignment. You enumerate (or sample from) all possible random assignments, compute the test statistic under each, and compare your observed statistic to this reference distribution.
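In the cluster-randomized case this takes only a few lines. A numpy sketch (hypothetical data: 8 schools, 4 treated) that enumerates all C(8, 4) = 70 assignments:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
G, n_g = 8, 200                          # 8 schools, 200 students each
school = np.repeat(np.arange(G), n_g)
treated = [0, 1, 2, 3]                   # observed assignment: 4 of 8 treated
D = np.isin(school, treated).astype(float)
y = 0.3 * D + rng.normal(size=G)[school] + rng.normal(size=G * n_g)

obs_diff = y[D == 1].mean() - y[D == 0].mean()

# under the sharp null, recompute the statistic for every assignment
ref = []
for combo in combinations(range(G), 4):
    d = np.isin(school, list(combo))
    ref.append(y[d].mean() - y[~d].mean())
ref = np.array(ref)

# exact p-value: share of assignments at least as extreme as observed
p_exact = np.mean(np.abs(ref) >= np.abs(obs_diff))
```

Because the observed assignment is one of the 70, the smallest attainable p-value is 1/70; this discreteness is the price of exactness with so few clusters.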
Worked example: you are running a cluster-randomized trial where 8 schools are randomly assigned to treatment (4 schools) and control (4 schools), with 200 students per school. You estimate the treatment effect and cluster standard errors at the school level, and a reviewer objects that your inference may be invalid. The reviewer is right: with only 8 clusters, the asymptotics behind cluster-robust standard errors break down. Use randomization inference, which is exact here because you control the assignment mechanism, and report the wild cluster bootstrap or CR2 as robustness checks.
Two-Way Clustering
When You Need It
Sometimes errors are correlated along two dimensions simultaneously. The classic example is firm-year panel data: errors are correlated within firms over time (serial correlation) and across firms within the same year (common year shocks). One-way clustering handles one dimension but not both.
Cameron et al. (2011) show how to construct standard errors that are robust to arbitrary correlation along two dimensions. The two-way cluster-robust variance estimator is

V_twoway = V_1 + V_2 - V_(1∩2),

where V_1 clusters on the first dimension, V_2 clusters on the second, and V_(1∩2) clusters on the intersection (which corrects for double-counting).
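The formula translates directly into code. A numpy sketch on a simulated firm-year panel (illustrative; no small-sample corrections applied):

```python
import numpy as np

rng = np.random.default_rng(4)
F, T = 60, 12                               # firms x years, one obs per cell
firm = np.repeat(np.arange(F), T)
year = np.tile(np.arange(T), F)
n = F * T
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (X @ np.array([1.0, 0.4])
     + rng.normal(size=F)[firm]             # firm shock (serial correlation)
     + rng.normal(size=T)[year]             # year shock (cross-sectional)
     + rng.normal(size=n))

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

def cluster_vcov(ids):
    """One-way cluster-robust VCOV for a given grouping."""
    meat = np.zeros((2, 2))
    for g in np.unique(ids):
        s = X[ids == g].T @ u[ids == g]
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

# V_twoway = V_firm + V_year - V_(firm ∩ year)
fy_id = firm * T + year                     # unique id per firm-year cell
V = cluster_vcov(firm) + cluster_vcov(year) - cluster_vcov(fy_id)
se_twoway = np.sqrt(np.diag(V))
```

With one observation per firm-year cell, the intersection term reduces to the plain White meat, which is why the subtraction exactly undoes the double-counting.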
When to Use Two-Way Clustering
| Scenario | Cluster Dimension 1 | Cluster Dimension 2 |
|---|---|---|
| Firm-year panel | Firm | Year |
| Student test scores (student-by-year) | Student | School-year |
| Trade data (country-pair-by-year) | Exporter | Importer |
| Worker-firm matched data | Worker | Firm |
Two-way clustering is appropriate when you believe there are common shocks along both dimensions that are not fully captured by fixed effects. Note that fixed effects remove the mean of the common shock, but they do not eliminate the correlation in residuals caused by the shock.
library(fixest)
# Two-way clustering on firm and year
model <- feols(outcome ~ treatment + x1 + x2 | firm + year,
data = df,
cluster = ~firm + year)
summary(model)
# Equivalent with sandwich
library(sandwich)
library(lmtest)
model_lm <- lm(outcome ~ treatment + x1 + x2 + factor(firm) +
factor(year), data = df)
coeftest(model_lm, vcov = vcovCL(model_lm,
                    cluster = df[, c("firm", "year")]))

Spatial / Conley Standard Errors
When Clustering Is Not Enough
Clustering works well when you can define discrete groups within which errors are correlated. But what about geographically distributed data where correlation decays smoothly with distance? If you study county-level outcomes across a state, neighboring counties likely have correlated errors, but there is no natural "cluster" boundary.
Conley (1999) standard errors handle this by allowing errors to be correlated between any two observations within a specified distance (or within a specified number of time periods), with the correlation declining with distance. The key input is the distance cutoff: pairs of observations farther apart than this cutoff are assumed to have uncorrelated errors.
Choosing the Distance Cutoff
There is no single correct cutoff. Common approaches:
- Domain knowledge: Use the spatial extent of the phenomenon you are studying. If a policy affects commuting zones, use a cutoff that encompasses a commuting zone radius (~100 km).
- Robustness: Report results for multiple cutoffs (e.g., 50 km, 100 km, 200 km, 500 km). If conclusions change dramatically, spatial correlation is a first-order concern.
- Variogram-based: Estimate the spatial correlation structure from the residuals and choose a cutoff where the correlation becomes negligible.
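Mechanically, a Conley estimator is a sandwich whose meat down-weights residual cross-products by distance. A toy numpy sketch with made-up coordinates and a Bartlett (linear-taper) kernel, which declines to zero at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
coords = rng.uniform(0, 5, size=(n, 2))     # toy spatial coordinates
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# spatially correlated errors: exponential covariance in distance
dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
L = np.linalg.cholesky(np.exp(-dist / 0.5) + 1e-8 * np.eye(n))
y = X @ np.array([1.0, 0.3]) + L @ rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Bartlett weights: 1 at zero distance, 0 at and beyond the cutoff
cutoff = 1.0
W = np.clip(1 - dist / cutoff, 0.0, None)
meat = X.T @ ((W * np.outer(u, u)) @ X)     # sum_ij w_ij u_i u_j x_i x_j'
se_conley = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Varying `cutoff` here is the code analog of the robustness exercise above: pairs beyond the cutoff get zero weight, so a larger cutoff admits more spatial correlation into the variance.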
library(fixest)
# Conley SEs with 100km distance cutoff
# Requires latitude and longitude in the data
model <- feols(outcome ~ treatment + x1 + x2 | state,
data = df,
vcov = conley(cutoff = 100,
lat = "latitude",
lon = "longitude"))
summary(model)
# For robustness, try multiple cutoffs
for (d in c(50, 100, 200, 500)) {
m <- feols(outcome ~ treatment + x1 + x2 | state,
data = df,
vcov = conley(cutoff = d,
lat = "latitude",
lon = "longitude"))
cat(sprintf("Cutoff = %d km: SE = %.4f\n",
d, se(m)["treatment"]))
}

Decision Tree: Choosing Your Standard Errors
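The guide's decision logic can be condensed into a small lookup. A purely illustrative Python sketch, with thresholds taken from the rules of thumb in the sections above:

```python
def choose_se(treatment_level, n_clusters=None, spatial=False, two_dims=False):
    """Illustrative condensation of this guide's decision logic."""
    if spatial:
        return "Conley spatial SEs (report several distance cutoffs)"
    if two_dims:
        return "two-way cluster-robust SEs"
    if treatment_level == "individual":
        return "heteroscedasticity-robust SEs (HC1; HC3 in small samples)"
    # treatment or common shocks at the group level: cluster there
    if n_clusters is None or n_clusters >= 30:
        return "cluster-robust SEs at the level of treatment variation"
    if n_clusters < 10:
        return "randomization inference (or CR2 with Satterthwaite df)"
    return "wild cluster bootstrap (or CR2)"

print(choose_se("state", n_clusters=9))
```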
Common Pitfalls
Clustering at Too Fine a Level
The most common standard error mistake in applied work is clustering at too fine a level. If a state-level policy is your treatment and you cluster at the county or individual level, you are dramatically understating your standard errors because you are treating within-state variation as independent information about the treatment effect.
Clustering When It Is Not Needed
The Abadie et al. (2023) framework makes clear that clustering is not always necessary. If treatment is assigned at the individual level (e.g., an individual-level randomized experiment) and there is no sampling-based reason for correlation, heteroscedasticity-robust SEs are sufficient. Unnecessary clustering can reduce power without improving validity.
Ignoring Serial Correlation in Panels
In panel data with fixed effects, errors within a unit are typically serially correlated. Failing to cluster at the unit level (or to use Newey-West standard errors) leads to standard errors that are too small. Bertrand, Duflo, and Mullainathan (2004) showed that this problem is severe in DiD settings: without clustering, the false rejection rate can exceed 40% when the nominal level is 5%.
Summary Table
| SE Type | When to Use | Key Requirement | Packages |
|---|---|---|---|
| HC1 | Default for cross-sectional data | None beyond OLS assumptions | sandwich (R), statsmodels (Python), built-in (Stata) |
| HC3 | Small samples (n < 250) | None beyond OLS assumptions | sandwich (R), statsmodels (Python), vce(hc3) (Stata) |
| Cluster-robust | Within-group correlation or group-level treatment | 30+ clusters for reliable asymptotics | fixest (R), pyfixest (Python), reghdfe (Stata) |
| Wild cluster bootstrap | Few clusters (6--30) | Treatment varies at cluster level | fwildclusterboot (R), wildboottest (Python), boottest (Stata) |
| CR2 | Few clusters, bias correction needed | Linear model | clubSandwich (R) |
| Two-way clustering | Correlation along two dimensions | Many clusters in both dimensions | fixest (R), pyfixest (Python), reghdfe (Stata) |
| Conley spatial | Geographically correlated errors | Lat/lon coordinates, distance cutoff | fixest (R), acreg (Stata) |
| Randomization inference | Very few clusters (< 10), experiments | Known assignment mechanism | ritest (Stata), ri2 (R) |