MethodAtlas
Tutorial · 120 minutes

Lab: Staggered Difference-in-Differences

Tackle the staggered DiD problem step by step. Learn why two-way fixed effects fails with heterogeneous treatment effects, decompose TWFE with Bacon, and estimate robust alternatives using Callaway-Sant'Anna and Sun-Abraham.

Overview

In this lab you will estimate the effect of a state-level policy that is adopted at different times across states. You will discover that the standard two-way fixed effects (TWFE) estimator can be severely biased when treatment effects vary over time, and you will implement modern robust alternatives. This development represents a significant recent methodological advance in applied microeconometrics.

What you will learn:

  • Why standard TWFE can give negative estimates even when all treatment effects are positive
  • How to decompose the TWFE estimator using the Bacon decomposition
  • How to estimate robust group-time average treatment effects with Callaway and Sant'Anna (2021)
  • How to estimate interaction-weighted estimators with Sun and Abraham (2021)
  • How to compare and interpret results across methods

Prerequisites: Familiarity with difference-in-differences and two-way fixed effects (see the DiD and FE labs).


Step 1: Simulate Staggered Adoption Data

We create a panel of 40 states over 20 years. States adopt a policy in different years, and the treatment effect grows over time since adoption (dynamic treatment effects). A control group of states never adopts.

library(fixest)
library(did)
library(bacondecomp)
library(modelsummary)

set.seed(42)
n_states <- 40
n_years <- 20
years <- 2000:(2000 + n_years - 1)

adoption <- c(rep(0, 10), rep(2005, 8), rep(2010, 8),
              rep(2015, 7), rep(2017, 7))

state_fe <- rnorm(n_states, sd = 2)
year_fe <- seq(0, 3, length.out = n_years)

df <- expand.grid(state = 1:n_states, year = years)
df$state_fe <- state_fe[df$state]
df$year_fe <- year_fe[df$year - 1999]
df$adoption_year <- adoption[df$state]
df$treat <- as.integer(df$year >= df$adoption_year & df$adoption_year > 0)
df$years_since <- ifelse(df$treat == 1, df$year - df$adoption_year, 0)
df$tau_true <- ifelse(df$treat == 1, 0.5 + 0.1 * df$years_since, 0)
df$y <- df$state_fe + df$year_fe + df$tau_true + rnorm(nrow(df))
df$cohort <- ifelse(df$adoption_year > 0,
                   as.character(df$adoption_year), "Never")
df$state <- factor(df$state)
df$year_f <- factor(df$year)

cat("True ATT:", mean(df$tau_true[df$treat == 1]), "\n")

Expected output:

state  year   y    treat  adoption_year  cohort  tau_true
    1  2000  -1.5      0              0  Never        0.0
    1  2001  -1.3      0              0  Never        0.0
   11  2005   2.1      1           2005  2005         0.5
   11  2006   2.9      1           2005  2005         0.6
   11  2010   3.8      1           2005  2005         1.0

Panel: 40 states x 20 years = 800 obs

Cohort sizes:
  Never: 10 states
  2005:   8 states (early adopters)
  2010:   8 states (middle adopters)
  2015:   7 states (late adopters)
  2017:   7 states (very late adopters)

True ATT (all treated): ~1.05

The true ATT varies because treatment effects grow over time (tau = 0.5 + 0.1 * years_since_treatment). Early adopters have accumulated larger effects by the end of the panel.
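Because tau depends only on time since adoption, the cohort-specific true ATTs can be computed analytically. The following base-R check follows directly from the DGP above (cohort sizes and years are taken from the simulation):

```r
# Cohort-specific true ATTs implied by tau = 0.5 + 0.1 * years_since.
# A cohort adopting in year g is treated for the periods g, ..., 2019.
cohorts <- c(2005, 2010, 2015, 2017)
sizes   <- c(8, 8, 7, 7)                 # states per cohort
cohort_att <- sapply(cohorts, function(g) mean(0.5 + 0.1 * (0:(2019 - g))))
names(cohort_att) <- cohorts
print(round(cohort_att, 2))

# The overall true ATT weights each cohort by its treated state-years
n_post  <- 2019 - cohorts + 1
overall <- sum(cohort_att * sizes * n_post) / sum(sizes * n_post)
cat("Overall true ATT:", round(overall, 2), "\n")
```

Early adopters accumulate the largest average effects, and the overall figure matches the simulated mean of `tau_true` (about 1.0).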


Step 2: The TWFE Regression (and Its Problems)

# Standard TWFE
m_twfe <- feols(y ~ treat | state + year_f, data = df, vcov = ~state)

true_att <- mean(df$tau_true[df$treat == 1])
cat("=== Standard TWFE ===\n")
cat("TWFE estimate:", coef(m_twfe)["treat"], "\n")
cat("True ATT:", true_att, "\n")
cat("Bias:", coef(m_twfe)["treat"] - true_att, "\n")
cat("SE (clustered):", se(m_twfe)["treat"], "\n")

Expected output:

=== Standard TWFE ===
TWFE estimate:    ~0.65
True ATT:         ~1.05
Bias:             ~-0.40
SE (clustered):   ~0.10

The TWFE estimate is biased because treatment effects
are heterogeneous across time. TWFE puts negative weight
on some group-time treatment effects.

The TWFE estimate (~0.65) substantially underestimates the true ATT (~1.05). This underestimation occurs because TWFE implicitly uses early-adopter states (whose effects have grown large) as controls for late-adopter states, generating "forbidden comparisons" that pull the estimate downward.
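The forbidden-comparison mechanics can be seen in a noise-free toy calculation built from the DGP's tau function (the specific years here are chosen for illustration):

```r
# A "forbidden" 2x2: an early adopter (g = 2005) serves as the control
# for a late adopter (g = 2015) over the window 2014 vs 2016.
tau <- function(year, g) ifelse(year >= g, 0.5 + 0.1 * (year - g), 0)

late_change  <- tau(2016, 2015) - tau(2014, 2015)   # 0.6 - 0.0 = 0.6
early_change <- tau(2016, 2005) - tau(2014, 2005)   # 1.6 - 1.4 = 0.2
did_forbidden <- late_change - early_change

cat("Forbidden 2x2:", did_forbidden,
    "vs true effect at e = +1:", tau(2016, 2015), "\n")
# The early adopter's still-growing effect (0.2) is differenced out,
# biasing the 2x2 downward: 0.4 < 0.6
```

Averaging many such contaminated 2x2s into the TWFE coefficient produces the downward bias seen above.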

Concept Check

The true average treatment effect on the treated is about 1.0, but your TWFE estimate is only 0.6. What is causing this underestimation?


Step 3: Bacon Decomposition

The Bacon decomposition (Goodman-Bacon, 2021) shows exactly which 2x2 comparisons make up the TWFE estimate and what weights they receive.

# Bacon decomposition using the bacondecomp package
# bacon() wants a binary treatment in the formula and numeric id/time variables
df_bacon <- data.frame(state = as.numeric(df$state),
                       year  = df$year,
                       y     = df$y,
                       treat = df$treat)

bacon_out <- bacon(y ~ treat,
                   data = df_bacon,
                   id_var = "state",
                   time_var = "year")

# Show the decomposition
print(bacon_out)

# The weighted average of the 2x2 estimates equals the TWFE coefficient
cat("\nWeighted average:", sum(bacon_out$estimate * bacon_out$weight), "\n")
cat("TWFE estimate:", coef(m_twfe)["treat"], "\n")
Requires: bacondecomp

Expected output:

Comparison              2x2 DiD Estimate   True ATT for Cohort
Cohort 2005 vs Never    ~1.25              ~1.25
Cohort 2010 vs Never    ~0.75              ~0.75
Cohort 2015 vs Never    ~0.60              ~0.60
Cohort 2017 vs Never    ~0.50              ~0.50

TWFE combines all these comparisons, including ones that use
already-treated cohorts as controls for late adopters.

Clean comparisons (treated vs. never-treated) recover accurate estimates. The bias in TWFE arises from the additional comparisons where early adopters serve as controls for late adopters: their growing treatment effects create a "moving baseline" that contaminates the estimate.


Step 4: Callaway and Sant'Anna (2021)

The Callaway-Sant'Anna estimator computes group-time ATTs — the treatment effect for each cohort at each post-treatment period — using only clean comparisons (never-treated or not-yet-treated as controls).

# Callaway-Sant'Anna using the did package
df$id <- as.numeric(df$state)
df$G <- ifelse(df$adoption_year > 0, df$adoption_year, 0)

cs_out <- att_gt(
  yname = "y",
  tname = "year",
  idname = "id",
  gname = "G",
  data = as.data.frame(df),
  control_group = "nevertreated"
)

# Summary
summary(cs_out)

# Aggregate to simple ATT
cs_agg <- aggte(cs_out, type = "simple")
summary(cs_agg)

cat("\nCS simple ATT:", cs_agg$overall.att, "\n")
cat("TWFE:", coef(m_twfe)["treat"], "\n")
cat("True ATT:", true_att, "\n")

# Event study aggregation
cs_es <- aggte(cs_out, type = "dynamic")
ggdid(cs_es)
Requires: did

Expected output:

Callaway-Sant'Anna (simple aggregation):
  Estimated ATT: ~1.05
  True ATT:      ~1.05
  TWFE estimate: ~0.65

CS is much closer to the truth than TWFE.

Expected output: Group-time ATTs (selected)

Cohort  Year  ATT(g,t)  True Effect
2005    2005  ~0.50     0.50
2005    2008  ~0.80     0.80
2005    2015  ~1.50     1.50
2010    2010  ~0.50     0.50
2010    2015  ~1.00     1.00
2015    2015  ~0.50     0.50
2017    2017  ~0.50     0.50
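The group-time building block can be verified by hand in the noise-free version of the DGP. This sketch reproduces ATT(2005, 2008) from the table above, using the pre-treatment base period g - 1 = 2004:

```r
# ATT(g, t): change in mean outcomes for cohort g from period g-1 to t,
# minus the same change for never-treated states. Year effects follow
# seq(0, 3, length.out = 20) from the simulation, i.e. (year - 2000) * 3/19.
tau     <- function(year, g) ifelse(year >= g, 0.5 + 0.1 * (year - g), 0)
year_fe <- function(year) (year - 2000) * 3 / 19

treated_change <- (year_fe(2008) + tau(2008, 2005)) -
                  (year_fe(2004) + tau(2004, 2005))
control_change <- year_fe(2008) - year_fe(2004)
att_2005_2008  <- treated_change - control_change

cat("ATT(2005, 2008):", round(att_2005_2008, 2), "\n")  # year effects cancel
```

Because the comparison group is never treated over the whole window, the common year effects difference out and only the true effect at e = +3 remains.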

Step 5: Sun and Abraham (2021)

Sun and Abraham (2021) propose an interaction-weighted estimator that corrects TWFE by interacting cohort indicators with relative time indicators.

# Sun and Abraham using fixest::sunab()
df$G_sa <- ifelse(df$adoption_year > 0, df$adoption_year, 10000)

m_sa <- feols(y ~ sunab(G_sa, year) | state + year_f,
              data = df, vcov = ~state)
summary(m_sa)

# Event study plot
iplot(m_sa, main = "Sun-Abraham Event Study")

# Aggregate ATT
cat("\nSun-Abraham aggregate ATT:", summary(m_sa, agg = "ATT")$coeftable[1], "\n")
cat("Callaway-Sant'Anna ATT:", cs_agg$overall.att, "\n")
cat("TWFE:", coef(m_twfe)["treat"], "\n")
cat("True:", true_att, "\n")
Requires: fixest

Expected output: Sun-Abraham event study

Relative Time (e)  SA Estimate  True Effect
e = 0              ~0.50        0.50
e = +1             ~0.60        0.60
e = +2             ~0.70        0.70
e = +3             ~0.80        0.80
e = +5             ~1.00        1.00
e = +7             ~1.20        1.20

The Sun-Abraham estimates correctly recover the dynamic treatment effects, showing the linear growth pattern from the DGP. Each estimate is a cohort-size-weighted average of the cohort-specific effects at that relative time.
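The cohort-share weighting can be sketched in base R. In this DGP every cohort shares the same effect at a given relative time, so the weighted average simply recovers tau(e) = 0.5 + 0.1 * e (cohort sizes are taken from the simulation; the weighting scheme shown is a stylized version of the interaction-weighted average):

```r
# At relative time e, average cohort-specific effects with weights equal to
# each cohort's share of treated states observed at that relative time.
cohorts <- c(2005, 2010, 2015, 2017)
sizes   <- c(8, 8, 7, 7)
e <- 2
in_sample <- (cohorts + e) <= 2019        # cohorts with data at relative time e
w    <- sizes[in_sample] / sum(sizes[in_sample])
sa_e <- sum(w * (0.5 + 0.1 * e))          # cohort effects are identical here

cat("Interaction-weighted estimate at e = +2 (noise-free):", sa_e, "\n")
```

When cohort effects differ at the same relative time, these weights are what keep the aggregate interpretable, unlike the implicit (possibly negative) TWFE weights.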

Concept Check

Both Callaway-Sant'Anna and Sun-Abraham produce estimates close to the true ATT, while TWFE is biased. What is the key difference in how these estimators handle staggered treatment?


Step 6: Compare All Estimates

# Summary comparison
library(ggplot2)  # for ggtitle()

comparison <- data.frame(
  Method   = c("TWFE", "Callaway-Sant'Anna", "Sun-Abraham"),
  Estimate = c(coef(m_twfe)["treat"],
               cs_agg$overall.att,
               summary(m_sa, agg = "ATT")$coeftable[1]),
  True_ATT = true_att
)
print(comparison, row.names = FALSE, digits = 3)

# Side-by-side event study
cs_es <- aggte(cs_out, type = "dynamic")
ggdid(cs_es) + ggtitle("Callaway-Sant'Anna Event Study")

Expected output:

Method               Estimate  True ATT  Bias    Bias (%)
TWFE                 ~0.65     ~1.05     ~-0.40  ~-38%
Callaway-Sant'Anna   ~1.05     ~1.05     ~0.00   ~0%
Sun-Abraham          ~1.05     ~1.05     ~0.00   ~0%

Step 7: Pre-Trends and Diagnostics

# Pre-trends from the CS event study
cs_es <- aggte(cs_out, type = "dynamic", min_e = -5, max_e = 10)
summary(cs_es)

# Visual check (ggdid returns a ggplot object, so ggplot2 must be attached)
library(ggplot2)
ggdid(cs_es) +
  ggtitle("Event Study with Pre-Trends") +
  geom_hline(yintercept = 0, linetype = "dashed")

Expected output: Pre-trends check

Relative Time  Pre-Treatment ATT  Significant?
e = -3         ~0.02              No
e = -2         ~-0.01             No
Pre-treatment estimates should be close to zero.

All pre-treatment estimates are statistically indistinguishable from zero, supporting the parallel trends assumption. This pattern is expected because treatment is not anticipated in the DGP.


Step 8: Exercises

Try these on your own:

  1. Constant treatment effects. Modify the simulation so that tau = 0.5 for all cohorts and all post-periods (no dynamics). Show that TWFE recovers the correct estimate in this case.

  2. Not-yet-treated controls. Re-estimate the CS model using control_group = "notyettreated". Compare with the never-treated control group. When might this matter?

  3. Rambachan-Roth sensitivity. Install HonestDiD (R) and assess how sensitive your results are to possible violations of parallel trends.

  4. de Chaisemartin and D'Haultfoeuille. Estimate the treatment effect using their did_multiplegt estimator. Compare with CS and SA.

  5. Heterogeneity analysis. Modify the simulation so that early adopters have larger treatment effects than late adopters. How does this additional heterogeneity affect the TWFE bias?


Summary

In this lab you learned:

  • Standard TWFE can be severely biased in staggered designs when treatment effects vary across cohorts or over time
  • The Bacon decomposition reveals that TWFE is a weighted average of all 2x2 DiD comparisons, some of which use already-treated units as controls
  • Callaway and Sant'Anna (2021) estimate clean group-time ATTs using never-treated or not-yet-treated units as controls
  • Sun and Abraham (2021) correct TWFE by using interaction-weighted estimation with cohort-specific relative time effects
  • Both robust estimators recover the true dynamic treatment effects, while TWFE produces a misleading single number
  • Reporting event study plots with pre-trends is increasingly common in practice, and being transparent about the comparison between TWFE and robust estimators strengthens credibility