MethodAtlas
Tutorial · 120 minutes

Lab: Staggered Difference-in-Differences

Tackle the staggered DiD problem step by step. Learn why two-way fixed effects fails with heterogeneous treatment effects, decompose TWFE with Bacon, and estimate robust alternatives using Callaway-Sant'Anna and Sun-Abraham.

Overview

In this lab you will estimate the effect of a state-level policy that is adopted at different times across states. You will discover that the standard two-way fixed effects (TWFE) estimator can be severely biased when treatment effects vary over time, and you will implement modern robust alternatives. This development represents a significant recent methodological advance in applied microeconometrics.

What you will learn:

  • Why standard TWFE can give negative estimates even when all treatment effects are positive
  • How to decompose the TWFE estimator using the Bacon decomposition
  • How to estimate robust group-time average treatment effects with Callaway and Sant'Anna (2021)
  • How to estimate interaction-weighted estimators with Sun and Abraham (2021)
  • How to compare and interpret results across methods

Prerequisites: Familiarity with difference-in-differences and two-way fixed effects (see the DiD and FE labs).


Step 1: Simulate Staggered Adoption Data

We create a panel of 40 states over 20 years. States adopt a policy in different years, and the treatment effect grows over time since adoption (dynamic treatment effects). A control group of states never adopts.

library(fixest)
library(did)
library(bacondecomp)
library(modelsummary)

set.seed(42)
n_states <- 40
n_years <- 20
years <- 2000:(2000 + n_years - 1)

adoption <- c(rep(0, 10), rep(2005, 8), rep(2010, 8),
              rep(2015, 7), rep(2017, 7))

state_fe <- rnorm(n_states, sd = 2)
year_fe <- seq(0, 3, length.out = n_years)

df <- expand.grid(state = 1:n_states, year = years)
df$state_fe <- state_fe[df$state]
df$year_fe <- year_fe[df$year - 1999]
df$adoption_year <- adoption[df$state]
df$treat <- as.integer(df$year >= df$adoption_year & df$adoption_year > 0)
df$years_since <- ifelse(df$treat == 1, df$year - df$adoption_year, 0)
df$tau_true <- ifelse(df$treat == 1, 0.5 + 0.1 * df$years_since, 0)
df$y <- df$state_fe + df$year_fe + df$tau_true + rnorm(nrow(df))
df$cohort <- ifelse(df$adoption_year > 0,
                   as.character(df$adoption_year), "Never")
df$state <- factor(df$state)
df$year_f <- factor(df$year)

cat("True ATT:", mean(df$tau_true[df$treat == 1]), "\n")

Expected output:

state  year   y    treat  adoption_year  cohort  tau_true
    1  2000  -1.5      0              0  Never        0.0
    1  2001  -1.3      0              0  Never        0.0
   11  2005   2.1      1           2005  2005         0.5
   11  2006   2.9      1           2005  2005         0.6
   11  2010   3.8      1           2005  2005         1.0

Panel: 40 states x 20 years = 800 obs

Cohort sizes:
  Never: 10 states
  2005:   8 states (early adopters)
  2010:   8 states (middle adopters)
  2015:   7 states (late adopters)
  2017:   7 states (very late adopters)

True ATT (all treated): ~1.05

The true ATT varies because treatment effects grow over time (tau = 0.5 + 0.1 * years_since_treatment). Early adopters have accumulated larger effects by the end of the panel.
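Because tau depends only on time since adoption, the cohort-specific true ATTs can be computed analytically. The following base-R check follows directly from the DGP above (cohort sizes and years are taken from the simulation):

```r
# Cohort-specific true ATTs implied by tau = 0.5 + 0.1 * years_since.
# A cohort adopting in year g is treated for the periods g, ..., 2019.
cohorts <- c(2005, 2010, 2015, 2017)
sizes   <- c(8, 8, 7, 7)                 # states per cohort
cohort_att <- sapply(cohorts, function(g) mean(0.5 + 0.1 * (0:(2019 - g))))
names(cohort_att) <- cohorts
print(round(cohort_att, 2))

# The overall true ATT weights each cohort by its treated state-years
n_post  <- 2019 - cohorts + 1
overall <- sum(cohort_att * sizes * n_post) / sum(sizes * n_post)
cat("Overall true ATT:", round(overall, 2), "\n")
```

Early adopters accumulate the largest average effects, and the overall figure matches the simulated mean of `tau_true` (about 1.0).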


Step 2: The TWFE Regression (and Its Problems)

# Standard TWFE
m_twfe <- feols(y ~ treat | state + year_f, data = df, vcov = ~state)

true_att <- mean(df$tau_true[df$treat == 1])
cat("=== Standard TWFE ===\n")
cat("TWFE estimate:", coef(m_twfe)["treat"], "\n")
cat("True ATT:", true_att, "\n")
cat("Bias:", coef(m_twfe)["treat"] - true_att, "\n")
cat("SE (clustered):", se(m_twfe)["treat"], "\n")

Expected output:

=== Standard TWFE ===
TWFE estimate:    ~0.65
True ATT:         ~1.05
Bias:             ~-0.40
SE (clustered):   ~0.10

The TWFE estimate is biased because treatment effects
are heterogeneous across time. TWFE puts negative weight
on some group-time treatment effects.

The TWFE estimate (~0.65) substantially underestimates the true ATT (~1.05). This underestimation occurs because TWFE implicitly uses early-adopter states (whose effects have grown large) as controls for late-adopter states, generating "forbidden comparisons" that pull the estimate downward.
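The forbidden-comparison mechanics can be seen in a noise-free toy calculation built from the DGP's tau function (the specific years here are chosen for illustration):

```r
# A "forbidden" 2x2: an early adopter (g = 2005) serves as the control
# for a late adopter (g = 2015) over the window 2014 vs 2016.
tau <- function(year, g) ifelse(year >= g, 0.5 + 0.1 * (year - g), 0)

late_change  <- tau(2016, 2015) - tau(2014, 2015)   # 0.6 - 0.0 = 0.6
early_change <- tau(2016, 2005) - tau(2014, 2005)   # 1.6 - 1.4 = 0.2
did_forbidden <- late_change - early_change

cat("Forbidden 2x2:", did_forbidden,
    "vs true effect at e = +1:", tau(2016, 2015), "\n")
# The early adopter's still-growing effect (0.2) is differenced out,
# biasing the 2x2 downward: 0.4 < 0.6
```

Averaging many such contaminated 2x2s into the TWFE coefficient produces the downward bias seen above.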

Concept Check

The true average treatment effect on the treated is about 1.0, but your TWFE estimate is only 0.6. What is causing this underestimation?


Step 3: Bacon Decomposition

The Bacon decomposition (Goodman-Bacon, 2021) shows exactly which 2x2 comparisons make up the TWFE estimate and what weights they receive.

# Bacon decomposition using the bacondecomp package
# bacon() wants a binary treatment in the formula and numeric id/time variables
df_bacon <- data.frame(state = as.numeric(df$state),
                       year  = df$year,
                       y     = df$y,
                       treat = df$treat)

bacon_out <- bacon(y ~ treat,
                   data = df_bacon,
                   id_var = "state",
                   time_var = "year")

# Show the decomposition
print(bacon_out)

# The weighted average of the 2x2 estimates equals the TWFE coefficient
cat("\nWeighted average:", sum(bacon_out$estimate * bacon_out$weight), "\n")
cat("TWFE estimate:", coef(m_twfe)["treat"], "\n")
Requires: bacondecomp

Expected output:

Comparison              2x2 DiD Estimate   True ATT for Cohort
Cohort 2005 vs Never    ~1.25              ~1.25
Cohort 2010 vs Never    ~0.75              ~0.75
Cohort 2015 vs Never    ~0.60              ~0.60
Cohort 2017 vs Never    ~0.50              ~0.50

TWFE combines all these comparisons, including ones that use
already-treated cohorts as controls for late adopters.

Clean comparisons (treated vs. never-treated) recover accurate estimates. The bias in TWFE arises from the additional comparisons where early adopters serve as controls for late adopters: their growing treatment effects create a "moving baseline" that contaminates the estimate.


Step 4: Callaway and Sant'Anna (2021)

The Callaway-Sant'Anna estimator computes group-time ATTs — the treatment effect for each cohort at each post-treatment period — using only clean comparisons (never-treated or not-yet-treated as controls).

# Callaway-Sant'Anna using the did package
df$id <- as.numeric(df$state)
df$G <- ifelse(df$adoption_year > 0, df$adoption_year, 0)

cs_out <- att_gt(
  yname = "y",
  tname = "year",
  idname = "id",
  gname = "G",
  data = as.data.frame(df),
  control_group = "nevertreated"
)

# Summary
summary(cs_out)

# Aggregate to simple ATT
cs_agg <- aggte(cs_out, type = "simple")
summary(cs_agg)

cat("\nCS simple ATT:", cs_agg$overall.att, "\n")
cat("TWFE:", coef(m_twfe)["treat"], "\n")
cat("True ATT:", true_att, "\n")

# Event study aggregation
cs_es <- aggte(cs_out, type = "dynamic")
ggdid(cs_es)
Requires: did

Expected output:

Callaway-Sant'Anna (simple aggregation):
  Estimated ATT: ~1.05
  True ATT:      ~1.05
  TWFE estimate: ~0.65

CS is much closer to the truth than TWFE.

Expected output: Group-time ATTs (selected)

Cohort  Year  ATT(g,t)  True Effect
2005    2005  ~0.50     0.50
2005    2008  ~0.80     0.80
2005    2015  ~1.50     1.50
2010    2010  ~0.50     0.50
2010    2015  ~1.00     1.00
2015    2015  ~0.50     0.50
2017    2017  ~0.50     0.50
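The group-time building block can be verified by hand in the noise-free version of the DGP. This sketch reproduces ATT(2005, 2008) from the table above, using the pre-treatment base period g - 1 = 2004:

```r
# ATT(g, t): change in mean outcomes for cohort g from period g-1 to t,
# minus the same change for never-treated states. Year effects follow
# seq(0, 3, length.out = 20) from the simulation, i.e. (year - 2000) * 3/19.
tau     <- function(year, g) ifelse(year >= g, 0.5 + 0.1 * (year - g), 0)
year_fe <- function(year) (year - 2000) * 3 / 19

treated_change <- (year_fe(2008) + tau(2008, 2005)) -
                  (year_fe(2004) + tau(2004, 2005))
control_change <- year_fe(2008) - year_fe(2004)
att_2005_2008  <- treated_change - control_change

cat("ATT(2005, 2008):", round(att_2005_2008, 2), "\n")  # year effects cancel
```

Because the comparison group is never treated over the whole window, the common year effects difference out and only the true effect at e = +3 remains.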

Step 5: Sun and Abraham (2021)

Sun and Abraham (2021) propose an interaction-weighted estimator that corrects TWFE by interacting cohort indicators with relative time indicators.

# Sun and Abraham using fixest::sunab()
df$G_sa <- ifelse(df$adoption_year > 0, df$adoption_year, 10000)

m_sa <- feols(y ~ sunab(G_sa, year) | state + year_f,
              data = df, vcov = ~state)
summary(m_sa)

# Event study plot
iplot(m_sa, main = "Sun-Abraham Event Study")

# Aggregate ATT
cat("\nSun-Abraham aggregate ATT:", summary(m_sa, agg = "ATT")$coeftable[1], "\n")
cat("Callaway-Sant'Anna ATT:", cs_agg$overall.att, "\n")
cat("TWFE:", coef(m_twfe)["treat"], "\n")
cat("True:", true_att, "\n")
Requires: fixest

Expected output: Sun-Abraham event study

Relative Time (e)  SA Estimate  True Effect
e = 0              ~0.50        0.50
e = +1             ~0.60        0.60
e = +2             ~0.70        0.70
e = +3             ~0.80        0.80
e = +5             ~1.00        1.00
e = +7             ~1.20        1.20

The Sun-Abraham estimates correctly recover the dynamic treatment effects, showing the linear growth pattern from the DGP. Each estimate is a cohort-size-weighted average of the cohort-specific effects at that relative time.
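The cohort-share weighting can be sketched in base R. In this DGP every cohort shares the same effect at a given relative time, so the weighted average simply recovers tau(e) = 0.5 + 0.1 * e (cohort sizes are taken from the simulation; the weighting scheme shown is a stylized version of the interaction-weighted average):

```r
# At relative time e, average cohort-specific effects with weights equal to
# each cohort's share of treated states observed at that relative time.
cohorts <- c(2005, 2010, 2015, 2017)
sizes   <- c(8, 8, 7, 7)
e <- 2
in_sample <- (cohorts + e) <= 2019        # cohorts with data at relative time e
w    <- sizes[in_sample] / sum(sizes[in_sample])
sa_e <- sum(w * (0.5 + 0.1 * e))          # cohort effects are identical here

cat("Interaction-weighted estimate at e = +2 (noise-free):", sa_e, "\n")
```

When cohort effects differ at the same relative time, these weights are what keep the aggregate interpretable, unlike the implicit (possibly negative) TWFE weights.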

Concept Check

Both Callaway-Sant'Anna and Sun-Abraham produce estimates close to the true ATT, while TWFE is biased. What is the key difference in how these estimators handle staggered treatment?


Step 6: Compare All Estimates

# Summary comparison
library(ggplot2)  # for ggtitle()

comparison <- data.frame(
  Method   = c("TWFE", "Callaway-Sant'Anna", "Sun-Abraham"),
  Estimate = c(coef(m_twfe)["treat"],
               cs_agg$overall.att,
               summary(m_sa, agg = "ATT")$coeftable[1]),
  True_ATT = true_att
)
print(comparison, row.names = FALSE, digits = 3)

# Side-by-side event study
cs_es <- aggte(cs_out, type = "dynamic")
ggdid(cs_es) + ggtitle("Callaway-Sant'Anna Event Study")

Expected output:

Method               Estimate  True ATT  Bias    Bias (%)
TWFE                 ~0.65     ~1.05     ~-0.40  ~-38%
Callaway-Sant'Anna   ~1.05     ~1.05     ~0.00   ~0%
Sun-Abraham          ~1.05     ~1.05     ~0.00   ~0%

Step 7: Pre-Trends and Diagnostics

# Pre-trends from the CS event study
cs_es <- aggte(cs_out, type = "dynamic", min_e = -5, max_e = 10)
summary(cs_es)

# Visual check (ggdid returns a ggplot object, so ggplot2 must be attached)
library(ggplot2)
ggdid(cs_es) +
  ggtitle("Event Study with Pre-Trends") +
  geom_hline(yintercept = 0, linetype = "dashed")

Expected output: Pre-trends check

Relative Time  Pre-Treatment ATT  Significant?
e = -3         ~0.02              No
e = -2         ~-0.01             No
Pre-treatment estimates should be close to zero.

All pre-treatment estimates are statistically indistinguishable from zero, supporting the parallel trends assumption. This pattern is expected because treatment is not anticipated in the DGP.


Step 8: Exercises

Try these on your own:

  1. Constant treatment effects. Modify the simulation so that tau = 0.5 for all cohorts and all post-periods (no dynamics). Show that TWFE recovers the correct estimate in this case.

  2. Not-yet-treated controls. Re-estimate the CS model using control_group = "notyettreated". Compare with the never-treated control group. When might this matter?

  3. Rambachan-Roth sensitivity. Install HonestDiD (R) and assess how sensitive your results are to possible violations of parallel trends.

  4. de Chaisemartin and D'Haultfoeuille. Estimate the treatment effect using their did_multiplegt estimator. Compare with CS and SA.

  5. Heterogeneity analysis. Modify the simulation so that early adopters have larger treatment effects than late adopters. How does this additional heterogeneity affect the TWFE bias?


Summary

In this lab you learned:

  • Standard TWFE can be severely biased in staggered designs when treatment effects vary across cohorts or over time
  • The Bacon decomposition reveals that TWFE is a weighted average of all 2x2 DiD comparisons, some of which use already-treated units as controls
  • Callaway and Sant'Anna (2021) estimate clean group-time ATTs using never-treated or not-yet-treated units as controls
  • Sun and Abraham (2021) correct TWFE by using interaction-weighted estimation with cohort-specific relative time effects
  • Both robust estimators recover the true dynamic treatment effects, while TWFE produces a misleading single number
  • Reporting event study plots with pre-trends is increasingly common in practice, and being transparent about the comparison between TWFE and robust estimators strengthens credibility