MethodAtlas
Replication · 120 minutes

Replication Lab: How Much Should We Trust Staggered DiD?

Replicate the key findings from Baker et al. (2022) on the pitfalls of staggered difference-in-differences. Simulate a staggered adoption panel, run naive TWFE, show the Goodman-Bacon decomposition, and estimate treatment effects using Callaway-Sant'Anna and Sun-Abraham estimators.

Overview

In this replication lab, you will explore the key insights from one of the most influential recent methodological papers in applied economics:

Baker, Andrew C., David F. Larcker, and Charles C.Y. Wang. 2022. "How Much Should We Trust Staggered Difference-in-Differences Estimates?" Journal of Financial Economics 144(2): 370–395.

Baker et al. demonstrate that the standard two-way fixed effects (TWFE) estimator can produce severely biased estimates when treatment effects are heterogeneous across cohorts or over time. Using Monte Carlo simulations calibrated to common research designs in finance, the paper shows that TWFE can yield the wrong sign, wrong magnitude, or misleading inference. The paper then benchmarks several newly proposed estimators — including Callaway and Sant'Anna (2021), Sun and Abraham (2021), and de Chaisemartin and D'Haultfoeuille (2020) — that remain valid under heterogeneity.

Why the Baker et al. paper matters: It provided the first comprehensive comparison of heterogeneity-robust DiD estimators in a single framework, demonstrating that the choice of estimator can flip the sign of estimated effects. The paper helped drive widespread adoption of robust estimators in applied research.

What you will do:

  • Simulate a staggered adoption panel with heterogeneous treatment effects
  • Estimate treatment effects using naive TWFE and observe the bias
  • Decompose the TWFE estimator using the Goodman-Bacon decomposition
  • Estimate treatment effects using the Callaway-Sant'Anna estimator
  • Estimate treatment effects using the Sun-Abraham interaction-weighted estimator
  • Compare all estimators against the known true treatment effect

Step 1: Simulate the Staggered Adoption Panel

The DGP features a balanced panel of 1,000 units observed over 30 periods. Units adopt treatment in one of three cohorts (periods 10, 20, and 30) or never adopt, and treatment effects differ by cohort and grow with time since treatment.

library(fixest)
library(did)
library(data.table)

set.seed(2022)

N <- 1000; TT <- 30
cohort_assign <- sample(rep(c(10, 20, 30, 0), each = N / 4))

dt <- CJ(unit = 1:N, time = 1:TT)
dt[, cohort := cohort_assign[unit]]
dt[, treated := as.integer(cohort > 0 & time >= cohort)]

# Unit and time FEs
dt[, alpha_i := rnorm(1), by = unit]
dt[, delta_t := rnorm(1, 0, 0.5), by = time]

# Heterogeneous treatment effects: levels differ by cohort, and effects
# grow with time since treatment for cohorts 10 and 20
dt[, time_since := fifelse(treated == 1, time - cohort, 0)]
dt[, tau := fcase(
  cohort == 10 & treated == 1, 2 + 0.3 * time_since,
  cohort == 20 & treated == 1, 5 + 0.1 * time_since,
  cohort == 30 & treated == 1, 1.0,
  default = 0
)]

dt[, y := alpha_i + delta_t + tau * treated + rnorm(.N)]

true_att <- dt[treated == 1, mean(tau)]
cat("Panel:", N, "units x", TT, "periods =", nrow(dt), "obs\n")
counts <- dt[time == 1, .N, keyby = cohort]
cat(sprintf("Cohort %s: %d units\n", counts$cohort, counts$N))
cat("True ATT (among treated obs):", round(true_att, 3), "\n")

Expected output:

Panel: 1000 units x 30 periods = 30000 obs
Cohort 0: 250 units
Cohort 10: 250 units
Cohort 20: 250 units
Cohort 30: 250 units
True ATT (among treated obs): 5.045

The true ATT depends on the mix of cohort-specific effects and the time elapsed since treatment. Cohort 20 has the largest instantaneous effect (5.0), cohort 10 has the largest cumulative effect (its effect grows to 8.0 after 20 treated periods), and cohort 30 has only one treated period with a small effect of 1.0. Averaging over all treated observations gives a true ATT of about 5.05.
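Because tau is deterministic and cohort sizes are exact (250 units each), the true ATT can be verified analytically, without any simulation — a quick sanity check:

```r
# Analytic true ATT implied by the DGP: average tau over treated observations.
ts10 <- 0:20                  # time since treatment, cohort 10 (periods 10-30)
ts20 <- 0:10                  # cohort 20 (periods 20-30)
tau10 <- 2 + 0.3 * ts10
tau20 <- 5 + 0.1 * ts20
tau30 <- 1.0                  # cohort 30: a single treated period
n_units <- 250                # units per cohort

total_tau     <- n_units * (sum(tau10) + sum(tau20) + sum(tau30))
n_treated_obs <- n_units * (length(ts10) + length(ts20) + 1)
true_att      <- total_tau / n_treated_obs
round(true_att, 3)            # 5.045
```

Cohort 10 contributes the most treated observations (21 periods), which is why the overall ATT sits closer to its mean effect of 5.0 than to cohort 30's effect of 1.0.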


Step 2: Run Naive TWFE and Observe the Bias

The standard TWFE regression includes unit and time fixed effects with a single treatment indicator. Under homogeneous treatment effects, TWFE recovers the ATT. Under heterogeneity, it does not.

# Naive TWFE
twfe <- feols(y ~ treated | unit + time, data = dt)
summary(twfe)

cat("\nTrue ATT:", round(true_att, 3), "\n")
cat("TWFE estimate:", round(coef(twfe)["treated"], 3), "\n")
cat("Bias:", round(coef(twfe)["treated"] - true_att, 3), "\n")

Expected output — TWFE vs. True ATT:

Estimator      Coefficient   SE      True ATT   Bias      Bias (%)
Naive TWFE     ~3.20         ~0.06   ~5.05      ~ -1.85   ~ -37%

The TWFE estimator substantially underestimates the true ATT. The bias arises because TWFE uses already-treated units (whose treatment effects have been growing) as implicit controls for newly-treated units. The resulting "forbidden comparisons" produce negative weights on some treatment effects, pulling the estimate downward.
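A forbidden comparison can be made concrete with a minimal hand-computed sketch on a noiseless version of the DGP: a 2x2 DiD that uses already-treated cohort 10 as the "control" for cohort 20. Because cohort 10's effect keeps growing over the window, its growth is subtracted off, pulling the estimate far below cohort 20's true mean effect of 5.5 over its treated periods:

```r
library(data.table)

# Noiseless tau paths for cohorts 10 and 20 (same DGP, no FEs or noise needed:
# they difference out of the 2x2)
dt0 <- CJ(cohort = c(10, 20), time = 1:30)
dt0[, tau := 0]
dt0[cohort == 10 & time >= 10, tau := 2 + 0.3 * (time - 10)]
dt0[cohort == 20 & time >= 20, tau := 5 + 0.1 * (time - 20)]

# 2x2 window: cohort 20 is "treated", already-treated cohort 10 is the control;
# pre = periods 10-19, post = periods 20-30
pre  <- dt0[time %between% c(10, 19), mean(tau), by = cohort]
post <- dt0[time %between% c(20, 30), mean(tau), by = cohort]
did_2x2 <- (post[cohort == 20, V1] - pre[cohort == 20, V1]) -
           (post[cohort == 10, V1] - pre[cohort == 10, V1])
round(did_2x2, 3)   # 2.35 -- far below cohort 20's true mean effect of 5.5
```

Cohort 10's effect rises from an average of 3.35 (pre window) to 6.5 (post window), and that 3.15 of growth is deducted from cohort 20's true 5.5, leaving 2.35.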

Concept Check

Why does the TWFE estimator produce biased estimates when treatment effects are heterogeneous across cohorts?


Step 3: Goodman-Bacon Decomposition

The Goodman-Bacon (2021) decomposition shows exactly which 2x2 DiD comparisons make up the TWFE estimate and what weight each receives.

# Using the bacondecomp package for the Goodman-Bacon decomposition
# install.packages("bacondecomp")
library(bacondecomp)

# Prepare data: need unit, time, treatment, outcome
dt_agg <- as.data.frame(dt[, .(unit, time, treated, y)])

# Bacon decomposition
bacon_out <- bacon(y ~ treated, data = dt_agg,
                 id_var = "unit", time_var = "time")
print(bacon_out)

# Weighted average
cat("\nWeighted avg:", round(sum(bacon_out$estimate * bacon_out$weight), 3))
cat("\nTrue ATT:", round(true_att, 3), "\n")
Requires: bacondecomp

Expected output — Decomposition (treated-vs-never-treated comparisons; weight shares among these three, assuming equal cohort sizes):

Comparison                    DiD Estimate   Weight (share)
Cohort 10 vs Never-treated    ~5.0           ~0.44
Cohort 20 vs Never-treated    ~5.5           ~0.49
Cohort 30 vs Never-treated    ~1.0           ~0.07

Cohort 30 receives little weight because Goodman-Bacon weights scale with the variance of the treatment dummy, D̄(1 − D̄), and cohort 30 is treated for only one of 30 periods.

The decomposition reveals that each treated-vs-never-treated comparison recovers something close to the cohort-specific ATT. However, the full TWFE also includes treated-vs-treated comparisons (not shown in the simplified version above) where already-treated units serve as controls. When the treatment effect for early cohorts has grown large, subtracting it off introduces negative bias.
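By contrast with the forbidden comparisons, a treated-vs-never-treated 2x2 computed by hand (again on a noiseless version of the DGP) recovers the cohort's own average effect over its treated periods:

```r
library(data.table)

# Cohort 10 vs. never-treated, noiseless DGP (FEs and noise difference out)
dt0 <- CJ(cohort = c(10, 0), time = 1:30)
dt0[, tau := fifelse(cohort == 10 & time >= 10, 2 + 0.3 * (time - 10), 0)]

pre  <- dt0[time < 10,  mean(tau), by = cohort]   # pre-period means
post <- dt0[time >= 10, mean(tau), by = cohort]   # post-period means
did_2x2 <- (post[cohort == 10, V1] - pre[cohort == 10, V1]) -
           (post[cohort == 0,  V1] - pre[cohort == 0,  V1])
round(did_2x2, 3)   # 5 -- cohort 10's average effect over its treated periods
```

This is why the treated-vs-never rows of the decomposition land near the cohort-specific ATTs: the never-treated group contributes no treatment-effect dynamics to be subtracted off.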


Step 4: Callaway-Sant'Anna and Sun-Abraham Estimators

Modern heterogeneity-robust estimators avoid comparing already-treated to newly-treated units. The Callaway-Sant'Anna estimator computes group-time ATTs separately for each cohort at each time period, then aggregates. The Sun-Abraham estimator uses an interaction-weighted approach.

# Callaway-Sant'Anna estimator using the did package
# (gname = period of first treatment; 0 flags never-treated units)
cs_out <- att_gt(
  yname = "y",
  tname = "time",
  idname = "unit",
  gname = "cohort",
  data = as.data.frame(dt),
  control_group = "nevertreated"
)

# Aggregate group-time ATTs to an overall ATT
cs_agg <- aggte(cs_out, type = "simple")
summary(cs_agg)
cat("True ATT:", round(true_att, 3), "\n")

# Sun-Abraham via fixest: recode never-treated units to a cohort value
# outside the sample period so sunab() uses them as the reference group
dt[, cohort_sa := fifelse(cohort == 0, 10000, cohort)]
sa_model <- feols(y ~ sunab(cohort_sa, time) | unit + time,
                  data = dt)
summary(sa_model, agg = "att")
cat("True ATT:", round(true_att, 3), "\n")
Requires: did, fixest

Expected output — Estimator comparison:

Estimator            ATT Estimate   True ATT   Bias
Naive TWFE           ~3.20          ~5.05      ~ -1.85
Callaway-Sant'Anna   ~5.05          ~5.05      ≈ 0
Sun-Abraham          ~5.05          ~5.05      ≈ 0

Both heterogeneity-robust estimators recover the true ATT with minimal bias, while the naive TWFE estimate is off by more than a third. The small remaining deviations in the robust estimators reflect sampling variability, not systematic bias.
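To see what att_gt() computes cell by cell, one group-time ATT can be reproduced by hand. The sketch below (noiseless version of the DGP, with an illustrative common time trend added to show that it cancels) computes ATT(g = 20, t = 25) as the long difference from the base period g − 1 = 19, treated cohort minus never-treated:

```r
library(data.table)

# One representative path per group; FEs/noise are omitted, a common trend kept
dt0 <- CJ(cohort = c(20, 0), time = 1:30)
dt0[, delta_t := 0.1 * time]   # illustrative common time trend
dt0[, tau := fifelse(cohort == 20 & time >= 20, 5 + 0.1 * (time - 20), 0)]
dt0[, y := delta_t + tau]

# ATT(g, t) = E[y_t - y_{g-1} | G = g] - E[y_t - y_{g-1} | never treated]
att_20_25 <- (dt0[cohort == 20 & time == 25, y] - dt0[cohort == 20 & time == 19, y]) -
             (dt0[cohort == 0  & time == 25, y] - dt0[cohort == 0  & time == 19, y])
round(att_20_25, 3)   # 5.5 = cohort 20's effect five periods after adoption
```

Every ATT(g, t) cell is built from comparisons like this one — never from already-treated units — which is why aggregating them recovers the true ATT.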

Concept Check

What is the key difference between how TWFE and the Callaway-Sant'Anna estimator handle already-treated units?


Step 5: Compare All Estimators and Dynamic Effects

Bring all results together and estimate dynamic (event-study) treatment effects using the robust estimators.

# Dynamic effects via CS
cs_dyn <- aggte(cs_out, type = "dynamic")
summary(cs_dyn)

# Plot the event study (ggplot2 is needed for labs())
library(ggplot2)
ggdid(cs_dyn) +
  labs(title = "Callaway-Sant'Anna Dynamic Treatment Effects",
       x = "Periods Relative to Treatment",
       y = "ATT")

# Final comparison
cat("\n=== Final Comparison ===\n")
cat("TWFE:", round(coef(twfe)["treated"], 3), "\n")
cat("CS:", round(cs_agg$overall.att, 3), "\n")
cat("True ATT:", round(true_att, 3), "\n")

Expected output — Dynamic effects (selected relative times):

Relative Time   CS Estimate   True ATT(e)
0               ~2.67         2.67
1               ~3.70         3.70
5               ~4.50         4.50
10              ~5.50         5.50
15              ~6.50         6.50
20              ~8.00         8.00

Final estimator comparison:

Estimator            Estimate   Bias      Bias %
Naive TWFE           ~3.20      ~ -1.85   ~ -37%
Callaway-Sant'Anna   ~5.05      ≈ 0       ≈ 0%
Sun-Abraham          ~5.05      ≈ 0       ≈ 0%
True ATT             ~5.05      --        --

The dynamic effects show treatment effects growing with time since treatment, which is a key source of heterogeneity that TWFE fails to accommodate. Note that the cohort composition changes across relative times: cohort 30 contributes only at e = 0 and cohort 20 only through e = 10, so the later points reflect cohort 10 alone. The CS estimator tracks the true dynamic path closely.
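The true event-study path can be computed directly from the DGP by averaging over the cohorts observed at each relative time e (the cohorts are the same size, so a simple average matches the size-weighted dynamic aggregation):

```r
# True ATT(e): average tau across cohorts still observed at relative time e
true_path <- sapply(0:20, function(e) {
  taus <- c(if (e <= 20) 2 + 0.3 * e,   # cohort 10: observed for e = 0..20
            if (e <= 10) 5 + 0.1 * e,   # cohort 20: observed for e = 0..10
            if (e == 0) 1.0)            # cohort 30: observed only at e = 0
  mean(taus)
})
round(true_path[c(1, 2, 6, 11, 16, 21)], 2)   # e = 0, 1, 5, 10, 15, 20
# 2.67 3.70 4.50 5.50 6.50 8.00
```

The jumps in the path when a cohort drops out (after e = 0 and e = 10) are compositional, not causal — a reminder to interpret long-horizon event-study points with care.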

Concept Check

Baker et al. (2022) recommend reporting results from multiple robust estimators rather than relying on a single approach. Why might different robust estimators give slightly different results even when all are consistent?


Summary

The replication of Baker et al. (2022) yields several key takeaways:

  1. TWFE bias is substantial. When treatment effects differ across cohorts or evolve over time, naive TWFE can underestimate (or even reverse the sign of) the true ATT by 30% or more.

  2. The source of bias is forbidden comparisons. TWFE implicitly uses already-treated units as controls, and the Goodman-Bacon decomposition reveals the resulting negative weights.

  3. Robust estimators fix the problem. Both Callaway-Sant'Anna and Sun-Abraham recover the true ATT by restricting comparisons to appropriate control groups and aggregating properly.

  4. Report multiple estimators. Baker et al. recommend running several robust estimators as a robustness check. When all approaches agree, confidence in the findings is strengthened.

  5. Dynamic effects matter. Event-study-style reporting reveals treatment effect dynamics that a single TWFE coefficient obscures.


Extension Exercises

  1. Vary the degree of heterogeneity. Set all cohort effects equal (homogeneous case) and verify that TWFE recovers the correct ATT. Gradually increase heterogeneity and plot the TWFE bias as a function of the degree of heterogeneity.

  2. Not-yet-treated controls. Re-run the Callaway-Sant'Anna estimator using not-yet-treated units as controls instead of never-treated units. Compare the estimates and discuss when each choice is appropriate.

  3. de Chaisemartin-D'Haultfoeuille estimator. Implement the estimator from de Chaisemartin and D'Haultfoeuille (2020) using the DIDmultiplegt package (R), the did_multiplegt command (Stata), or a manual implementation. Compare with the CS and SA estimates.

  4. Treatment effect reversal. Modify the DGP so that cohort 10 has a negative treatment effect while cohorts 20 and 30 have positive effects. Show that TWFE can yield a positive estimate even when the true ATT is negative.

  5. Pre-testing. Add a pre-treatment "placebo" effect for one cohort and verify that the CS estimator's pre-trend test detects the violation.

  6. Imputation estimator. Implement the Borusyak et al. (2024) imputation estimator and compare with the other robust methods.

  7. Stacking estimator. Implement the Cengiz et al. (2019) stacking approach: create separate DiD datasets for each cohort and stack them with cohort-specific fixed effects. Compare the stacking estimate with CS and SA.

  8. Real-world application. Apply all five estimators (TWFE, CS, SA, dCDH, imputation) to a real staggered adoption dataset (e.g., state-level policy adoption) and discuss which conclusions change when switching from TWFE to robust estimators.
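As a starting point for Extension 1, here is a sketch (the helper sim_bias and the heterogeneity scale h are illustrative choices, not from the paper): scale the cohort-level and dynamic heterogeneity by a factor h, with h = 0 the homogeneous case, and record the TWFE bias at each h:

```r
library(data.table)
library(fixest)

set.seed(2022)

# TWFE bias for one simulated panel; h scales the heterogeneity
sim_bias <- function(h, N = 400, TT = 30) {
  cohort_assign <- sample(rep(c(10, 20, 30, 0), each = N / 4))
  dt <- CJ(unit = 1:N, time = 1:TT)
  dt[, cohort := cohort_assign[unit]]
  dt[, treated := as.integer(cohort > 0 & time >= cohort)]
  dt[, ts := fifelse(treated == 1, time - cohort, 0)]
  # baseline effect of 2 everywhere; h scales the dynamic slope and the
  # cohort-20 level shift (h = 0 gives fully homogeneous effects)
  dt[, tau := fifelse(treated == 1,
                      2 + h * (0.3 * ts + fifelse(cohort == 20, 3, 0)), 0)]
  dt[, y := rnorm(N)[unit] + tau + rnorm(.N)]   # unit FE + noise
  twfe <- feols(y ~ treated | unit + time, data = dt)
  coef(twfe)["treated"] - dt[treated == 1, mean(tau)]   # estimate - true ATT
}

bias_curve <- sapply(c(0, 0.25, 0.5, 0.75, 1), sim_bias)
round(bias_curve, 2)   # bias is ~0 at h = 0 and grows as heterogeneity rises
```

Plotting bias_curve against h (and averaging over repeated draws to smooth out noise) traces how TWFE bias emerges as treatment effects become heterogeneous.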