MethodAtlas
Replication · 120 minutes

Replication Lab: How Much Should We Trust Staggered DiD?

Replicate the key findings from Baker et al. (2022) on the pitfalls of staggered difference-in-differences. Simulate a staggered adoption panel, run naive TWFE, show the Goodman-Bacon decomposition, and estimate treatment effects using Callaway-Sant'Anna and Sun-Abraham estimators.

Overview

In this replication lab, you will explore the key insights from one of the most influential recent methodological papers in applied economics:

Baker, Andrew C., David F. Larcker, and Charles C.Y. Wang. 2022. "How Much Should We Trust Staggered Difference-in-Differences Estimates?" Journal of Financial Economics 144(2): 370–395.

Baker et al. demonstrate that the standard two-way fixed effects (TWFE) estimator can produce severely biased estimates when treatment effects are heterogeneous across cohorts or over time. Using Monte Carlo simulations calibrated to common research designs in finance, the paper shows that TWFE can yield the wrong sign, wrong magnitude, or misleading inference. The paper then benchmarks several newly proposed estimators — including Callaway and Sant'Anna (2021), Sun and Abraham (2021), and de Chaisemartin and D'Haultfoeuille (2020) — that remain valid under heterogeneity.

Why the Baker et al. paper matters: It provided the first comprehensive comparison of heterogeneity-robust DiD estimators in a single framework, demonstrating that the choice of estimator can flip the sign of estimated effects. The paper helped drive widespread adoption of robust estimators in applied research.

What you will do:

  • Simulate a staggered adoption panel with heterogeneous treatment effects
  • Estimate treatment effects using naive TWFE and observe the bias
  • Decompose the TWFE estimator using the Goodman-Bacon decomposition
  • Estimate treatment effects using the Callaway-Sant'Anna estimator
  • Estimate treatment effects using the Sun-Abraham interaction-weighted estimator
  • Compare all estimators against the known true treatment effect

Step 1: Simulate the Staggered Adoption Panel

The DGP features a balanced panel of 1,000 units observed over 30 periods. Units adopt treatment in one of three cohorts (periods 10, 20, and 30) or never adopt, and treatment effects differ by cohort and grow with time since treatment.

library(fixest)
library(did)
library(data.table)

set.seed(2022)

N <- 1000; TT <- 30
cohort_assign <- sample(rep(c(10, 20, 30, 0), each = N / 4))

dt <- CJ(unit = 1:N, time = 1:TT)
dt[, cohort := cohort_assign[unit]]
dt[, treated := as.integer(cohort > 0 & time >= cohort)]

# Unit and time FEs
dt[, alpha_i := rnorm(1), by = unit]
dt[, delta_t := rnorm(1, 0, 0.5), by = time]

# Heterogeneous treatment effects: levels differ by cohort, and effects
# grow with time since treatment for cohorts 10 and 20
dt[, time_since := fifelse(treated == 1, time - cohort, 0)]
dt[, tau := fcase(
  cohort == 10 & treated == 1, 2 + 0.3 * time_since,
  cohort == 20 & treated == 1, 5 + 0.1 * time_since,
  cohort == 30 & treated == 1, 1.0,
  default = 0
)]

dt[, y := alpha_i + delta_t + tau * treated + rnorm(.N)]

true_att <- dt[treated == 1, mean(tau)]
cat("Panel:", N, "units x", TT, "periods =", nrow(dt), "obs\n")
counts <- dt[time == 1, .N, keyby = cohort]
cat(sprintf("Cohort %s: %d units\n", counts$cohort, counts$N))
cat("True ATT (among treated obs):", round(true_att, 3), "\n")

Expected output:

Panel: 1000 units x 30 periods = 30000 obs
Cohort 0: 250 units
Cohort 10: 250 units
Cohort 20: 250 units
Cohort 30: 250 units
True ATT (among treated obs): 5.045

The true ATT depends on the mix of cohort-specific effects and the time elapsed since treatment. Cohort 20 has the largest instantaneous effect (5.0), cohort 10 has the largest cumulative effect (its effect grows to 8.0 after 20 treated periods), and cohort 30 has only one treated period with a small effect of 1.0. Averaging over all treated observations gives a true ATT of about 5.05.
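Because tau is deterministic and cohort sizes are exact (250 units each), the true ATT can be verified analytically, without any simulation — a quick sanity check:

```r
# Analytic true ATT implied by the DGP: average tau over treated observations.
ts10 <- 0:20                  # time since treatment, cohort 10 (periods 10-30)
ts20 <- 0:10                  # cohort 20 (periods 20-30)
tau10 <- 2 + 0.3 * ts10
tau20 <- 5 + 0.1 * ts20
tau30 <- 1.0                  # cohort 30: a single treated period
n_units <- 250                # units per cohort

total_tau     <- n_units * (sum(tau10) + sum(tau20) + sum(tau30))
n_treated_obs <- n_units * (length(ts10) + length(ts20) + 1)
true_att      <- total_tau / n_treated_obs
round(true_att, 3)            # 5.045
```

Cohort 10 contributes the most treated observations (21 periods), which is why the overall ATT sits closer to its mean effect of 5.0 than to cohort 30's effect of 1.0.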


Step 2: Run Naive TWFE and Observe the Bias

The standard TWFE regression includes unit and time fixed effects with a single treatment indicator. Under homogeneous treatment effects, TWFE recovers the ATT. Under heterogeneity, it does not.

# Naive TWFE
twfe <- feols(y ~ treated | unit + time, data = dt)
summary(twfe)

cat("\nTrue ATT:", round(true_att, 3), "\n")
cat("TWFE estimate:", round(coef(twfe)["treated"], 3), "\n")
cat("Bias:", round(coef(twfe)["treated"] - true_att, 3), "\n")

Expected output — TWFE vs. True ATT:

Estimator      Coefficient   SE      True ATT   Bias      Bias (%)
Naive TWFE     ~3.20         ~0.06   ~5.05      ~ -1.85   ~ -37%

The TWFE estimator substantially underestimates the true ATT. The bias arises because TWFE uses already-treated units (whose treatment effects have been growing) as implicit controls for newly-treated units. The resulting "forbidden comparisons" produce negative weights on some treatment effects, pulling the estimate downward.
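A forbidden comparison can be made concrete with a minimal hand-computed sketch on a noiseless version of the DGP: a 2x2 DiD that uses already-treated cohort 10 as the "control" for cohort 20. Because cohort 10's effect keeps growing over the window, its growth is subtracted off, pulling the estimate far below cohort 20's true mean effect of 5.5 over its treated periods:

```r
library(data.table)

# Noiseless tau paths for cohorts 10 and 20 (same DGP, no FEs or noise needed:
# they difference out of the 2x2)
dt0 <- CJ(cohort = c(10, 20), time = 1:30)
dt0[, tau := 0]
dt0[cohort == 10 & time >= 10, tau := 2 + 0.3 * (time - 10)]
dt0[cohort == 20 & time >= 20, tau := 5 + 0.1 * (time - 20)]

# 2x2 window: cohort 20 is "treated", already-treated cohort 10 is the control;
# pre = periods 10-19, post = periods 20-30
pre  <- dt0[time %between% c(10, 19), mean(tau), by = cohort]
post <- dt0[time %between% c(20, 30), mean(tau), by = cohort]
did_2x2 <- (post[cohort == 20, V1] - pre[cohort == 20, V1]) -
           (post[cohort == 10, V1] - pre[cohort == 10, V1])
round(did_2x2, 3)   # 2.35 -- far below cohort 20's true mean effect of 5.5
```

Cohort 10's effect rises from an average of 3.35 (pre window) to 6.5 (post window), and that 3.15 of growth is deducted from cohort 20's true 5.5, leaving 2.35.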

Concept Check

Why does the TWFE estimator produce biased estimates when treatment effects are heterogeneous across cohorts?


Step 3: Goodman-Bacon Decomposition

The Goodman-Bacon (2021) decomposition shows exactly which 2x2 DiD comparisons make up the TWFE estimate and what weight each receives.

# Using the bacondecomp package for the Goodman-Bacon decomposition
# install.packages("bacondecomp")
library(bacondecomp)

# Prepare data: need unit, time, treatment, outcome
dt_agg <- as.data.frame(dt[, .(unit, time, treated, y)])

# Bacon decomposition
bacon_out <- bacon(y ~ treated, data = dt_agg,
                 id_var = "unit", time_var = "time")
print(bacon_out)

# Weighted average
cat("\nWeighted avg:", round(sum(bacon_out$estimate * bacon_out$weight), 3))
cat("\nTrue ATT:", round(true_att, 3), "\n")
Requires: bacondecomp

Expected output — Decomposition (treated-vs-never-treated comparisons; weight shares among these three, assuming equal cohort sizes):

Comparison                    DiD Estimate   Weight (share)
Cohort 10 vs Never-treated    ~5.0           ~0.44
Cohort 20 vs Never-treated    ~5.5           ~0.49
Cohort 30 vs Never-treated    ~1.0           ~0.07

Cohort 30 receives little weight because Goodman-Bacon weights scale with the variance of the treatment dummy, D̄(1 − D̄), and cohort 30 is treated for only one of 30 periods.

The decomposition reveals that each treated-vs-never-treated comparison recovers something close to the cohort-specific ATT. However, the full TWFE also includes treated-vs-treated comparisons (not shown in the simplified version above) where already-treated units serve as controls. When the treatment effect for early cohorts has grown large, subtracting it off introduces negative bias.
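By contrast with the forbidden comparisons, a treated-vs-never-treated 2x2 computed by hand (again on a noiseless version of the DGP) recovers the cohort's own average effect over its treated periods:

```r
library(data.table)

# Cohort 10 vs. never-treated, noiseless DGP (FEs and noise difference out)
dt0 <- CJ(cohort = c(10, 0), time = 1:30)
dt0[, tau := fifelse(cohort == 10 & time >= 10, 2 + 0.3 * (time - 10), 0)]

pre  <- dt0[time < 10,  mean(tau), by = cohort]   # pre-period means
post <- dt0[time >= 10, mean(tau), by = cohort]   # post-period means
did_2x2 <- (post[cohort == 10, V1] - pre[cohort == 10, V1]) -
           (post[cohort == 0,  V1] - pre[cohort == 0,  V1])
round(did_2x2, 3)   # 5 -- cohort 10's average effect over its treated periods
```

This is why the treated-vs-never rows of the decomposition land near the cohort-specific ATTs: the never-treated group contributes no treatment-effect dynamics to be subtracted off.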


Step 4: Callaway-Sant'Anna and Sun-Abraham Estimators

Modern heterogeneity-robust estimators avoid comparing already-treated to newly-treated units. The Callaway-Sant'Anna estimator computes group-time ATTs separately for each cohort at each time period, then aggregates. The Sun-Abraham estimator uses an interaction-weighted approach.

# Callaway-Sant'Anna estimator using the did package
# (gname = period of first treatment; 0 flags never-treated units)
cs_out <- att_gt(
  yname = "y",
  tname = "time",
  idname = "unit",
  gname = "cohort",
  data = as.data.frame(dt),
  control_group = "nevertreated"
)

# Aggregate group-time ATTs to an overall ATT
cs_agg <- aggte(cs_out, type = "simple")
summary(cs_agg)
cat("True ATT:", round(true_att, 3), "\n")

# Sun-Abraham via fixest: recode never-treated units to a cohort value
# outside the sample period so sunab() uses them as the reference group
dt[, cohort_sa := fifelse(cohort == 0, 10000, cohort)]
sa_model <- feols(y ~ sunab(cohort_sa, time) | unit + time,
                  data = dt)
summary(sa_model, agg = "att")
cat("True ATT:", round(true_att, 3), "\n")
Requires: did, fixest

Expected output — Estimator comparison:

Estimator            ATT Estimate   True ATT   Bias
Naive TWFE           ~3.20          ~5.05      ~ -1.85
Callaway-Sant'Anna   ~5.05          ~5.05      ≈ 0
Sun-Abraham          ~5.05          ~5.05      ≈ 0

Both heterogeneity-robust estimators recover the true ATT with minimal bias, while the naive TWFE estimate is off by more than a third. The small remaining deviations in the robust estimators reflect sampling variability, not systematic bias.
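To see what att_gt() computes cell by cell, one group-time ATT can be reproduced by hand. The sketch below (noiseless version of the DGP, with an illustrative common time trend added to show that it cancels) computes ATT(g = 20, t = 25) as the long difference from the base period g − 1 = 19, treated cohort minus never-treated:

```r
library(data.table)

# One representative path per group; FEs/noise are omitted, a common trend kept
dt0 <- CJ(cohort = c(20, 0), time = 1:30)
dt0[, delta_t := 0.1 * time]   # illustrative common time trend
dt0[, tau := fifelse(cohort == 20 & time >= 20, 5 + 0.1 * (time - 20), 0)]
dt0[, y := delta_t + tau]

# ATT(g, t) = E[y_t - y_{g-1} | G = g] - E[y_t - y_{g-1} | never treated]
att_20_25 <- (dt0[cohort == 20 & time == 25, y] - dt0[cohort == 20 & time == 19, y]) -
             (dt0[cohort == 0  & time == 25, y] - dt0[cohort == 0  & time == 19, y])
round(att_20_25, 3)   # 5.5 = cohort 20's effect five periods after adoption
```

Every ATT(g, t) cell is built from comparisons like this one — never from already-treated units — which is why aggregating them recovers the true ATT.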

Concept Check

What is the key difference between how TWFE and the Callaway-Sant'Anna estimator handle already-treated units?


Step 5: Compare All Estimators and Dynamic Effects

Bring all results together and estimate dynamic (event-study) treatment effects using the robust estimators.

# Dynamic effects via CS
cs_dyn <- aggte(cs_out, type = "dynamic")
summary(cs_dyn)

# Plot the event study (ggplot2 is needed for labs())
library(ggplot2)
ggdid(cs_dyn) +
  labs(title = "Callaway-Sant'Anna Dynamic Treatment Effects",
       x = "Periods Relative to Treatment",
       y = "ATT")

# Final comparison
cat("\n=== Final Comparison ===\n")
cat("TWFE:", round(coef(twfe)["treated"], 3), "\n")
cat("CS:", round(cs_agg$overall.att, 3), "\n")
cat("True ATT:", round(true_att, 3), "\n")

Expected output — Dynamic effects (selected relative times):

Relative Time   CS Estimate   True ATT(e)
0               ~2.67         2.67
1               ~3.70         3.70
5               ~4.50         4.50
10              ~5.50         5.50
15              ~6.50         6.50
20              ~8.00         8.00

Final estimator comparison:

Estimator            Estimate   Bias      Bias %
Naive TWFE           ~3.20      ~ -1.85   ~ -37%
Callaway-Sant'Anna   ~5.05      ≈ 0       ≈ 0%
Sun-Abraham          ~5.05      ≈ 0       ≈ 0%
True ATT             ~5.05      --        --

The dynamic effects show treatment effects growing with time since treatment, which is a key source of heterogeneity that TWFE fails to accommodate. Note that the cohort composition changes across relative times: cohort 30 contributes only at e = 0 and cohort 20 only through e = 10, so the later points reflect cohort 10 alone. The CS estimator tracks the true dynamic path closely.
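The true event-study path can be computed directly from the DGP by averaging over the cohorts observed at each relative time e (the cohorts are the same size, so a simple average matches the size-weighted dynamic aggregation):

```r
# True ATT(e): average tau across cohorts still observed at relative time e
true_path <- sapply(0:20, function(e) {
  taus <- c(if (e <= 20) 2 + 0.3 * e,   # cohort 10: observed for e = 0..20
            if (e <= 10) 5 + 0.1 * e,   # cohort 20: observed for e = 0..10
            if (e == 0) 1.0)            # cohort 30: observed only at e = 0
  mean(taus)
})
round(true_path[c(1, 2, 6, 11, 16, 21)], 2)   # e = 0, 1, 5, 10, 15, 20
# 2.67 3.70 4.50 5.50 6.50 8.00
```

The jumps in the path when a cohort drops out (after e = 0 and e = 10) are compositional, not causal — a reminder to interpret long-horizon event-study points with care.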

Concept Check

Baker et al. (2022) recommend reporting results from multiple robust estimators rather than relying on a single approach. Why might different robust estimators give slightly different results even when all are consistent?


Summary

The replication of Baker et al. (2022) yields several key takeaways:

  1. TWFE bias is substantial. When treatment effects differ across cohorts or evolve over time, naive TWFE can underestimate (or even reverse the sign of) the true ATT by 30% or more.

  2. The source of bias is forbidden comparisons. TWFE implicitly uses already-treated units as controls, and the Goodman-Bacon decomposition reveals the resulting negative weights.

  3. Robust estimators fix the problem. Both Callaway-Sant'Anna and Sun-Abraham recover the true ATT by restricting comparisons to appropriate control groups and aggregating properly.

  4. Report multiple estimators. Baker et al. recommend running several robust estimators as a robustness check. When all approaches agree, confidence in the findings is strengthened.

  5. Dynamic effects matter. Event-study-style reporting reveals treatment effect dynamics that a single TWFE coefficient obscures.


Extension Exercises

  1. Vary the degree of heterogeneity. Set all cohort effects equal (homogeneous case) and verify that TWFE recovers the correct ATT. Gradually increase heterogeneity and plot the TWFE bias as a function of the degree of heterogeneity.

  2. Not-yet-treated controls. Re-run the Callaway-Sant'Anna estimator using not-yet-treated units as controls instead of never-treated units. Compare the estimates and discuss when each choice is appropriate.

  3. de Chaisemartin-D'Haultfoeuille estimator. Implement the estimator from de Chaisemartin and D'Haultfoeuille (2020) using the DIDmultiplegt package (R), the did_multiplegt command (Stata), or a manual implementation. Compare with the CS and SA estimates.

  4. Treatment effect reversal. Modify the DGP so that cohort 10 has a negative treatment effect while cohorts 20 and 30 have positive effects. Show that TWFE can yield a positive estimate even when the true ATT is negative.

  5. Pre-testing. Add a pre-treatment "placebo" effect for one cohort and verify that the CS estimator's pre-trend test detects the violation.

  6. Imputation estimator. Implement the Borusyak et al. (2024) imputation estimator and compare with the other robust methods.

  7. Stacking estimator. Implement the Cengiz et al. (2019) stacking approach: create separate DiD datasets for each cohort and stack them with cohort-specific fixed effects. Compare the stacking estimate with CS and SA.

  8. Real-world application. Apply all five estimators (TWFE, CS, SA, dCDH, imputation) to a real staggered adoption dataset (e.g., state-level policy adoption) and discuss which conclusions change when switching from TWFE to robust estimators.
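As a starting point for Extension 1, here is a sketch (the helper sim_bias and the heterogeneity scale h are illustrative choices, not from the paper): scale the cohort-level and dynamic heterogeneity by a factor h, with h = 0 the homogeneous case, and record the TWFE bias at each h:

```r
library(data.table)
library(fixest)

set.seed(2022)

# TWFE bias for one simulated panel; h scales the heterogeneity
sim_bias <- function(h, N = 400, TT = 30) {
  cohort_assign <- sample(rep(c(10, 20, 30, 0), each = N / 4))
  dt <- CJ(unit = 1:N, time = 1:TT)
  dt[, cohort := cohort_assign[unit]]
  dt[, treated := as.integer(cohort > 0 & time >= cohort)]
  dt[, ts := fifelse(treated == 1, time - cohort, 0)]
  # baseline effect of 2 everywhere; h scales the dynamic slope and the
  # cohort-20 level shift (h = 0 gives fully homogeneous effects)
  dt[, tau := fifelse(treated == 1,
                      2 + h * (0.3 * ts + fifelse(cohort == 20, 3, 0)), 0)]
  dt[, y := rnorm(N)[unit] + tau + rnorm(.N)]   # unit FE + noise
  twfe <- feols(y ~ treated | unit + time, data = dt)
  coef(twfe)["treated"] - dt[treated == 1, mean(tau)]   # estimate - true ATT
}

bias_curve <- sapply(c(0, 0.25, 0.5, 0.75, 1), sim_bias)
round(bias_curve, 2)   # bias is ~0 at h = 0 and grows as heterogeneity rises
```

Plotting bias_curve against h (and averaging over repeated draws to smooth out noise) traces how TWFE bias emerges as treatment effects become heterogeneous.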