MethodAtlas
Method·advanced·10 min read
Design-Based · Modern

Staggered DiD

Under staggered adoption with heterogeneous effects, traditional TWFE can produce biased estimates — modern estimators correct for this.

When to Use: When treatment is adopted at different times by different units (staggered rollout) and treatment effects may be heterogeneous across cohorts or over time. A recommended first step is to run the Goodman-Bacon decomposition as a diagnostic.
Assumption: Parallel trends holds for each cohort separately; no anticipation effects; irreversibility (units do not switch treatment off once adopted). The not-yet-treated group is the preferred control.
Mistake: Using standard TWFE when treatment effects are heterogeneous across cohorts or over time. Already-treated units then serve as implicit controls, contaminating estimates. Run a Goodman-Bacon decomposition before choosing an estimator.

One-Line Implementation

R: att_gt(yname='y', tname='year', idname='unit', gname='first_treat', data=df, control_group='notyettreated')
Stata: csdid y x1 x2, ivar(unit) time(year) gvar(first_treat) method(dripw)
Python: # No standard Python package; call R's did via rpy2 or use the differences package


Motivating Example: Staggered Anti-Discrimination Laws

Imagine you are studying the effect of state-level anti-discrimination laws on hiring outcomes for minority workers. These laws were not all passed at once: different states adopted them at different times between 1970 and 2000. This pattern is called staggered adoption, and it is very common in policy evaluation. The canonical 2x2 DiD framework handles a single treatment date cleanly, but staggered rollout introduces new complications.

You might naturally reach for a regression:

$$Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}$$

where $D_{it} = 1$ once state $i$ has adopted the law. This specification seems perfectly reasonable. You have unit fixed effects to absorb time-invariant state differences, year fixed effects to absorb common shocks, and $\delta$ captures the treatment effect.

But here is the surprising result that has reshaped applied econometrics since roughly 2019: when treatment effects are heterogeneous across cohorts or evolve dynamically over time, the TWFE regression can give you a negative estimate even when the true treatment effect is positive for every single unit (Goodman-Bacon, 2021; de Chaisemartin & D'Haultfoeuille, 2020).

This result is not a typo. It is a consequence of how TWFE constructs comparisons under staggered timing — specifically, the use of already-treated units as controls, whose changing outcomes contaminate the estimate. Understanding this problem is now important for applied researchers working with staggered adoption designs.


A. Overview

The core issue is the following: when treatment rolls out at different times, TWFE does not just compare treated units to untreated units. It also makes "forbidden comparisons," which compare later-treated units to already-treated units. When already-treated units serve as controls, their treatment effects contaminate the estimate.

If treatment effects are homogeneous (the same for every unit at every point in time), the weighting does not matter. But if effects differ across cohorts or evolve over time — which is common in practice — TWFE can assign negative weights to some treatment effects, producing an estimate that is a distorted average of the true effects. Visualizing these dynamics through an event study is essential for diagnosing the problem.
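To see where negative weights come from, the sketch below works through a deliberately tiny balanced panel: one early adopter and one late adopter observed for four periods, with no never-treated unit. It relies on the de Chaisemartin and D'Haultfoeuille (2020) result that, in a balanced panel, the weight TWFE puts on each treated cell is proportional to the residual of the treatment dummy after removing unit and period means. The adoption dates, effect sizes, and fixed effects are made-up illustrative values, not data from any study.

```python
# Toy balanced panel: 2 units x 4 periods (all numbers are illustrative assumptions).
periods = [0, 1, 2, 3]
first_treat = {"early": 1, "late": 3}           # adoption period per unit
unit_fe = {"early": 0.0, "late": 10.0}          # alpha_i
time_fe = {t: float(t) for t in periods}        # lambda_t (common trend)

# True effects: positive everywhere, growing for the early adopter.
effect = {("early", 1): 1.0, ("early", 2): 2.0, ("early", 3): 5.0, ("late", 3): 1.0}

units = list(first_treat)
D = {(i, t): 1.0 if t >= first_treat[i] else 0.0 for i in units for t in periods}
Y = {(i, t): unit_fe[i] + time_fe[t] + effect.get((i, t), 0.0) * D[(i, t)]
     for i in units for t in periods}

# Residualize D on unit and period fixed effects (double demeaning, valid for a balanced panel).
unit_mean = {i: sum(D[(i, t)] for t in periods) / len(periods) for i in units}
time_mean = {t: sum(D[(i, t)] for i in units) / len(units) for t in periods}
grand_mean = sum(D.values()) / len(D)
resid = {(i, t): D[(i, t)] - unit_mean[i] - time_mean[t] + grand_mean
         for i in units for t in periods}

# Weight on each treated cell: its residual divided by the sum of residuals over treated cells.
treated_cells = [cell for cell in D if D[cell] == 1.0]
denom = sum(resid[cell] for cell in treated_cells)
weights = {cell: resid[cell] / denom for cell in treated_cells}

# TWFE coefficient via Frisch-Waugh-Lovell, and the same number as a weighted sum of effects.
twfe = sum(resid[c] * Y[c] for c in D) / sum(resid[c] ** 2 for c in D)
weighted_sum = sum(weights[c] * effect[c] for c in treated_cells)

for cell in treated_cells:
    print(cell, "effect =", effect[cell], "weight =", weights[cell])
print("TWFE coefficient:", twfe)
print("Weighted sum of true effects:", weighted_sum)
```

In this toy panel the early adopter's last period receives a negative weight, so its large effect is subtracted rather than added, and the TWFE coefficient comes out negative even though every true effect is positive.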

Goodman-Bacon (2021) showed that the TWFE estimator is a weighted average of all possible 2x2 Difference-in-Differences (DiD) comparisons, including the problematic ones where early-treated units serve as controls for later-treated units.

Common Confusions

"Does TWFE always give wrong answers?" No. If treatment effects are truly homogeneous across cohorts and over time, TWFE works fine. Running a sensitivity analysis on the degree of heterogeneity can help you assess how much the bias matters in your specific application. The problem arises specifically when effects are heterogeneous. Unfortunately, heterogeneous effects are common in many applied settings — treatment effects often vary across cohorts, grow or fade over time, or depend on implementation context. And the difficulty is that you usually cannot tell in advance whether the heterogeneity is severe enough to matter.

"Can I just add cohort-by-time interactions?" Adding interactions can help, but it changes what you are estimating and can create its own problems. The modern estimators are specifically designed to handle this complication correctly.

"Do I need to re-do all my old papers?" Not necessarily. It is advisable to run the Goodman-Bacon decomposition on your old estimates to check whether the problematic comparisons drive your results. If the weights are mostly positive and the problematic comparisons are a small share, your original results may be fine.

"Which modern estimator should I use?" There is no single best answer. Callaway and Sant'Anna (2021) is among the most flexible and widely adopted options as of 2024. de Chaisemartin and D'Haultfoeuille (2020) is well-suited when you want a simple overall average effect. Borusyak et al. (2024) offer an imputation-based approach that is efficient and intuitive. We discuss each below.


B. Identification

The Goodman-Bacon Decomposition

Goodman-Bacon (2021) proved that the TWFE estimator $\hat{\delta}^{TWFE}$ can be written as:

$$\hat{\delta}^{TWFE} = \sum_{k} \sum_{l \neq k} w_{kl} \cdot \hat{\delta}_{kl}$$

where $\hat{\delta}_{kl}$ is the simple 2x2 DiD estimate comparing cohort $k$ (treated) to cohort $l$ (control), and $w_{kl}$ are non-negative weights (proportional to group sizes and treatment timing variance) that sum to one. The Goodman-Bacon weights themselves are never negative; the problem is that some $\hat{\delta}_{kl}$ comparisons use already-treated units as controls, which biases those individual estimates. (Separately, de Chaisemartin and D'Haultfoeuille (2020) show a different decomposition where TWFE weights on individual treatment effects can be negative.)

The weights depend on:

  • Group sizes (larger groups get more weight)
  • Variance in treatment timing (more variation = more weight)
  • Whether the comparison is "clean" (treated vs. never-treated) or "contaminated" (treated vs. already-treated)

The contaminated comparisons are the problem. When an already-treated unit's outcomes have been affected by treatment, using it as a control biases the comparison.

Modern Solutions

All modern estimators share a common strategy: avoid the forbidden comparisons. They differ in how they achieve this goal.

Callaway and Sant'Anna (2021) estimate group-time Average Treatment Effect on the Treated (ATT) parameters $ATT(g,t)$, the effect for cohort $g$ at time $t$, using only never-treated or not-yet-treated units as controls. These building blocks are then aggregated into summary measures.

de Chaisemartin and D'Haultfoeuille (2020) estimate a different parameter: the average effect of switching into treatment among the units that switch (the switchers). They show that this parameter is robust to heterogeneous effects.

Key Assumptions

All approaches require:

  1. Parallel trends for each cohort: In the absence of treatment, each cohort would have followed the same trend as its control group.
  2. No anticipation: Treatment has no effect before it is implemented.
  3. Irreversibility (often): Units do not switch treatment off once adopted. (Some estimators, such as de Chaisemartin and D'Haultfoeuille, can accommodate treatment reversals.)
  4. SUTVA (Stable Unit Treatment Value Assumption): No interference between units (one unit's treatment assignment does not affect another unit's outcomes) and no hidden variations of treatment. If an anti-discrimination law in one state causes firms or workers to relocate to other states, those spillovers contaminate the control group and bias the estimated treatment effect.
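Before estimation it helps to confirm that the data actually satisfy the irreversibility condition, and to build the cohort variable that the estimators expect. The sketch below is one way to do this in plain Python. The function name `first_treatment_period` and the toy records are hypothetical, and coding never-treated units as 0 is an assumption about how the cohort column is set up.

```python
def first_treatment_period(history):
    """Return the first treated period for one unit, or 0 if never treated.

    `history` is a list of (period, treated) pairs with treated in {0, 1}.
    Raises ValueError if treatment switches off after adoption, because
    that violates the irreversibility (absorbing treatment) assumption.
    """
    first = None
    for period, treated in sorted(history):
        if treated and first is None:
            first = period            # adoption date
        elif not treated and first is not None:
            raise ValueError(f"treatment reverses in period {period}")
    return 0 if first is None else first


# Hypothetical toy panel: unit -> [(year, D_it), ...]
panel = {
    "A": [(2000, 0), (2001, 1), (2002, 1)],   # adopts in 2001
    "B": [(2000, 0), (2001, 0), (2002, 1)],   # adopts in 2002
    "C": [(2000, 0), (2001, 0), (2002, 0)],   # never treated
}
first_treat = {unit: first_treatment_period(h) for unit, h in panel.items()}
print(first_treat)

# A unit that switches treatment off is flagged instead of silently mis-coded.
try:
    first_treatment_period([(2000, 1), (2001, 0)])
except ValueError as err:
    print("flagged:", err)
```

Raising an error on reversals is a deliberate choice here: estimators that assume absorbing treatment would otherwise treat such a unit as permanently treated from its first adoption date.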

C. Visual Intuition

The Goodman-Bacon decomposition provides an illuminating diagnostic plot. It shows each 2x2 comparison (treated vs. control cohort pair) as a dot, with the x-axis showing the weight and the y-axis showing the 2x2 estimate. The TWFE estimate is the weighted average.

If the dots are clustered together, TWFE is fine. If the "clean" comparisons (treated vs. never-treated) give very different estimates from the "contaminated" comparisons (treated vs. already-treated), and the contaminated comparisons have large weights, you have a problem.
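The arithmetic behind the plot is a plain weighted average, which the short sketch below reproduces. The three comparison types, estimates, and weights are invented for illustration; in practice they would come from the decomposition output.

```python
# Illustrative decomposition output (made-up numbers).
# Fields: label, 2x2 estimate, weight, is_clean
comparisons = [
    ("Treated vs never-treated", 2.0, 0.5, True),
    ("Earlier vs later treated (not-yet-treated control)", 1.5, 0.2, True),
    ("Later vs earlier treated (already-treated control)", -1.0, 0.3, False),
]

total_weight = sum(w for _, _, w, _ in comparisons)
twfe = sum(est * w for _, est, w, _ in comparisons)

# Re-average the clean comparisons only, renormalizing their weights.
clean_weight = sum(w for _, _, w, clean in comparisons if clean)
clean_avg = sum(est * w for _, est, w, clean in comparisons if clean) / clean_weight

print(f"weights sum to {total_weight:.2f}")
print(f"implied TWFE estimate: {twfe:.3f}")
print(f"share of weight on contaminated comparisons: {1 - clean_weight:.2f}")
print(f"weighted average of clean comparisons only: {clean_avg:.3f}")
```

A large gap between the implied TWFE estimate and the clean-only average, combined with a sizable contaminated share, is the pattern the plot is meant to expose.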


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: TWFE takes a weighted average of all possible two-group, two-period DiD estimates. The Goodman-Bacon weights on these 2×2 comparisons are non-negative, but some comparisons use already-treated units as controls, producing contaminated estimates. Under the de Chaisemartin and D'Haultfoeuille (2020) decomposition, the implied weights on individual treatment effects can be negative.

Consider a setting with three groups: Early treated (E), Late treated (L), and Never treated (N). Denote the treatment adoption dates as $G_E < G_L$.

The TWFE regression is:

$$Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + \varepsilon_{it}$$

Goodman-Bacon (2021) shows:

$$\hat{\delta}^{TWFE} = w_{EN} \hat{\delta}_{EN} + w_{LN} \hat{\delta}_{LN} + w_{EL}^{pre} \hat{\delta}_{EL}^{pre} + w_{EL}^{post} \hat{\delta}_{EL}^{post}$$

where:

  • $\hat{\delta}_{EN}$: DiD comparing Early to Never-treated (clean)
  • $\hat{\delta}_{LN}$: DiD comparing Late to Never-treated (clean)
  • $\hat{\delta}_{EL}^{pre}$: DiD comparing Early to Late, before Late is treated (clean)
  • $\hat{\delta}_{EL}^{post}$: DiD comparing Late to Early, after Early is already treated (contaminated)

The last term is problematic. After $G_E$, the Early group's outcomes have been shifted by their treatment effect. If this effect is growing over time (dynamic effects), then $\hat{\delta}_{EL}^{post}$ subtracts the change in the Early group's treatment effect from the Late group's treatment effect.

In extreme cases, if the Early group's effect is growing fast, $\hat{\delta}_{EL}^{post}$ can be negative even if the true effect for Late adopters is positive. When this term gets enough weight, the overall TWFE estimate can be negative.
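A small numerical version of the three-group setting makes the sign flip concrete. The sketch below uses noiseless group-mean outcomes, so each 2x2 DiD can be read off exactly. The adoption dates (G_E = 2, G_L = 4), the linear common trend, the group intercepts, and the effect paths are assumptions chosen for illustration, and the helper `did_2x2` is a hypothetical name.

```python
periods = range(6)                      # t = 0, ..., 5
G_E, G_L = 2, 4                         # assumed adoption dates

def effect_early(t):                    # grows over time: 1, 3, 5, 7
    return 1 + 2 * (t - G_E) if t >= G_E else 0

def effect_late(t):                     # constant effect of +2 once treated
    return 2 if t >= G_L else 0

# Group-mean outcomes: group intercept + common trend t + treatment effect.
Y = {
    "E": {t: 5 + t + effect_early(t) for t in periods},
    "L": {t: 3 + t + effect_late(t) for t in periods},
    "N": {t: 0 + t for t in periods},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def did_2x2(treated, control, pre, post):
    """(post - pre) change for the treated group minus the same change for the control group."""
    def change(group):
        return mean(Y[group][t] for t in post) - mean(Y[group][t] for t in pre)
    return change(treated) - change(control)

d_EN = did_2x2("E", "N", pre=range(0, G_E), post=range(G_E, 6))
d_LN = did_2x2("L", "N", pre=range(0, G_L), post=range(G_L, 6))
# Early vs Late before Late is treated: window t < G_L, Late acts as a not-yet-treated control.
d_EL_pre = did_2x2("E", "L", pre=range(0, G_E), post=range(G_E, G_L))
# Late vs Early after Early is treated: window t >= G_E, Early is an already-treated control.
d_EL_post = did_2x2("L", "E", pre=range(G_E, G_L), post=range(G_L, 6))

print("Early vs Never:", d_EN)
print("Late vs Never:", d_LN)
print("Early vs Late (pre):", d_EL_pre)
print("Late vs Early (post):", d_EL_post)
```

The last comparison is negative although the Late group's true effect is +2, because the Early group's effect rises by 4 over the same window and that rise is differenced out as if it were a common trend.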

Callaway and Sant'Anna (2021) solution: Estimate $ATT(g,t)$ for each group $g$ at each time $t$ using only clean comparisons:

$$ATT(g,t) = E[Y_t - Y_{g-1} \mid G_i = g] - E[Y_t - Y_{g-1} \mid G_i = \infty]$$

where $G_i = \infty$ denotes never-treated units (or not-yet-treated, depending on the specification). These group-time effects can then be aggregated:

$$\hat{\delta}^{CS} = \sum_g \sum_t w_{g,t} \cdot \widehat{ATT}(g,t)$$

with only positive weights.
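The group-time building blocks can be computed by hand on a small noiseless example. The sketch below assumes two treated cohorts and a never-treated group observed as group means, and aggregates with weights proportional to cohort size, which is one common choice rather than the only one. The cohort sizes, effect paths, and the names `att_gt_by_hand` and `cohort_size` are hypothetical.

```python
periods = range(6)                       # t = 0, ..., 5
cohort_size = {2: 30, 4: 10}             # assumed units per treated cohort (key g = first treated period)

def true_effect(g, t):
    """Assumed effect paths: growing for cohort 2, constant for cohort 4."""
    if t < g:
        return 0
    return 1 + 2 * (t - g) if g == 2 else 2

# Group-mean outcomes under a common trend; "never" is the never-treated group.
Y = {g: {t: 5 + t + true_effect(g, t) for t in periods} for g in cohort_size}
Y["never"] = {t: 0 + t for t in periods}

def att_gt_by_hand(g, t):
    """Long difference from the last pre-period g-1, treated cohort minus never-treated."""
    base = g - 1
    return (Y[g][t] - Y[g][base]) - (Y["never"][t] - Y["never"][base])

# All post-treatment cells (t >= g).
att = {(g, t): att_gt_by_hand(g, t) for g in cohort_size for t in periods if t >= g}

# Aggregate with weights proportional to cohort size (non-negative by construction).
total = sum(cohort_size[g] for g, _ in att)
weights = {cell: cohort_size[cell[0]] / total for cell in att}
overall = sum(weights[cell] * att[cell] for cell in att)

for cell in sorted(att):
    print(cell, "ATT =", att[cell], "weight =", round(weights[cell], 3))
print("Aggregated ATT:", overall)
```

Because every cell is compared only with the never-treated group, each estimated ATT(g,t) equals the assumed true effect, and the aggregate is a proper average of those effects.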


E. Implementation

# Requires: did, fixest, bacondecomp, ggplot2
library(did)          # did: Callaway & Sant'Anna (2021) estimator
library(fixest)       # fixest: Sun & Abraham (2021) via sunab()
library(bacondecomp)  # bacondecomp: Goodman-Bacon (2021) decomposition
library(ggplot2)

# --- Step 1: Goodman-Bacon decomposition (diagnostic) ---
# Decomposes the TWFE estimate into all 2x2 DiD comparisons
# Reveals whether TWFE uses problematic already-treated units as controls
bacon_out <- bacon(y ~ D, data = df, id_var = "unit_id", time_var = "year")

# Plot: weight on x-axis, estimate on y-axis, colored by comparison type
# If "Later vs Earlier Treated" comparisons (already-treated units as controls)
# have large weight and yield negative estimates, TWFE may produce biased or
# even sign-reversed aggregate estimates
ggplot(bacon_out) + aes(x = weight, y = estimate, color = type) +
  geom_point() + geom_hline(yintercept = 0)

# --- Step 2: Callaway & Sant'Anna estimator ---
# att_gt() estimates group-time ATT(g,t) using only clean comparisons
# gname = column with first treatment period (0 or Inf for never-treated)
# control_group = "notyettreated": uses not-yet-treated units as controls
# (more power, but requires parallel trends relative to not-yet-treated units)
cs_out <- att_gt(
  yname = "y", tname = "year", idname = "unit_id", gname = "first_treat",
  data = df, control_group = "notyettreated"
)
summary(cs_out)
ggdid(cs_out)  # Plot of the group-time ATTs, one panel per cohort

# --- Step 3: Aggregate to overall ATT ---
# aggte() aggregates the group-time estimates into a single summary
# type = "simple": weighted average of all post-treatment ATT(g,t),
# with weights proportional to cohort size
agg_cs <- aggte(cs_out, type = "simple")
summary(agg_cs)
# The aggregated ATT is robust to heterogeneous treatment effects

# --- Step 4: Sun & Abraham via fixest ---
# sunab() implements interaction-weighted estimation within feols()
# Avoids the "forbidden comparisons" that contaminate standard TWFE
# Standard errors are clustered by unit (assumes treatment is assigned at the unit_id level)
est_sa <- feols(y ~ sunab(first_treat, year) | unit_id + year,
                data = df, vcov = ~unit_id)
iplot(est_sa)
# Compare with TWFE: a large gap suggests that treatment effect heterogeneity matters

F. Diagnostics

  1. Run the Goodman-Bacon decomposition first. Before reaching for a modern estimator, decompose your TWFE estimate. If the problematic (already-treated as control) comparisons have small weights, TWFE may be fine.

  2. Compare TWFE to modern estimators. If they give similar results, the heterogeneity bias is small. If they diverge, report the modern estimator.

  3. Check group-time effects. Plot all the $ATT(g,t)$ estimates from Callaway and Sant'Anna (2021). This disaggregation reveals heterogeneity across cohorts and over time.

  4. Test for pre-trends within each cohort. The overall event study might look clean, but individual cohorts might have pre-trends that cancel each other out.

  5. Sensitivity to control group choice. Compare results using never-treated vs. not-yet-treated as the control group. If they diverge substantially, investigate why.
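Point 4 is easy to miss, so here is a minimal illustration of pre-trends that cancel in the pooled event study. It assumes two equally sized cohorts whose untreated outcomes drift in opposite directions relative to a never-treated group. The drift of 0.5 per period, the cohort dates, and all names are hypothetical.

```python
periods = range(6)                       # t = 0, ..., 5
cohort_drift = {3: 0.5, 4: -0.5}         # assumed differential trend per period, keyed by cohort g

# Group-mean untreated outcomes: common trend t plus a cohort-specific drift.
Y = {g: {t: t + drift * t for t in periods} for g, drift in cohort_drift.items()}
Y["never"] = {t: float(t) for t in periods}

def placebo(g, e):
    """Pre-period 'effect' at event time e < -1, using g-1 as the base period."""
    t, base = g + e, g - 1
    return (Y[g][t] - Y[g][base]) - (Y["never"][t] - Y["never"][base])

event_times = [-3, -2]
by_cohort = {g: {e: placebo(g, e) for e in event_times} for g in cohort_drift}

# Pooled event-study coefficient: simple average across the two equally sized cohorts.
pooled = {e: sum(by_cohort[g][e] for g in cohort_drift) / len(cohort_drift)
          for e in event_times}

for g in cohort_drift:
    print("cohort", g, by_cohort[g])
print("pooled", pooled)
```

Each cohort shows a clear pre-trend of opposite sign, yet the pooled coefficients are zero, which is why cohort-level plots are worth inspecting alongside the aggregate event study.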

Interpreting Your Results

TWFE and modern estimator agree: Encouraging, but not conclusive on its own. Agreement suggests the heterogeneity bias may be small in your setting, especially if the Goodman-Bacon decomposition confirms that contaminated comparisons receive little weight. Report both estimators and note the agreement, but do not treat agreement alone as proof that TWFE is unbiased — agreement can also occur when both estimators are affected by a common violation (e.g., a parallel trends failure that affects all cohorts similarly).

TWFE and modern estimator diverge: Report the modern estimator as your main result. Show the Goodman-Bacon decomposition to explain why TWFE differs. This divergence is actually a compelling narrative for your paper — it shows you understand the methodology.

Group-time effects vary substantially: This heterogeneity is often substantively interesting. Why do early adopters have different effects than late adopters? Is it because the policy is different, the context is different, or the treated populations are different?


G. What Can Go Wrong


Negative TWFE Estimate Despite Uniformly Positive Treatment Effects

A researcher studies staggered adoption of right-to-carry (RTC) gun laws across US states from 1980 to 2010 using the Callaway and Sant'Anna (2021) estimator with not-yet-treated states as controls. Treatment effects are allowed to vary by adoption cohort and time since treatment.

Group-time ATTs reveal that early-adopting states (1980s cohort) show violent crime reductions of -8% that grow to -12% over 10 years, while late-adopting states (2000s cohort) show smaller reductions of -3%. The overall ATT aggregated with proper positive weights is -6.2%.


Using Never-Treated as Controls When They Are Systematically Different

A researcher studying staggered Medicaid expansion uses both never-treated and not-yet-treated states as alternative control groups and compares results. They also test for pre-trends separately for each treatment cohort.

Results using not-yet-treated controls show ATT of +4.8 pp in insurance coverage. Results using never-treated controls show ATT of +7.1 pp. The discrepancy arises because the 14 states that never expanded Medicaid are systematically more conservative and had different baseline coverage trends. Cohort-specific pre-trend tests reject parallel trends for the never-treated control group.


Ignoring Treatment Effect Heterogeneity Across Cohorts

A researcher studying the effect of state minimum wage increases on teen employment estimates cohort-specific effects and discovers that states raising minimum wages during recessions (2008-2010 cohort) show employment effects of -2.1%, while states raising during expansions (2014-2016 cohort) show effects of -0.4%.

The researcher reports the heterogeneity and discusses how macroeconomic conditions moderate the employment effect of minimum wages, providing evidence relevant to the policy debate about timing of wage increases.


H. Practice

Concept Check

You are studying the effect of state-level policies adopted between 2005 and 2015. Your TWFE estimate is 0.02 (p = 0.03). The Goodman-Bacon decomposition shows that 60% of the weight comes from comparisons where already-treated states serve as controls, and those comparisons yield estimates near zero. Clean comparisons (treated vs. never-treated) yield estimates around 0.05. What should you conclude?

Concept Check

Why does two-way fixed effects (TWFE) regression produce biased estimates under staggered treatment adoption with heterogeneous effects?

Guided Exercise

Staggered DiD: State Marijuana Legalization and Traffic Fatalities

A public health researcher studies whether legalizing recreational marijuana increases traffic fatalities. Between 2012 and 2020, 14 states legalized recreational marijuana at different times. She has annual traffic fatality rates for all 50 states from 2005 to 2022. She estimates a two-way fixed effects (TWFE) regression and finds a positive but imprecise effect.

Why is standard TWFE potentially biased in this staggered adoption setting?

What does 'heterogeneous treatment effects' mean in this context?

Name one modern estimator that handles staggered adoption correctly and what it does differently from TWFE.

If you find TWFE = +0.8 but Callaway-Sant'Anna = +2.1, what does the difference suggest?

Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the effect of state-level paid family leave mandates on female labor force participation. Six states adopted paid leave between 2004 and 2018. The researcher estimates a TWFE regression: `Y_it = alpha_i + lambda_t + delta * D_it + epsilon_it`, clustered at the state level. They find delta = 0.023 (p = 0.04). They then estimate an event study using the same TWFE specification and report: "The event-study plot shows no pre-trends and a persistent positive effect. The TWFE estimate of 2.3 percentage points is our preferred specification because it is more efficient than the Callaway and Sant'Anna estimator, which yields a noisier estimate of 3.1 percentage points."


Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the staggered rollout of electronic health record (EHR) mandates across 30 hospitals in a health system from 2012 to 2019. They use the Callaway and Sant'Anna estimator with not-yet-treated hospitals as controls. They report: "The aggregate ATT is a 12% reduction in medication errors (p < 0.01). We find no evidence of heterogeneity across cohorts (F-test p = 0.34)." They do not report group-time specific effects or discuss which hospitals adopted early versus late. Their dataset has 5 hospitals that adopted in 2012, 8 in 2014, 10 in 2016, and 7 in 2019.


Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors study the staggered rollout of state-level renewable portfolio standards (RPS) on electricity prices using data from 50 states over 2000-2020. Twenty-nine states adopted RPS between 2002 and 2015. They implement both TWFE and Callaway-Sant'Anna estimators. The TWFE estimate suggests RPS increases electricity prices by 1.4 cents/kWh (p = 0.02). The Callaway-Sant'Anna estimate is 2.8 cents/kWh (p = 0.01). They report the Callaway-Sant'Anna result as their preferred specification.

Key Table

Estimator            ATT (cents/kWh)   SE    p-value
TWFE                 1.4               0.6   0.02
Callaway-Sant'Anna   2.8               1.1   0.01
Sun-Abraham          2.5               0.9   0.01

Goodman-Bacon Decomposition:
  Treated vs Never-treated:    3.1 (weight: 0.35)
  Treated vs Not-yet-treated:  2.4 (weight: 0.25)
  Already-treated vs Later:    -0.6 (weight: 0.40)

Authors' Identification Claim

Parallel trends are supported by flat pre-treatment event-study coefficients in the Callaway-Sant'Anna specification. The large discrepancy between TWFE and modern estimators demonstrates the importance of using heterogeneity-robust methods.


I. Swap-In: When to Use Something Else

  • Canonical 2×2 DiD: When treatment is adopted simultaneously by all treated units — the classic two-group, two-period design avoids the negative-weighting issues of staggered settings.
  • Event studies: When the full time profile of dynamic treatment effects is of primary interest, rather than a single summary parameter.
  • Synthetic DiD: When the parallel trends assumption is suspect and reweighting control units to match the pre-treatment trajectory of treated units improves credibility.
  • Synthetic control: When the number of treated units is very small (one to five) and constructing a data-driven counterfactual is more transparent than assuming parallel trends.


Paper Library

Foundational (6)

Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting Event-Study Designs: Robust and Efficient Estimation.

Review of Economic Studies. DOI: 10.1093/restud/rdae007

Borusyak, Jaravel, and Spiess propose an imputation estimator for staggered DID that first estimates unit and time fixed effects from untreated observations, then imputes the counterfactual outcomes. This approach is efficient, flexible, and avoids the negative weighting problem of TWFE.

Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-Differences with Multiple Time Periods.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2020.12.001

Callaway and Sant'Anna propose group-time average treatment effects (ATT(g,t)) that avoid the problematic comparisons in TWFE. Their framework allows for heterogeneous treatment effects across groups and time and provides aggregation schemes for summary parameters.

de Chaisemartin, C., & D'Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.

American Economic Review. DOI: 10.1257/aer.20181169

De Chaisemartin and D'Haultfoeuille show that the TWFE estimator can assign negative weights to some treatment effects, potentially producing estimates with the wrong sign. They propose an alternative estimator and a decomposition that reveals which group-time effects receive negative weights.

Dube, A., Girardi, D., Jordà, Ò., & Taylor, A. M. (2025). A Local Projections Approach to Difference-in-Differences.

Journal of Applied Econometrics. DOI: 10.1002/jae.70000

Dube and colleagues propose a local projections (LP) approach to difference-in-differences estimation that combines LPs with a flexible 'clean control' condition to define appropriate treated and control units. The LP-DiD estimator subsumes many recent solutions to negative weighting problems, accommodates covariates and nonabsorbing treatments, and is simple to implement.

Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2021.03.014

Goodman-Bacon decomposes the two-way fixed-effects DID estimator into a weighted average of all possible two-group, two-period DID comparisons, revealing that some comparisons use already-treated units as controls. The decomposition clarifies when already-treated units enter as controls and why this can make the estimator difficult to interpret under treatment-effect heterogeneity.

Sun, L., & Abraham, S. (2021). Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2020.09.006

Sun and Abraham show that conventional event-study regression coefficients are contaminated by treatment effect heterogeneity across cohorts and propose an interaction-weighted estimator that recovers clean dynamic treatment effects. This paper is the key reference for event-study plots in staggered settings.

Application (2)

Baker, A. C., Larcker, D. F., & Wang, C. C. Y. (2022). How Much Should We Trust Staggered Difference-in-Differences Estimates?

Journal of Financial Economics. DOI: 10.1016/j.jfineco.2022.01.004

Baker, Larcker, and Wang demonstrate that the staggered DID problems identified in the econometrics literature are empirically relevant in finance research. They re-analyze prominent finance studies and show that results can change substantially when using robust estimators.

Deshpande, M., & Li, Y. (2019). Who Is Screened Out? Application Costs and the Targeting of Disability Programs.

American Economic Journal: Economic Policy. DOI: 10.1257/pol.20180076

Deshpande and Li use staggered closings of Social Security field offices across the United States to estimate the effects of application costs on disability program participation. The staggered timing of office closures provides quasi-experimental variation in application costs, and the paper demonstrates how treatment-timing variation can be leveraged for credible policy evaluation.

Survey (2)

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.

Princeton University Press. DOI: 10.1515/9781400829828

Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.

Roth, J., Sant'Anna, P. H. C., Bilinski, A., & Poe, J. (2023). What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2023.03.008

Roth et al. synthesize the explosion of recent econometric work on DID in this comprehensive survey, covering staggered treatment timing, heterogeneous treatment effects, pre-trends testing, and new estimators. It is the essential starting point for understanding the modern DID literature.

Tags

design-based, panel, staggered-treatment, heterogeneous-effects