When should I use Staggered DiD?

When treatment is adopted at different times by different units (staggered rollout) and treatment effects may be heterogeneous across cohorts or over time. A recommended first step is to run the Goodman-Bacon decomposition as a diagnostic.

What is the key assumption of Staggered DiD?

Parallel trends holds for each cohort separately; no anticipation effects; irreversibility (units do not switch treatment off once adopted). The not-yet-treated group is the preferred control.

What is the most common mistake with Staggered DiD?

Using standard TWFE when treatment effects are heterogeneous across cohorts or over time — already-treated units serve as implicit controls, contaminating estimates. Run a Goodman-Bacon decomposition before choosing an estimator.

Method · advanced · 10 min read

Design-Based · Modern

Staggered DiD

Under staggered adoption with heterogeneous effects, traditional TWFE can produce biased estimates — modern estimators correct for this.

One-Line Implementation

R: att_gt(yname='y', tname='year', idname='unit', gname='first_treat', data=df, control_group='notyettreated')

Stata: csdid y x1 x2, ivar(unit) time(year) gvar(first_treat) method(dripw)

Python: # No standard Python package; use R did via rpy2 or differences

Motivating Example: Staggered Anti-Discrimination Laws

Imagine you are studying the effect of state-level anti-discrimination laws on hiring outcomes for minority workers. These laws were not all passed at once: different states adopted them at different times between 1970 and 2000. This pattern is called staggered adoption, and it is incredibly common in policy evaluation. The canonical 2x2 DiD framework handles a single treatment date cleanly, but staggered rollout introduces new complications.

You might naturally reach for a two-way fixed effects (TWFE) regression:

$$Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \varepsilon_{it}$$

where $D_{it} = 1$ once state $i$ has adopted the law. This specification seems perfectly reasonable. You have unit fixed effects to absorb time-invariant state differences, year fixed effects to absorb common shocks, and $\delta$ captures the treatment effect.

But here is the surprising result that has reshaped applied econometrics since roughly 2019: when treatment effects are heterogeneous across cohorts or evolve dynamically over time, the TWFE regression can give you a negative estimate even when the true treatment effect is positive for every single unit (Goodman-Bacon, 2021; de Chaisemartin & D'Haultfoeuille, 2020).

The sign reversal is not a typo. It is a consequence of how TWFE constructs comparisons under staggered timing — specifically, the use of already-treated units as controls, whose changing outcomes contaminate the estimate. Understanding the problem is now important for applied researchers working with staggered adoption designs.

A. Overview

The core issue is the following: when treatment rolls out at different times, TWFE does not just compare treated units to untreated units. It also makes "forbidden comparisons" that compare later-treated units to already-treated units. When already-treated units serve as controls, their treatment effects contaminate the estimate.

If treatment effects are homogeneous (the same for every unit at every point in time), the weighting does not matter. But if effects differ across cohorts or evolve over time — which is common in practice — TWFE can assign negative weights to some treatment effects, producing an estimate that is a distorted average of the true effects. Visualizing these dynamics through an event study is essential for diagnosing the problem.

Goodman-Bacon (2021) showed that the TWFE estimator is a weighted average of all possible 2x2 Difference-in-Differences (DiD) comparisons, including the problematic ones where early-treated units serve as controls for later-treated units.
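To make the sign reversal concrete, here is a minimal sketch under stated assumptions. It builds a hypothetical two-cohort balanced panel (no noise, no never-treated group) in which every treated observation has a strictly positive effect that grows by one unit per period. It then computes the TWFE coefficient by two-way demeaning the treatment dummy, which is equivalent to the fixed-effects regression on a balanced panel. The cohort names, adoption periods, and effect path are illustrative assumptions, not taken from the cited papers.

```python
# Hypothetical balanced panel: two cohorts, ten periods, no noise.
FIRST_TREAT = {"early": 2, "late": 5}   # assumed adoption periods
PERIODS = range(1, 11)


def true_effect(t, g):
    """Assumed effect path: 1 at adoption, growing by 1 each period."""
    return float(t - g + 1) if t >= g else 0.0


def twfe_coefficient(y, d):
    """TWFE slope on a balanced panel via two-way demeaning of d."""
    units = sorted({u for u, _ in d})
    periods = sorted({t for _, t in d})
    unit_mean = {u: sum(d[u, t] for t in periods) / len(periods) for u in units}
    time_mean = {t: sum(d[u, t] for u in units) / len(units) for t in periods}
    grand_mean = sum(d.values()) / len(d)
    d_tilde = {k: d[k] - unit_mean[k[0]] - time_mean[k[1]] + grand_mean for k in d}
    numerator = sum(d_tilde[k] * y[k] for k in d)
    denominator = sum(v * v for v in d_tilde.values())
    return numerator / denominator


d = {(u, t): float(t >= g) for u, g in FIRST_TREAT.items() for t in PERIODS}
# Outcomes contain only the treatment effect (unit and time effects set to 0).
y = {(u, t): true_effect(t, g) for u, g in FIRST_TREAT.items() for t in PERIODS}

treated_effects = [y[k] for k in d if d[k] == 1.0]
delta_twfe = twfe_coefficient(y, d)

print(f"smallest true effect: {min(treated_effects):.1f}")
print(f"TWFE estimate:        {delta_twfe:.3f}")
```

In this particular setup the smallest true effect is 1, yet the TWFE coefficient works out to -4/7 (about -0.571). The exact number depends entirely on the assumed timing and effect path.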

Common Confusions

"Does TWFE always give wrong answers?" No. If treatment effects are truly homogeneous across cohorts and over time, TWFE works fine. The problem arises specifically when effects are heterogeneous. Unfortunately, heterogeneous effects are common in many applied settings: treatment effects often vary across cohorts, grow or fade over time, or depend on implementation context. The difficulty is that you usually cannot tell in advance whether the heterogeneity is severe enough to matter, so running a sensitivity analysis on the degree of heterogeneity can help you assess how much the bias matters in your specific application.

"Can I just add cohort-by-time interactions?" Adding interactions can help, but it changes what you are estimating and can create its own problems. The modern estimators are specifically designed to handle this complication correctly.

"Do I need to re-do all my old papers?" Not necessarily. It is advisable to run the Goodman-Bacon decomposition on your old estimates to check whether the problematic comparisons drive your results. If the problematic comparisons carry only a small share of the weight, your original results may be fine.

"Which modern estimator should I use?" There is no single best answer. Callaway and Sant'Anna (2021) is among the most flexible and widely adopted options as of 2024. de Chaisemartin and D'Haultfoeuille (2020) is well-suited when you want a simple overall average effect. Borusyak et al. (2024) offer an imputation-based approach that is efficient and intuitive. We discuss the main approaches below.

B. Identification

The Goodman-Bacon Decomposition

Goodman-Bacon (2021) proved that the TWFE estimator $\hat{\delta}^{TWFE}$ can be written as:

$$\hat{\delta}^{TWFE} = \sum_{k} \sum_{l \neq k} w_{kl} \cdot \hat{\delta}_{kl}$$

where $\hat{\delta}_{kl}$ is the simple 2x2 DiD estimate comparing cohort $k$ (treated) to cohort $l$ (control), and $w_{kl}$ are non-negative weights (proportional to group sizes and treatment timing variance) that sum to one. The Goodman-Bacon weights themselves are never negative; the problem is that some $\hat{\delta}_{kl}$ comparisons use already-treated units as controls, which biases those individual estimates. (Separately, de Chaisemartin and D'Haultfoeuille (2020) show a different decomposition where TWFE weights on individual treatment effects can be negative.)

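The weights depend on:

Group sizes (larger groups get more weight)
Variance in treatment timing (more variation = more weight)

Each comparison is also either "clean" (treated vs. never-treated or not-yet-treated) or "contaminated" (treated vs. already-treated). This status does not set the weight, but it determines whether the 2x2 estimate behind that weight can be trusted.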

The contaminated comparisons are the problem. When an already-treated unit's outcomes have been affected by treatment, using it as a control biases the comparison.
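The separate de Chaisemartin and D'Haultfoeuille (2020) result mentioned above can also be illustrated numerically. The sketch below assumes a hypothetical balanced panel with two cohorts and no never-treated group. It computes the weight that TWFE places on each treated cell's effect as that cell's residualized treatment value divided by the sum of residuals over all treated cells. The cohort names and adoption periods are assumptions chosen for illustration.

```python
FIRST_TREAT = {"early": 2, "late": 5}   # hypothetical adoption periods
PERIODS = list(range(1, 11))

d = {(u, t): float(t >= g) for u, g in FIRST_TREAT.items() for t in PERIODS}

# Residual of D after removing unit and period means (balanced panel).
unit_mean = {u: sum(d[u, t] for t in PERIODS) / len(PERIODS) for u in FIRST_TREAT}
time_mean = {t: sum(d[u, t] for u in FIRST_TREAT) / len(FIRST_TREAT) for t in PERIODS}
grand_mean = sum(d.values()) / len(d)
resid = {(u, t): d[u, t] - unit_mean[u] - time_mean[t] + grand_mean for (u, t) in d}

# Weight on each treated cell: its residual divided by the sum over treated cells.
treated = [k for k in d if d[k] == 1.0]
total = sum(resid[k] for k in treated)
weights = {k: resid[k] / total for k in treated}

negative_cells = sorted(k for k, w in weights.items() if w < 0)
print("weights sum to:", round(sum(weights.values()), 10))
print("cells with negative weight:", negative_cells)
```

Under these assumptions the weights sum to one, but the early cohort's cells from period 5 onward (after the late cohort has also adopted) receive negative weight. These are exactly the periods in which the early cohort acts as an already-treated control.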

Modern Solutions

All modern estimators share a common strategy: avoid the forbidden comparisons. They differ in how they achieve this goal.

Callaway and Sant'Anna (2021) estimate group-time Average Treatment Effect on the Treated (ATT) parameters $ATT(g,t)$, the effect for cohort $g$ at time $t$, using only never-treated or not-yet-treated units as controls. These building blocks are then aggregated into summary measures.

de Chaisemartin and D'Haultfoeuille (2020) estimate a different parameter: the average effect of switching into treatment, among the units that switch (the switchers). They show that their estimator of this parameter is robust to heterogeneous effects.

Sun and Abraham (2021) propose an interaction-weighted (IW) event-study estimator that recovers the cohort-average dynamic ATT by interacting event-time dummies with cohort dummies and reweighting by cohort shares. With never-treated units (or, when none exist, the last-treated cohort) as the comparison group, this avoids the contamination that affects standard TWFE event studies.

Key Assumptions

All approaches require:

Parallel trends for each cohort: In the absence of treatment, each cohort would have followed the same trend as its control group.
No anticipation: Treatment has no effect before it is implemented.
Irreversibility (often): Units do not switch treatment off once adopted. (Some estimators, such as de Chaisemartin and D'Haultfoeuille, can accommodate treatment reversals.)
SUTVA (Stable Unit Treatment Value Assumption): No interference between units (one unit's treatment assignment does not affect another unit's outcomes) and no hidden variations of treatment. If an anti-discrimination law in one state causes firms or workers to relocate to other states, those spillovers contaminate the control group and bias the estimated treatment effect.

C. Visual Intuition

The Goodman-Bacon decomposition provides an illuminating diagnostic plot. It shows each 2x2 comparison (treated vs. control cohort pair) as a dot, with the x-axis showing the weight and the y-axis showing the 2x2 estimate. The TWFE estimate is the weighted average.

If the dots are clustered together, TWFE is fine. If the "clean" comparisons (treated vs. never-treated) give very different estimates from the "contaminated" comparisons (treated vs. already-treated), and the contaminated comparisons have large weights, you have a problem.

D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: TWFE takes a weighted average of all possible two-group, two-period DiD estimates. The Goodman-Bacon weights on these 2×2 comparisons are non-negative, but some comparisons use already-treated units as controls, producing contaminated estimates. Under the de Chaisemartin and D'Haultfoeuille (2020) decomposition, the implied weights on individual treatment effects can be negative.

Consider a setting with three groups: Early treated (E), Late treated (L), and Never treated (N). Denote the treatment adoption dates as $G_E < G_L$.

The TWFE regression is:

$$Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + \varepsilon_{it}$$

Goodman-Bacon (2021) shows:

$$\hat{\delta}^{TWFE} = w_{EN} \hat{\delta}_{EN} + w_{LN} \hat{\delta}_{LN} + w_{EL}^{pre} \hat{\delta}_{EL}^{pre} + w_{EL}^{post} \hat{\delta}_{EL}^{post}$$

where:

$\hat{\delta}_{EN}$ : DiD comparing Early to Never-treated (clean)
$\hat{\delta}_{LN}$ : DiD comparing Late to Never-treated (clean)
$\hat{\delta}_{EL}^{pre}$ : DiD comparing Early to Late, before Late is treated (clean)
$\hat{\delta}_{EL}^{post}$ : DiD comparing Late to Early, after Early is already treated (contaminated)

The last term is problematic. After $G_E$, the Early group's outcomes have been shifted by their treatment effect. If this effect is growing over time (dynamic effects), then $\hat{\delta}_{EL}^{post}$ subtracts the change in the Early group's treatment effect from the Late group's treatment effect.

In extreme cases, if the Early group's effect is growing fast, $\hat{\delta}_{EL}^{post}$ can be negative even if the true effect for Late adopters is positive. When this term gets enough weight, the overall TWFE estimate can be negative.
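The two Early-versus-Late terms can be computed by hand in a small example. The sketch below assumes a hypothetical noise-free setting with an early cohort adopting in period 2, a late cohort adopting in period 5, and an effect that starts at 1 and grows by 1 each period. All of these values are illustrative assumptions. It evaluates the clean comparison (Late as a not-yet-treated control) and the contaminated comparison (Early as an already-treated control) as plain 2x2 differences.

```python
FIRST_TREAT = {"early": 2, "late": 5}   # hypothetical adoption periods


def outcome(unit, t):
    """Noise-free outcome: only the (assumed) growing treatment effect."""
    g = FIRST_TREAT[unit]
    return float(t - g + 1) if t >= g else 0.0


def mean_outcome(unit, window):
    return sum(outcome(unit, t) for t in window) / len(window)


def did_2x2(treated, control, before, after):
    """Change for the treated unit minus change for the control unit."""
    return ((mean_outcome(treated, after) - mean_outcome(treated, before))
            - (mean_outcome(control, after) - mean_outcome(control, before)))


PRE = [1]                     # nobody treated
MID = [2, 3, 4]               # only the early cohort treated
POST = [5, 6, 7, 8, 9, 10]    # both cohorts treated

delta_el_pre = did_2x2("early", "late", PRE, MID)     # late is a clean control
delta_el_post = did_2x2("late", "early", MID, POST)   # early is already treated
late_true_att = mean_outcome("late", POST)

print(delta_el_pre, delta_el_post, late_true_att)
```

With these assumed numbers the clean comparison returns 2.0, which is the early cohort's average effect in periods 2 to 4. The contaminated comparison returns -1.0 even though the late cohort's true average effect is 3.5, because the early cohort's effect rose by 4.5 between the two windows and that rise is subtracted.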

Callaway and Sant'Anna (2021) solution: Estimate $ATT(g,t)$ for each group $g$ at each time $t$ using only clean comparisons:

$$ATT(g,t) = E[Y_t - Y_{g-1} \mid G_i = g] - E[Y_t - Y_{g-1} \mid G_i = \infty]$$

where $Y_{g-1}$ is the outcome in the last period before cohort $g$ is treated and $G_i = \infty$ denotes never-treated units (or not-yet-treated, depending on the specification). These group-time effects can then be aggregated:

$$\hat{\delta}^{CS} = \sum_g \sum_t w_{g,t} \cdot \widehat{ATT}(g,t)$$

with only positive weights.
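The group-time formula above can be sketched directly. This is not the `did` package's implementation; it is a minimal illustration on a hypothetical noise-free panel with one unit per cohort, assumed unit effects, a common period effect, and an effect path that starts at 1 and grows by 1 per period. The function name `att_group_time` and all numeric values are assumptions made for the example.

```python
# Hypothetical noise-free panel: one unit per cohort, ten periods.
FIRST_TREAT = {"early": 2, "late": 5, "never": None}   # None = never treated
UNIT_EFFECT = {"early": 1.0, "late": 2.0, "never": 3.0}
PERIODS = range(1, 11)


def outcome(unit, t):
    """Unit effect + common period effect + assumed growing treatment effect."""
    g = FIRST_TREAT[unit]
    effect = float(t - g + 1) if g is not None and t >= g else 0.0
    return UNIT_EFFECT[unit] + 0.5 * t + effect


def mean_change(cohort, t, base):
    """Average of Y_t - Y_base over units whose first-treatment period is `cohort`."""
    units = [u for u, g in FIRST_TREAT.items() if g == cohort]
    return sum(outcome(u, t) - outcome(u, base) for u in units) / len(units)


def att_group_time(g, t):
    """ATT(g,t): change since period g-1 for cohort g minus the never-treated change."""
    return mean_change(g, t, g - 1) - mean_change(None, t, g - 1)


estimates = {(g, t): att_group_time(g, t)
             for g in (2, 5) for t in PERIODS if t >= g}
# Simple average over post-treatment cells (cohorts are equal-sized here).
overall_att = sum(estimates.values()) / len(estimates)

print(estimates[(2, 4)], estimates[(5, 5)], round(overall_att, 3))
```

Because parallel trends holds by construction, each estimate recovers the assumed effect exactly: for instance, `estimates[(2, 4)]` is 3.0 and `estimates[(5, 5)]` is 1.0. The unit and period effects difference out.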

E. Implementation

# Requires: did, fixest, bacondecomp, ggplot2
library(did)          # did: Callaway & Sant'Anna (2021) estimator
library(fixest)       # fixest: Sun & Abraham (2021) via sunab()
library(bacondecomp)  # bacondecomp: Goodman-Bacon (2021) decomposition
library(ggplot2)

# --- Step 1: Goodman-Bacon decomposition (diagnostic) ---
# Decomposes the TWFE estimate into all 2x2 DiD comparisons
# Reveals whether TWFE uses problematic already-treated units as controls
# D is the 0/1 treatment indicator (1 from the first treated period onward)
bacon_out <- bacon(y ~ D, data = df, id_var = "unit_id", time_var = "year")

# Plot: weight on x-axis, estimate on y-axis, colored by comparison type
# If the "Later vs Earlier Treated" comparisons (already-treated units as controls)
# have large weight and yield negative estimates, TWFE may produce biased or
# even sign-reversed aggregate estimates
ggplot(bacon_out) + aes(x = weight, y = estimate, color = type) +
  geom_point() + geom_hline(yintercept = 0)

# --- Step 2: Callaway & Sant'Anna estimator ---
# att_gt() estimates group-time ATT(g,t) using only clean comparisons
# gname = column with first treatment period (0 for never-treated units)
# control_group = "notyettreated": uses not-yet-treated units as controls
# (gives more power but requires parallel trends relative to not-yet-treated
# units and no anticipation)
cs_out <- att_gt(
  yname = "y", tname = "year", idname = "unit_id", gname = "first_treat",
  data = df, control_group = "notyettreated"
)
summary(cs_out)
ggdid(cs_out)  # Plot of the group-time ATTs, one panel per cohort

# --- Step 3: Aggregate to overall ATT ---
# aggte() aggregates the group-time estimates into a single summary
# type = "simple": average of post-treatment ATT(g,t), weighted by group size
agg_cs <- aggte(cs_out, type = "simple")
summary(agg_cs)
# The aggregated ATT is robust to heterogeneous treatment effects

# --- Step 4: Sun & Abraham via fixest ---
# sunab() implements interaction-weighted estimation within feols()
# Avoids the "forbidden comparisons" that contaminate standard TWFE
# Assumption: never-treated units are coded 0 in first_treat (as above).
# sunab() identifies never-treated cohorts by a cohort value later than every
# sample year, so recode them to a far-future value first (check ?sunab)
df$first_treat_sa <- ifelse(df$first_treat == 0, 10000, df$first_treat)
est_sa <- feols(y ~ sunab(first_treat_sa, year) | unit_id + year,
                data = df, vcov = ~unit_id)  # cluster at the unit level
iplot(est_sa)
# Compare with TWFE: if estimates differ, treatment effect heterogeneity may be present

F. Diagnostics

Run the Goodman-Bacon decomposition first. Before reaching for a modern estimator, decompose your TWFE estimate. If the problematic (already-treated as control) comparisons have small weights, TWFE may be fine.
Compare TWFE to modern estimators. If they give similar results, the heterogeneity bias is small. If they diverge, report the modern estimator.
Check group-time effects. Plot all the $ATT(g,t)$ estimates from Callaway and Sant'Anna (2021). This disaggregation reveals heterogeneity across cohorts and over time.
Test for pre-trends within each cohort. The overall event study might look clean, but individual cohorts might have pre-trends that cancel each other out.
Sensitivity to control group choice. Compare results using never-treated vs. not-yet-treated as the control group. If they diverge substantially, investigate why.
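The point about cohort-level pre-trends cancelling out can be shown with a few lines of arithmetic. The sketch below assumes two hypothetical cohorts whose untreated outcomes drift in opposite directions relative to a flat never-treated group. It computes a placebo DiD over a pre-treatment window for each cohort and then pools them. Group names, slopes, and the two-period lead are illustrative assumptions.

```python
# Hypothetical untreated-outcome slopes (change per year absent treatment).
SLOPE = {"cohort_2010": 0.25, "cohort_2014": -0.25, "never": 0.0}
FIRST_TREAT = {"cohort_2010": 2010, "cohort_2014": 2014}


def untreated_outcome(group, year):
    return SLOPE[group] * (year - 2000)


def change(group, start, end):
    return untreated_outcome(group, end) - untreated_outcome(group, start)


def placebo_did(cohort, lead=2):
    """DiD over the pre-treatment window [g-1-lead, g-1]; 0 under parallel trends."""
    g = FIRST_TREAT[cohort]
    start, end = g - 1 - lead, g - 1
    return change(cohort, start, end) - change("never", start, end)


by_cohort = {c: placebo_did(c) for c in FIRST_TREAT}
pooled = sum(by_cohort.values()) / len(by_cohort)   # equal-sized cohorts assumed

print(by_cohort, pooled)
```

Here each cohort shows a placebo estimate of 0.5 in absolute value, yet the pooled average is exactly zero. A pooled event study would look clean while both cohorts violate parallel trends.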

Interpreting Your Results

TWFE and modern estimator agree: Encouraging, but not conclusive on its own. Agreement suggests the heterogeneity bias may be small in your setting, especially if the Goodman-Bacon decomposition confirms that contaminated comparisons receive little weight. Report both estimators and note the agreement, but do not treat agreement alone as proof that TWFE is unbiased — agreement can also occur when both estimators are affected by a common violation (e.g., a parallel trends failure that affects all cohorts similarly).

TWFE and modern estimator diverge: Report the modern estimator as your main result. Show the Goodman-Bacon decomposition to explain why TWFE differs. This divergence is actually a compelling narrative for your paper — it shows you understand the methodology.

Group-time effects vary substantially: This heterogeneity is often substantively interesting. Why do early adopters have different effects than late adopters? Is it because the policy is different, the context is different, or the treated populations are different?

G. What Can Go Wrong

Common Pitfalls

Blindly trusting TWFE. The fact that "everyone used to do it this way" is not a justification. The econometric community has collectively recognized that TWFE can mislead under staggered timing.
Over-correcting. Not every staggered DiD paper needs to be redone. If you have only two groups and two periods (classic 2x2), TWFE is fine. The problem arises with many cohorts and heterogeneous effects.
Ignoring composition effects. Early adopters may differ systematically from late adopters. This concern is not just a statistical issue — it is a substantive one about who selects into early versus late treatment.
Assuming never-treated units exist. Some estimators require a never-treated control group. If everyone is eventually treated, you can only use not-yet-treated units, which requires the no-anticipation assumption.
Reporting only aggregated effects. The richness of the group-time effects $ATT(g,t)$ is often more informative than a single summary number. Show the heterogeneity.
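The pitfall about missing never-treated units has a concrete consequence for which group-time effects can be estimated at all. The sketch below uses a hypothetical helper, `identifiable_cells`, that lists the post-treatment $(g,t)$ cells for which at least one untreated comparison unit exists. It assumes no anticipation, so a cohort counts as not-yet-treated in every period before its own adoption. Cohort dates and the period range are illustrative.

```python
def identifiable_cells(cohorts, periods, has_never_treated):
    """Post-treatment (g, t) cells for which some untreated comparison exists."""
    cells = []
    for g in sorted(cohorts):
        for t in periods:
            if t < g:
                continue  # pre-treatment period for cohort g
            not_yet_treated = [h for h in cohorts if h > t]
            if has_never_treated or not_yet_treated:
                cells.append((g, t))
    return cells


cohorts = [2, 5]          # hypothetical first-treatment periods
periods = range(1, 11)

without_never = identifiable_cells(cohorts, periods, has_never_treated=False)
with_never = identifiable_cells(cohorts, periods, has_never_treated=True)

print(without_never)
print(len(with_never))
```

Without a never-treated group, only the early cohort's effects in periods 2 to 4 remain estimable in this example. Once the last cohort adopts there is no untreated unit left, so the last cohort contributes no estimable effects and long-run effects for the early cohort are lost.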

What Can Go Wrong

Negative TWFE Estimate Despite Uniformly Positive Treatment Effects

A researcher studies staggered adoption of right-to-carry (RTC) gun laws across US states from 1980 to 2010 using the Callaway and Sant'Anna (2021) estimator with not-yet-treated states as controls. Treatment effects are allowed to vary by adoption cohort and time since treatment.

Group-time ATTs reveal that early-adopting states (1980s cohort) show violent crime reductions of 8% that grow to 12% over 10 years, while late-adopting states (2000s cohort) show smaller reductions of 3%. The overall ATT aggregated with proper positive weights is -6.2%.

What Can Go Wrong

Using Never-Treated as Controls When They Are Systematically Different

A researcher studying staggered Medicaid expansion uses both never-treated and not-yet-treated states as alternative control groups and compares results. They also test for pre-trends separately for each treatment cohort.

Results using not-yet-treated controls show ATT of +4.8 pp in insurance coverage. Results using never-treated controls show ATT of +7.1 pp. The discrepancy arises because the 14 states that never expanded Medicaid are systematically more conservative and had different baseline coverage trends. Cohort-specific pre-trend tests reject parallel trends for the never-treated control group.

What Can Go Wrong

Ignoring Treatment Effect Heterogeneity Across Cohorts

A researcher studying the effect of state minimum wage increases on teen employment estimates cohort-specific effects and discovers that states raising minimum wages during recessions (2008-2010 cohort) show employment effects of -2.1%, while states raising during expansions (2014-2016 cohort) show effects of -0.4%.

The researcher reports the heterogeneity and discusses how macroeconomic conditions moderate the employment effect of minimum wages, providing evidence relevant to the policy debate about timing of wage increases.

H. Practice

Concept Check

You are studying the effect of state-level policies adopted between 2005 and 2015. Your TWFE estimate is 0.02 (p = 0.03). The Goodman-Bacon decomposition shows that 60% of the weight comes from comparisons where already-treated states serve as controls, and those comparisons yield estimates near zero. Clean comparisons (treated vs. never-treated) yield estimates around 0.05. What should you conclude?

The TWFE estimate of 0.02 is correct because it is statistically significant
The TWFE estimate is likely biased downward, and a modern estimator would give an estimate closer to 0.05
Drop the already-treated states and re-run TWFE on the remaining sample
The TWFE estimate is fine because unit and time fixed effects fully account for treatment effect heterogeneity across cohorts

Concept Check

Why does two-way fixed effects (TWFE) regression produce biased estimates under staggered treatment adoption with heterogeneous effects?

Because TWFE cannot handle more than two time periods
Because TWFE uses already-treated units as controls, creating 'forbidden comparisons' that bias the estimate
Because TWFE does not include unit fixed effects
Because TWFE requires a balanced panel

Guided Exercise

Staggered DiD: State Marijuana Legalization and Traffic Fatalities

A public health researcher studies whether legalizing recreational marijuana increases traffic fatalities. Between 2012 and 2020, 14 states legalized recreational marijuana at different times. She has annual traffic fatality rates for all 50 states from 2005 to 2022. She estimates a two-way fixed effects (TWFE) regression and finds a positive but imprecise effect.

Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the effect of state-level paid family leave mandates on female labor force participation. Six states adopted paid leave between 2004 and 2018. The researcher estimates a TWFE regression: `Y_it = alpha_i + lambda_t + delta * D_it + epsilon_it`, clustered at the state level. They find delta = 0.023 (p = 0.04). They then estimate an event study using the same TWFE specification and report: "The event-study plot shows no pre-trends and a persistent positive effect. The TWFE estimate of 2.3 percentage points is our preferred specification because it is more efficient than the Callaway and Sant'Anna estimator, which yields a noisier estimate of 3.1 percentage points."

Select all errors you can find:

Using standard TWFE with staggered adoption and likely heterogeneous effects (Estimation method and interpretation)

Preferring TWFE over the modern estimator on efficiency grounds (Specification choice justification)

Claiming 'no pre-trends' from a TWFE event study under staggered timing (Pre-trend analysis)

Error Detective

Read the analysis below carefully and identify the errors.

A researcher studies the staggered rollout of electronic health record (EHR) mandates across 30 hospitals in a health system from 2012 to 2019. They use the Callaway and Sant'Anna estimator with not-yet-treated hospitals as controls. They report: "The aggregate ATT is a 12% reduction in medication errors (p < 0.01). We find no evidence of heterogeneity across cohorts (F-test p = 0.34)." They do not report group-time specific effects or discuss which hospitals adopted early versus late. Their dataset has 5 hospitals that adopted in 2012, 8 in 2014, 10 in 2016, and 7 in 2019.

Select all errors you can find:

The 2019 cohort has no post-treatment data for testing (Sample construction and the 2019 cohort)

Not reporting group-time ATTs despite claiming no heterogeneity (Heterogeneity analysis and reporting)

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors study the staggered rollout of state-level renewable portfolio standards (RPS) on electricity prices using data from 50 states over 2000-2020. Twenty-nine states adopted RPS between 2002 and 2015. They implement both TWFE and Callaway-Sant'Anna estimators. The TWFE estimate suggests RPS increases electricity prices by 1.4 cents/kWh (p = 0.02). The Callaway-Sant'Anna estimate is 2.8 cents/kWh (p = 0.01). They report the Callaway-Sant'Anna result as their preferred specification.

Key Table

Estimator	ATT (cents/kWh)	SE	p-value
TWFE	1.4	0.6	0.02
Callaway-Sant'Anna	2.8	1.1	0.01
Sun-Abraham	2.5	0.9	0.01

Goodman-Bacon Decomposition:
  Treated vs Never-treated:    3.1 (weight: 0.35)
  Treated vs Not-yet-treated:  2.4 (weight: 0.25)
  Already-treated vs Later:    -0.6 (weight: 0.40)
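As a quick arithmetic check on the decomposition above, the sketch below recomputes the weighted average of the three reported components and, separately, the average over only the two comparisons that do not use already-treated controls. The numbers are copied from the table; the variable names are illustrative.

```python
# (2x2 estimate, weight) pairs copied from the decomposition above.
components = {
    "Treated vs Never-treated": (3.1, 0.35),
    "Treated vs Not-yet-treated": (2.4, 0.25),
    "Already-treated vs Later": (-0.6, 0.40),
}

total_weight = sum(w for _, w in components.values())
implied_twfe = sum(est * w for est, w in components.values())

# Re-average using only comparisons without already-treated controls.
clean = {name: ew for name, ew in components.items()
         if not name.startswith("Already-treated")}
clean_weight = sum(w for _, w in clean.values())
clean_average = sum(est * w for est, w in clean.values()) / clean_weight

print(round(total_weight, 2), round(implied_twfe, 3), round(clean_average, 3))
```

The weights sum to one and the implied weighted average is 1.445, which matches the reported TWFE estimate of 1.4 after rounding. The clean-only average is about 2.81, close to the Callaway-Sant'Anna estimate in the table.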

Authors' Identification Claim

Parallel trends are supported by flat pre-treatment event-study coefficients in the Callaway-Sant'Anna specification. The large discrepancy between TWFE and modern estimators demonstrates the importance of using heterogeneity-robust methods.

I. Swap-In: When to Use Something Else

Canonical 2×2 DiD: When treatment is adopted simultaneously by all treated units — the classic two-group, two-period design avoids the negative-weighting issues of staggered settings.
Event studies: When the full time profile of dynamic treatment effects is of primary interest, rather than a single summary parameter.
Synthetic DiD: When the parallel trends assumption is suspect and reweighting control units to match the pre-treatment trajectory of treated units improves credibility.
Synthetic control: When the number of treated units is very small (one to five) and constructing a data-driven counterfactual is more transparent than assuming parallel trends.

J. Reviewer Checklist

Critical Reading Checklist

Does the paper acknowledge that treatment timing is staggered?
Is a Goodman-Bacon decomposition or similar diagnostic reported?
Does the paper use a heterogeneity-robust estimator (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, or similar)?
Are group-time specific effects reported, not just an aggregate number?
Is the control group clearly specified (never-treated vs. not-yet-treated)?
Are pre-trends tested within each cohort, not just overall?
Is the no-anticipation assumption discussed?
Are TWFE results shown alongside modern estimator results for comparison?

Paper Library

Foundational (6)

Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting Event-Study Designs: Robust and Efficient Estimation.

Review of Economic Studies. DOI: 10.1093/restud/rdae007

Borusyak, Jaravel, and Spiess propose an imputation estimator for staggered DID that first estimates unit and time fixed effects from untreated observations, then imputes the counterfactual outcomes. This approach is efficient, flexible, and avoids the negative weighting problem of TWFE.

Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-Differences with Multiple Time Periods.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2020.12.001

Callaway and Sant'Anna propose group-time average treatment effects (ATT(g,t)) that avoid the problematic comparisons in TWFE. Their framework allows for heterogeneous treatment effects across groups and time and provides aggregation schemes for summary parameters.

de Chaisemartin, C., & D'Haultfoeuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.

American Economic Review. DOI: 10.1257/aer.20181169

De Chaisemartin and D'Haultfoeuille show that the TWFE estimator can assign negative weights to some treatment effects, potentially producing estimates with the wrong sign. They propose an alternative estimator and a decomposition that reveals which group-time effects receive negative weights.

Dube, A., Girardi, D., Jordà, Ò., & Taylor, A. M. (2025). A Local Projections Approach to Difference-in-Differences.

Journal of Applied Econometrics. DOI: 10.1002/jae.70000

Dube and colleagues propose a local projections (LP) approach to difference-in-differences estimation that combines LPs with a flexible 'clean control' condition to define appropriate treated and control units. The LP-DiD estimator subsumes many recent solutions to negative weighting problems, accommodates covariates and nonabsorbing treatments, and is simple to implement.

Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2021.03.014

Goodman-Bacon decomposes the two-way fixed-effects DID estimator into a weighted average of all possible two-group, two-period DID comparisons, revealing that some comparisons use already-treated units as controls. The decomposition clarifies when already-treated units enter as controls and why this can make the estimator difficult to interpret under treatment-effect heterogeneity.

Sun, L., & Abraham, S. (2021). Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2020.09.006

Sun and Abraham show that conventional event-study regression coefficients are contaminated by treatment effect heterogeneity across cohorts and propose an interaction-weighted estimator that recovers clean dynamic treatment effects. This paper is the key reference for event-study plots in staggered settings.

Application (2)

Baker, A. C., Larcker, D. F., & Wang, C. C. Y. (2022). How Much Should We Trust Staggered Difference-in-Differences Estimates?

Journal of Financial Economics. DOI: 10.1016/j.jfineco.2022.01.004

Baker, Larcker, and Wang demonstrate that the staggered DID problems identified in the econometrics literature are empirically relevant in finance research. They re-analyze prominent finance studies and show that results can change substantially when using robust estimators.

Deshpande, M., & Li, Y. (2019). Who Is Screened Out? Application Costs and the Targeting of Disability Programs.

American Economic Journal: Economic Policy. DOI: 10.1257/pol.20180076

Deshpande and Li use staggered closings of Social Security field offices across the United States to estimate the effects of application costs on disability program participation. The staggered timing of office closures provides quasi-experimental variation in application costs, and the paper demonstrates how treatment-timing variation can be leveraged for credible policy evaluation.

Survey (2)

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.

Princeton University Press. DOI: 10.1515/9781400829828

Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.

Roth, J., Sant'Anna, P. H. C., Bilinski, A., & Poe, J. (2023). What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.

Journal of Econometrics. DOI: 10.1016/j.jeconom.2023.03.008

Roth et al. synthesize the explosion of recent econometric work on DID in this comprehensive survey, covering staggered treatment timing, heterogeneous treatment effects, pre-trends testing, and new estimators. It is the essential starting point for understanding the modern DID literature.
