MethodAtlas

Diagnostics & Pitfalls

Every causal inference method requires diagnostic checks before, during, and after estimation. This page collects the most important checks and the most common pitfalls in one place.

Pre-Estimation Checks

Balance Tables

Compare covariate means across the treatment and control groups, commonly summarized as standardized mean differences. Large imbalances signal selection bias.

Applies to: Matching, DiD, Experiments
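A minimal sketch of a balance check on simulated data. The standardized mean difference (SMD) divides the raw mean difference by the pooled standard deviation; the 0.1 cutoff used below is a common rule of thumb, not a formal test. All data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one covariate, imbalanced by construction (hypothetical example).
x_treat = rng.normal(0.5, 1.0, 200)   # treated group covariate
x_ctrl = rng.normal(0.0, 1.0, 200)    # control group covariate

def std_mean_diff(a, b):
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

smd = std_mean_diff(x_treat, x_ctrl)
# A common rule of thumb flags |SMD| > 0.1 as meaningful imbalance.
imbalanced = abs(smd) > 0.1
```

In practice this is computed for every covariate and reported as a table, often before and after matching or weighting.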

Common Support / Overlap

Check that treated and control units share similar covariate distributions. Without overlap, extrapolation is unreliable.

Applies to: Matching, Doubly Robust
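A crude overlap check on a single simulated covariate: find the range where both groups have support and measure how many treated units fall outside it. In practice this is usually done on estimated propensity scores rather than raw covariates; this is only an illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariate with partially shifted supports.
x_treat = rng.uniform(0.0, 1.0, 300)
x_ctrl = rng.uniform(-0.5, 0.7, 300)

# Common support on this covariate: the overlap of the two observed ranges.
lo = max(x_treat.min(), x_ctrl.min())
hi = min(x_treat.max(), x_ctrl.max())

# Share of treated units in regions with no comparable control units.
off_support = np.mean((x_treat < lo) | (x_treat > hi))
```

A large off-support share means estimates for those units rely on extrapolation; common responses are trimming to the overlap region or reporting effects only for the supported subpopulation.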

Pre-Trend Tests (Event Study)

Plot leads and lags of treatment. Pre-treatment coefficients near zero are consistent with parallel trends, though a non-significant pre-trend test does not guarantee the assumption holds (Roth, 2022).

Applies to: DiD, Event Studies, Staggered DiD
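A stylized event-study sketch with one treatment cohort and simulated data in which parallel trends hold by construction. With a single cohort and no covariates, the event-study coefficients reduce to period-by-period treated-control mean differences, normalized to the last pre-period; a full application would run the lead/lag regression with unit and time fixed effects.

```python
import numpy as np

rng = np.random.default_rng(2)

periods = np.arange(-3, 3)   # event time; treatment starts at 0
n = 500                      # units per group per period

# Simulated outcomes: parallel trends hold, true effect of +1.0 from t >= 0.
coefs = {}
for t in periods:
    y_treat = rng.normal(1.0 * (t >= 0), 1.0, n)
    y_ctrl = rng.normal(0.0, 1.0, n)
    coefs[t] = y_treat.mean() - y_ctrl.mean()

# Normalize to the last pre-period, as in a standard event-study plot.
base = coefs[-1]
event_study = {t: c - base for t, c in coefs.items()}
```

Pre-period coefficients near zero are consistent with parallel trends; the post-period coefficients trace out the dynamic treatment effect.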

McCrary Density Test

Test for bunching of the running variable at the cutoff. Bunching raises concerns about manipulation of the running variable, which would violate the continuity assumption underlying RDD.

Applies to: RDD (Sharp), RDD (Fuzzy)
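A crude bin-count version of the density check on simulated data with manipulation built in: mass just below the cutoff is pushed above it. The actual McCrary test estimates the density on each side with local linear methods and tests the log difference formally; this sketch only compares raw counts in a narrow band.

```python
import numpy as np

rng = np.random.default_rng(3)

# Running variable with manipulation: mass just below the cutoff at 0 is
# shifted above it (simulated for illustration).
r = rng.uniform(-1, 1, 5000)
bunch = np.abs(r) < 0.05
r[bunch] = np.abs(r[bunch])   # push the band just below 0 above it

h = 0.05                      # band width around the cutoff
n_below = np.sum((r >= -h) & (r < 0))
n_above = np.sum((r >= 0) & (r < h))

# Crude density-discontinuity summary: log ratio of bin counts at the cutoff.
log_ratio = np.log((n_above + 1) / (n_below + 1))
```

A log ratio far from zero, as here, is the kind of bunching the McCrary test is designed to detect.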

Estimation Diagnostics

First-Stage F-Statistic

For IV: the widely used F < 10 heuristic (Staiger & Stock, 1997) suggests a weak instrument, though more recent work recommends the effective F-statistic with a threshold that depends on the tolerable bias (Montiel Olea & Pflueger, 2013). Consider weak-instrument robust inference (e.g., Anderson-Rubin) when the first stage is borderline.

Applies to: IV / 2SLS, Shift-Share
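A minimal first-stage F computation on simulated data with a single instrument, using the restricted-versus-unrestricted sum-of-squares formula. With one instrument this F equals the squared first-stage t-statistic; real applications should use heteroskedasticity-robust or effective F versions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

z = rng.normal(size=n)              # instrument
x = 0.2 * z + rng.normal(size=n)    # endogenous regressor; first stage = 0.2

# First-stage regression of x on a constant and z.
Z = np.column_stack([np.ones(n), z])
beta, _, _, _ = np.linalg.lstsq(Z, x, rcond=None)
resid = x - Z @ beta

# F-statistic for excluding z: compare the constant-only (restricted) fit.
rss_u = resid @ resid
rss_r = np.sum((x - x.mean()) ** 2)
f_stat = (rss_r - rss_u) / (rss_u / (n - 2))
```

Compare f_stat to the chosen threshold (10 under the classic heuristic, or the Montiel Olea-Pflueger effective-F critical value in robust settings).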

Overidentification Tests

With multiple instruments, test whether all instruments are exogenous (Hansen J test). Rejection suggests at least one instrument may be invalid, though the test has low power in some settings and depends on the other instruments being valid.

Applies to: IV / 2SLS
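A hand-rolled Sargan version of the overidentification test on simulated data with two valid instruments and one endogenous regressor: regress the 2SLS residuals on the instruments and compute n times R-squared, which is approximately chi-squared with (instruments minus endogenous regressors) degrees of freedom under the null. The Hansen J statistic is the heteroskedasticity-robust analogue.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Two instruments, one endogenous regressor; both instruments are valid by
# construction, so the test should not reject (simulated example).
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)                        # confounder driving endogeneity
x = 0.5 * z1 + 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 * x + u                               # structural equation, beta = 1

Z = np.column_stack([np.ones(n), z1, z2])     # instruments (incl. constant)
X = np.column_stack([np.ones(n), x])

# 2SLS: replace X with its projection on Z, then run OLS of y on the fit.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
e = y - X @ beta_2sls                         # residuals use the actual X

# Sargan statistic: n * R^2 from regressing 2SLS residuals on Z.
g = np.linalg.lstsq(Z, e, rcond=None)[0]
r2 = 1 - np.sum((e - Z @ g) ** 2) / np.sum((e - e.mean()) ** 2)
j_stat = n * r2   # ~ chi2(1) under the null that both instruments are valid
```

Here j_stat should fall below the chi-squared(1) critical value of 3.84 most of the time, since both instruments are valid by construction.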

Hausman Test (FE vs RE)

Tests whether the regressors are uncorrelated with the unit-specific effects, the key random effects assumption. Rejection provides evidence in favor of fixed effects, though the test can be sensitive to other specification issues.

Applies to: Fixed Effects, Random Effects

Goodness of Fit (Pre-Treatment)

For synthetic control: how well does the synthetic unit track the treated unit before treatment? Pre-treatment fit is commonly summarized by the root mean squared prediction error (RMSPE).

Applies to: Synthetic Control, Synthetic DiD
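A sketch of the pre-treatment fit check, taking the donor weights as given rather than estimating them (weight estimation is the hard part of synthetic control and is omitted here). The data and weights are hypothetical, constructed so the fit is tight.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical pre-treatment outcomes: 3 donor units over 10 periods.
donors = rng.normal(0, 1, (3, 10))
w = np.array([0.5, 0.3, 0.2])     # synthetic control weights (assumed given)

# Treated unit built to be a near-exact weighted average of the donors.
treated = w @ donors + rng.normal(0, 0.05, 10)

synthetic = w @ donors
rmspe = np.sqrt(np.mean((treated - synthetic) ** 2))
# Judge rmspe relative to the outcome's scale; here the outcome has sd ~1,
# so an RMSPE this small indicates the synthetic unit tracks well.
```

The pre-treatment RMSPE is also the denominator in the standard placebo-based inference procedure (post/pre RMSPE ratios across placebo units).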

Post-Estimation Robustness

Sensitivity Analysis (Oster, Cinelli-Hazlett)

How much unobserved confounding would be needed to explain away your result?

Applies to: OLS, FE, DiD
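A sketch of Oster's bounding calculation on simulated data, using the simple approximation with delta = 1 and the R_max = 1.3 times the controlled R-squared heuristic suggested in Oster (2019). The idea: compare the treatment coefficient and R-squared between a short (uncontrolled) and long (controlled) regression, then extrapolate how much further the coefficient would move if unobservables mattered as much as observables.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Simulated data: treatment d is correlated with an observed control w.
w = rng.normal(size=n)
d = 0.5 * w + rng.normal(size=n)
y = 1.0 * d + 0.5 * w + rng.normal(size=n)   # true effect of d is 1.0

def ols(X, y):
    b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return b, r2

ones = np.ones(n)
b_short, r2_short = ols(np.column_stack([ones, d]), y)       # no controls
b_long, r2_long = ols(np.column_stack([ones, d, w]), y)      # with controls

# Oster approximation, delta = 1, R_max = 1.3 * controlled R^2 (heuristic).
r_max = 1.3 * r2_long
beta_star = b_long[1] - (b_short[1] - b_long[1]) * (r_max - r2_long) / (r2_long - r2_short)
```

If beta_star keeps the same sign as b_long[1], the result survives confounding as strong as the observed controls (under delta = 1 and the chosen R_max).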

Placebo Tests

Apply the method to outcomes, groups, or time periods where the treatment should have no effect.

Applies to: DiD, RDD, Synthetic Control
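A timing-placebo sketch for DiD on simulated data: estimate the same 2x2 DiD contrast using only pre-treatment periods, where the true effect is zero by construction. A placebo estimate near zero is consistent with parallel trends; the real contrast recovers the built-in effect.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000

# Two groups, four periods; true treatment hits group 1 from period 2 on.
# Placebo: pretend treatment started at period 1 and use only pre-data.
y = {}  # (group, period) -> vector of outcomes
for grp in (0, 1):
    for t in range(4):
        effect = 1.0 if (grp == 1 and t >= 2) else 0.0
        y[grp, t] = rng.normal(effect, 1.0, n)

def did(y, post, pre):
    """Simple 2x2 DiD: treated change minus control change."""
    return (y[1, post].mean() - y[1, pre].mean()) - (y[0, post].mean() - y[0, pre].mean())

placebo = did(y, post=1, pre=0)   # should be ~0 if parallel trends hold
actual = did(y, post=2, pre=1)    # recovers the true effect of ~1.0
```

The same logic applies to placebo outcomes (variables the treatment cannot affect) and placebo groups (never-treated units reassigned a fake treatment).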

Specification Curve Analysis

How sensitive are results to analyst degrees of freedom? Run all defensible specifications and report the full distribution of estimates rather than a single preferred one.

Applies to: OLS, FE, DiD
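A minimal specification-curve loop on simulated data: estimate the treatment coefficient under every subset of two candidate controls and summarize the spread. Real specification curves also vary sample restrictions, functional forms, and outcome definitions; this sketch varies only the control set.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
n = 2000

# Simulated data: true treatment effect 1.0, two candidate controls.
c1, c2 = rng.normal(size=n), rng.normal(size=n)
d = 0.3 * c1 + rng.normal(size=n)            # treatment correlated with c1
y = 1.0 * d + 0.5 * c1 + 0.2 * c2 + rng.normal(size=n)

controls = {"c1": c1, "c2": c2}
estimates = {}
for k in range(len(controls) + 1):
    for subset in combinations(sorted(controls), k):
        X = np.column_stack([np.ones(n), d] + [controls[s] for s in subset])
        b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        estimates[subset] = b[1]             # treatment coefficient

spread = max(estimates.values()) - min(estimates.values())
```

Plotting the sorted estimates with markers for which choices each specification made gives the usual specification-curve figure.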

Bandwidth Sensitivity (RDD)

Check whether RDD estimates are robust to different bandwidth choices around the cutoff.

Applies to: RDD (Sharp), RDD (Fuzzy)
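A bandwidth-sensitivity sketch on simulated sharp-RDD data: re-estimate a local linear regression (separate slopes on each side of the cutoff) within several bandwidths and compare the jump estimates. A full analysis would use a data-driven optimal bandwidth and bias-corrected inference; this only illustrates the robustness loop.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5000

r = rng.uniform(-1, 1, n)                       # running variable, cutoff 0
d = (r >= 0).astype(float)
y = 2.0 * d + 1.0 * r + rng.normal(0, 0.5, n)   # true jump of 2.0

estimates = {}
for h in (0.1, 0.2, 0.4):
    m = np.abs(r) < h
    # Local linear fit with separate slopes on each side of the cutoff;
    # the coefficient on d is the estimated discontinuity.
    X = np.column_stack([np.ones(m.sum()), d[m], r[m], d[m] * r[m]])
    b, _, _, _ = np.linalg.lstsq(X, y[m], rcond=None)
    estimates[h] = b[1]
```

Stable estimates across bandwidths, as here, support the design; estimates that drift sharply with h suggest functional-form sensitivity.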

Leave-One-Out (Synthetic Control)

Remove each donor unit one at a time. If results change dramatically, the synthetic control is fragile.

Applies to: Synthetic Control

Common Pitfalls

Bad Controls

Including post-treatment variables as controls can introduce collider bias. As a general principle, control for pre-treatment covariates rather than post-treatment variables.

Applies to: all regression-based methods

Incorrect Clustering

Cluster standard errors at the level of treatment assignment rather than the level of observation. Clustering too finely understates standard errors when errors are correlated within groups.

Applies to: DiD, FE, Experiments
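A sketch of the consequence on simulated data with cluster-level treatment and a shared cluster error component: a hand-rolled cluster-robust (CR0 sandwich) standard error versus the naive homoskedastic one. Real work would use a library implementation with finite-cluster corrections; this illustrates the magnitude of the gap.

```python
import numpy as np

rng = np.random.default_rng(10)
G, n_g = 40, 25                       # 40 clusters of 25 units each
n = G * n_g
g = np.repeat(np.arange(G), n_g)

# Treatment assigned at the cluster level; errors share a cluster component.
d = np.repeat((rng.uniform(size=G) < 0.5).astype(float), n_g)
y = 1.0 * d + np.repeat(rng.normal(0, 1, G), n_g) + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), d])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

# Cluster-robust (CR0) variance: sum over clusters of (X_g' e_g)(X_g' e_g)'.
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for c in range(G):
    m = g == c
    s = X[m].T @ e[m]
    meat += np.outer(s, s)
V = XtX_inv @ meat @ XtX_inv
se_cluster = np.sqrt(V[1, 1])

# Naive (homoskedastic) standard error for comparison.
se_naive = np.sqrt(e @ e / (n - 2) * XtX_inv[1, 1])
```

With treatment assigned by cluster and within-cluster correlation, the naive standard error is several times too small, which is exactly the pitfall.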

p-Hacking / Specification Searching

Running many specifications and reporting only significant ones inflates false positive rates. Pre-registration can help mitigate this concern.

Applies to: all methods

Winner's Curse

Published effect sizes tend to be overestimated because statistically significant results are more likely to be published, and conditioning on significance selects for estimates that overshoot the true effect.

Applies to: all methods

Weak Instruments

IV with weak instruments can exhibit bias toward the OLS estimate, and in finite samples this bias can exceed that of OLS itself (Bound, Jaeger & Baker, 1995; Staiger & Stock, 1997). Standard practice is to check the first-stage F-statistic and consider weak-instrument robust inference.

Applies to: IV / 2SLS

Negative Weights in TWFE

With staggered treatment and heterogeneous effects, TWFE can produce negative weights on some ATTs. Goodman-Bacon (2021, Journal of Econometrics) showed that the TWFE estimator is a weighted average of all 2×2 DiD comparisons, including problematic ones using already-treated units as controls. A common recommendation is to run the Goodman-Bacon decomposition as a diagnostic and, when contaminated comparisons have substantial weight, use modern DiD estimators that are robust to treatment effect heterogeneity.

Applies to: Staggered DiD, Event Studies

Related Practices

These research practices provide formal frameworks for many of the diagnostic checks described above.