Causal Inference Anti-Patterns
Common mistakes and anti-patterns in causal inference research. A catalog of errors to avoid in research design, estimation, interpretation, and reporting, with explanations of why each is wrong and what to do instead.
What Is an Anti-Pattern?
An anti-pattern is a common practice that appears reasonable but is actually wrong or misleading. In causal inference, anti-patterns are especially dangerous because they can produce results that look convincing — complete with small p-values and well-formatted tables — while being fundamentally flawed.
This guide catalogs the most common anti-patterns encountered in applied causal inference papers. For each, we explain why it is wrong, how to recognize it, and what to do instead.
Design Anti-Patterns
1. Controls Without a Design
The mistake: Running OLS with a list of control variables and claiming the coefficient on the treatment variable is causal because "we controlled for everything."
Why it is wrong: Controls only address observed confounders. If any relevant confounder is unobserved, the estimate is biased. Moreover, "controlling for" a post-treatment variable or a collider can introduce new bias rather than removing it. Without a clear identification strategy — a source of exogenous variation or a credible argument for why all confounders are observed — adding controls is insufficient for causal claims.
What to do instead: Identify a source of exogenous variation (natural experiment) or, if relying on selection on observables, formally justify the conditional independence assumption and conduct sensitivity analysis for unobserved confounders. See the observational data workflow for a structured approach.
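One concrete sensitivity check is the E-value (VanderWeele and Ding, 2017): the minimum strength of association, on the risk-ratio scale, that an unobserved confounder would need with both treatment and outcome to fully explain away an observed effect. A minimal sketch in Python (the function name is ours):

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (VanderWeele & Ding, 2017).

    Returns the minimum risk ratio an unobserved confounder would need
    with both treatment and outcome to explain away the estimate.
    """
    if rr < 1:
        rr = 1 / rr  # for protective effects, invert first
    return rr + math.sqrt(rr * (rr - 1))

# An observed RR of 2 would require a confounder associated with both
# treatment and outcome at RR >= 3.41 to fully account for it.
print(round(e_value(2.0), 2))  # 3.41
```

A large E-value does not prove the estimate is causal; it quantifies how strong hidden confounding would have to be.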
2. Controlling for Post-Treatment Variables
The mistake: Including variables that are themselves affected by the treatment as controls in a regression. For example, controlling for a firm's R&D spending when studying the effect of a tax policy on innovation, when the tax policy itself affects R&D spending.
Why it is wrong: Post-treatment controls block part of the causal pathway from treatment to outcome. If the treatment affects the outcome partly through the post-treatment variable, conditioning on it removes that indirect effect and biases the estimated total effect. Even worse, it can introduce collider bias.
What to do instead: Only control for pre-treatment variables. If you want to understand mechanisms (how much of the effect works through a specific channel), use causal mediation analysis, not "adding the mediator as a control."
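The bias is easy to see in a simulation. In the made-up data below, the treatment's total effect on the outcome is 2 (a direct effect of 1 plus 1 through a mediator); "controlling for" the mediator drags the estimate down to the direct effect alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
t = rng.normal(size=n)            # treatment (pre-determined)
m = t + rng.normal(size=n)        # mediator, affected by treatment
y = t + m + rng.normal(size=n)    # total effect of t on y is 2

def ols(y, *cols):
    X = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(y, t)[1]        # ~2: the correct total effect
b_blocked = ols(y, t, m)[1]   # ~1: the mediator "control" removes the indirect path
print(b_total, b_blocked)
```

The second regression is not wrong as a description of the conditional mean; it simply no longer answers the total-effect question.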
Quick check: You are studying the effect of a new management training program on employee productivity, and you control for employee satisfaction measured six months after the training. What is wrong? Satisfaction is a post-treatment variable: the training plausibly affects it, so conditioning on it blocks part of the causal pathway and can introduce collider bias.
3. Matching on Post-Treatment Variables
The mistake: Constructing matched samples using variables measured after treatment assignment. This error is the matching equivalent of controlling for post-treatment variables.
Why it is wrong: The same reasoning applies. If treatment affects the matching variables, matching on them distorts the comparison and introduces bias.
What to do instead: Match only on pre-treatment variables. If the only comparable variables are post-treatment, consider whether matching is the right approach at all.
Estimation Anti-Patterns
4. Wrong Standard Errors
The mistake: Using default (homoskedastic) standard errors, or clustering at the wrong level.
Why it is wrong: Default standard errors assume that errors are independently and identically distributed. In practice, errors are nearly always heteroskedastic and often correlated within groups (firms over time, students within schools, counties within states). Using the wrong standard errors dramatically understates uncertainty, producing artificially small p-values.
What to do instead: At minimum, use heteroskedasticity-robust standard errors (HC1 or HC3). If treatment is assigned at a group level, cluster standard errors at that level. When in doubt, cluster at the most aggregate level of treatment assignment. With very few clusters (fewer than 30-50), consider wild cluster bootstrap.
5. TWFE with Staggered Treatment
The mistake: Using a standard two-way fixed effects regression (unit and time fixed effects) when treatment is adopted at different times by different units, without recognizing that TWFE can produce biased estimates under treatment effect heterogeneity.
Why it is wrong: TWFE with staggered treatment implicitly uses already-treated units as controls and applies negative weights to some treatment effects (Goodman-Bacon, 2021). When effects vary across cohorts or over time (which they usually do), the TWFE estimate can be severely biased, even with the wrong sign.
What to do instead: Use modern staggered DiD estimators: Callaway and Sant'Anna (2021), Sun and Abraham (2021), Borusyak et al. (2024), or de Chaisemartin and D'Haultfoeuille (2020). These estimators avoid the problematic comparisons that bias TWFE.
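The bias is mechanical, not a small-sample fluke. The deterministic example below has three units (never-treated, early adopter, late adopter), no noise, and treatment effects that grow with time since adoption. The true average effect on the treated is 18/7 ≈ 2.57, but TWFE returns 27/22 ≈ 1.23:

```python
import numpy as np

T = 6
adopt = {0: None, 1: 1, 2: 4}   # unit -> first treated period (None = never treated)
rows = []
for i, t0 in adopt.items():
    for t in range(T):
        d = int(t0 is not None and t >= t0)
        tau = (t - t0 + 1) if d else 0.0   # effect grows with time since adoption
        rows.append((i, t, d, tau * d))
units, times, D, Y = map(np.array, zip(*rows))

# TWFE: regress Y on D plus unit and time dummies
X = np.column_stack([
    D.astype(float),
    np.eye(3)[units][:, 1:],    # unit fixed effects (one dropped)
    np.eye(T)[times][:, 1:],    # time fixed effects (one dropped)
    np.ones(len(Y)),
])
twfe = np.linalg.lstsq(X, Y.astype(float), rcond=None)[0][0]
true_att = Y[D == 1].mean()
print(twfe, true_att)  # ~1.23 vs ~2.57: TWFE badly understates the effect
```

The gap comes entirely from using the early adopter's still-growing effects as a "control" trend for the late adopter.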
6. Weak Instruments
The mistake: Using an instrument with a first-stage F-statistic below 10 (or below the Stock-Yogo critical values) and proceeding as if the IV estimate is reliable.
Why it is wrong: Weak instruments bias the IV estimate toward the OLS estimate in finite samples, defeating the purpose of instrumenting. They also produce unreliable standard errors, making inference invalid.
What to do instead: Report the first-stage F-statistic prominently. If it is below the Stock-Yogo thresholds, use weak-instrument-robust inference methods (Anderson-Rubin confidence sets, the tF procedure). If the instrument is truly weak, it may not be useful for credible estimation.
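A quick Monte Carlo (made-up DGP) shows the problem: with a nearly irrelevant instrument, the first-stage F hovers around 1, and the 2SLS estimates pile up near the biased OLS estimate rather than the true effect of zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sims, pi, true_beta = 200, 500, 0.02, 0.0
f_stats, iv_est, ols_est = [], [], []
for _ in range(sims):
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    x = pi * z + v                       # first stage: very weak
    u = 0.8 * v + rng.normal(size=n)     # endogeneity: corr(x, u) > 0
    y = true_beta * x + u
    # first-stage F for a single instrument = squared first-stage t-stat
    bz = (z @ x) / (z @ z)
    resid = x - bz * z
    se = np.sqrt(resid @ resid / (n - 1) / (z @ z))
    f_stats.append((bz / se) ** 2)
    iv_est.append((z @ y) / (z @ x))     # simple IV (single instrument, no intercept)
    ols_est.append((x @ y) / (x @ x))
mean_f, med_iv, mean_ols = np.mean(f_stats), np.median(iv_est), np.mean(ols_est)
print(mean_f, med_iv, mean_ols)  # F near 1; median IV near the OLS bias of ~0.8
```

Instrumenting with something this weak buys you nothing: the IV estimate inherits the OLS bias while adding enormous variance.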
7. Global Polynomials in RDD
The mistake: Fitting high-order polynomials (cubic, quartic, or higher) in the running variable across the entire support, rather than using local linear regression near the cutoff.
Why it is wrong: Global polynomials are sensitive to observations far from the cutoff, can generate spurious discontinuities, and are difficult to justify theoretically. They provide poor approximations to the true conditional expectation function near the cutoff.
What to do instead: Use local linear regression with a data-driven bandwidth (Calonico-Cattaneo-Titiunik procedure) as described in the sharp RDD method guide. Show robustness to different bandwidths. Plot the data with a scatter plot and local polynomial fit to visually assess the discontinuity.
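A minimal local linear sketch (noise-free for clarity, rectangular kernel, made-up bandwidths): fit separate lines on each side of the cutoff within a bandwidth and take the difference in intercepts at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 2_000)          # running variable, cutoff at 0
y = 0.5 * x + 2.0 * (x >= 0)           # true jump at the cutoff is 2

def local_linear_jump(x, y, h):
    """Difference in intercepts of side-specific lines fit within bandwidth h."""
    est = []
    for side in (x < 0, x >= 0):
        keep = side & (np.abs(x) <= h)
        X = np.column_stack([np.ones(keep.sum()), x[keep]])
        est.append(np.linalg.lstsq(X, y[keep], rcond=None)[0][0])
    return est[1] - est[0]

for h in (0.1, 0.25, 0.5):             # always show robustness to the bandwidth
    print(h, local_linear_jump(x, y, h))
```

Real applications would use a triangular kernel, a data-driven bandwidth, and bias-corrected inference (e.g. the rdrobust package), but the comparison logic is the same.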
Interpretation Anti-Patterns
8. Conflating Statistical and Economic Significance
The mistake: Declaring a result "significant" and treating this declaration as sufficient evidence that the effect matters. Or conversely, interpreting a non-significant result as evidence of no effect.
Why it is wrong: Statistical significance depends on sample size. With a large enough sample, trivially small effects become statistically significant. With a small sample, economically important effects can be statistically insignificant. Significance tells you whether the data can distinguish the effect from zero — not whether the effect is large enough to matter.
What to do instead: Report and interpret the effect size. Express it as a percentage of the dependent-variable mean, in standard deviation units, or relative to other known interventions. Report confidence intervals, not just p-values.
9. Treating Non-Significant Pre-Trends as Proof of Parallel Trends
The mistake: Running an event study for DiD, finding that pre-treatment coefficients are individually non-significant, and concluding that parallel trends hold.
Why it is wrong: Non-significance means you failed to reject the null of no pre-trend. It does not mean the null is true. If your pre-period is short, your sample is small, or your outcome is noisy, you may have very low power to detect meaningful pre-trends.
What to do instead: Examine the magnitude and pattern of pre-treatment coefficients, not just their significance. Report a joint F-test. Plot confidence intervals. Consider Rambachan and Roth's (2023) "honest" DiD methods, which allow for bounded violations of parallel trends.
10. Overgeneralizing a LATE
The mistake: Estimating a local average treatment effect (from IV, RDD, or a complier analysis) and interpreting it as the average treatment effect for the entire population.
Why it is wrong: A LATE applies to compliers — the subpopulation whose treatment status is changed by the instrument. This subpopulation may be very different from the overall population. An RDD estimate applies to units at the cutoff. These local estimates do not necessarily generalize.
What to do instead: Be explicit about who the estimate applies to. Discuss what is known about compliers (their characteristics, how they differ from always-takers and never-takers). If generalization is important, discuss why the local effect might or might not apply more broadly.
Reporting Anti-Patterns
11. Specification Searching (p-Hacking)
The mistake: Running many specifications and reporting only those that produce significant results. This practice includes trying different control sets, different samples, different functional forms, and different outcome definitions until something is significant.
Why it is wrong: With enough specifications, you can find significance by chance. Twenty independent tests at the 5% level will produce one "significant" result on average, even when the true effect is zero. Selective reporting inflates the published rate of false positives.
What to do instead: Pre-register your analysis plan. If that is not feasible, report a specification curve or multiverse analysis showing results across all defensible specifications. Be transparent about what you tried.
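A minimal specification-curve sketch: estimate the treatment coefficient under every subset of a set of candidate controls and report the full range, not one chosen cell. In this made-up DGP the controls are independent of treatment, so the curve is tight around the true effect of 1:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
d = rng.normal(size=n)                       # treatment
controls = rng.normal(size=(n, 3))           # candidate controls (not confounders here)
y = 1.0 * d + controls @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

estimates = []
for k in range(4):
    for subset in itertools.combinations(range(3), k):
        X = np.column_stack([np.ones(n), d] + [controls[:, j] for j in subset])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(len(estimates), min(estimates), max(estimates))  # 8 specifications, all near 1.0
```

In real applications the interesting case is when the curve is not tight; then the spread itself is the finding to report and explain.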
P-Hacking Simulator: Finding 'Significance' in Random Noise
Each simulation runs multiple regressions on completely random data — no true effect exists. Watch how often you get a 'significant' result (p < 0.05) purely by chance. Increase the number of regressions to see the problem grow worse: the more tests you run, the more false positives you find.
Try running this simulation several times with 20 regressions at the 0.05 threshold. On average, you will find 1 "significant" result out of 20 — exactly the 5% false positive rate. Now increase the number of regressions to 50. The expected number of false positives jumps to 2.5. This arithmetic is why testing many specifications and reporting only the significant ones is deeply misleading: with enough tests, you will always find something "significant," even when the true effect is exactly zero.
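The simulator's logic fits in a few lines of Python (scipy assumed installed): regress pure-noise outcomes on pure-noise regressors many times and count how often p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_tests, reps = 100, 20, 200
false_pos = 0
for _ in range(reps):
    y = rng.normal(size=n)                  # outcome: pure noise, no true effect
    for _ in range(n_tests):
        x = rng.normal(size=n)              # a fresh "specification"
        r, p = stats.pearsonr(x, y)         # equivalent to the t-test on the slope
        false_pos += p < 0.05
rate = false_pos / (reps * n_tests)
print(rate)  # ~0.05: about one in twenty specifications "works" by chance
```

Reporting only the specifications that cross the threshold converts this baseline false positive rate into a near-certainty of a "finding."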
12. Cherry-Picking Robustness Checks
The mistake: Claiming robustness by showing results from specifications that all tell a similar story while omitting the specifications that do not.
Why it is wrong: True robustness means varying the assumptions, not just the details. If all your "robustness checks" use the same data, similar models, and identical identification assumptions, they are not independent tests of your result.
What to do instead: Include genuinely different approaches — different identification strategies, different samples, different functional forms. Report the full range of estimates. If some specifications produce different results, discuss why and what this discrepancy implies.
Quick Reference: Anti-Pattern Checklist
Before submitting your paper, verify:
- Your identification relies on a design, not just controls
- No post-treatment variables are included as controls or matching variables
- Standard errors are clustered at the level of treatment assignment
- If using TWFE with staggered adoption, you have checked for negative weights or used modern estimators
- First-stage F-statistic is reported and above relevant thresholds (for IV)
- RDD uses local linear regression, not global polynomials
- Effect sizes are interpreted economically, not just statistically
- Pre-trend analysis considers power, not just significance
- The estimand (ATE, ATT, LATE) is stated and the generalization discussion matches
- All pre-specified analyses are reported, not just the significant ones
- Robustness checks vary assumptions, not just specification details