Causal Inference Anti-Patterns
Common mistakes and anti-patterns in causal inference research. A catalog of errors to avoid in research design, estimation, interpretation, and reporting, with explanations of why each is wrong and what to do instead.
What Is an Anti-Pattern?
An anti-pattern is a common practice that appears reasonable but is actually wrong or misleading. In causal inference, anti-patterns are especially dangerous because they can produce results that look convincing — complete with small p-values and well-formatted tables — while being fundamentally flawed.
This guide catalogs the most common anti-patterns encountered in applied causal inference papers. For each, we explain why it is wrong, how to recognize it, and what to do instead.
Design Anti-Patterns
1. Controls Without a Design
The mistake: Running OLS with a list of control variables and claiming the coefficient on the treatment variable is causal because "we controlled for everything."
Why it is wrong: Controls only address observed confounders. If any relevant confounder is unobserved, the estimate is biased. Moreover, "controlling for" a post-treatment variable or a collider can introduce new bias rather than removing it. Without a clear identification strategy — a source of exogenous variation or a credible argument for why all confounders are observed — adding controls is insufficient for causal claims.
What to do instead: Identify a source of exogenous variation (natural experiment) or, if relying on selection on observables, formally justify the conditional independence assumption and conduct sensitivity analysis for unobserved confounders. See the observational data workflow for a structured approach.
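One concrete sensitivity check is the E-value (VanderWeele and Ding, 2017): the minimum strength of association, on the risk-ratio scale, that an unobserved confounder would need with both treatment and outcome to fully explain away an observed effect. A minimal sketch in Python (the function name is ours):

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (VanderWeele & Ding, 2017).

    Returns the minimum risk ratio an unobserved confounder would need
    with both treatment and outcome to explain away the estimate.
    """
    if rr < 1:
        rr = 1 / rr  # for protective effects, invert first
    return rr + math.sqrt(rr * (rr - 1))

# An observed RR of 2 would require a confounder associated with both
# treatment and outcome at RR >= 3.41 to fully account for it.
print(round(e_value(2.0), 2))  # 3.41
```

A large E-value does not prove the estimate is causal; it quantifies how strong hidden confounding would have to be.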
2. Controlling for Post-Treatment Variables
The mistake: Including variables that are themselves affected by the treatment as controls in a regression. For example, controlling for a firm's R&D spending when studying the effect of a tax policy on innovation, when the tax policy itself affects R&D spending.
Why it is wrong: Post-treatment controls block part of the causal pathway from treatment to outcome. If the treatment affects the outcome partly through the post-treatment variable, conditioning on it removes that indirect effect and biases the estimated total effect. Even worse, it can introduce collider bias.
What to do instead: Only control for pre-treatment variables. If you want to understand mechanisms (how much of the effect works through a specific channel), use causal mediation analysis, not "adding the mediator as a control."
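The bias is easy to see in a simulation. In the made-up data below, the treatment's total effect on the outcome is 2 (a direct effect of 1 plus 1 through a mediator); "controlling for" the mediator drags the estimate down to the direct effect alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
t = rng.normal(size=n)            # treatment (pre-determined)
m = t + rng.normal(size=n)        # mediator, affected by treatment
y = t + m + rng.normal(size=n)    # total effect of t on y is 2

def ols(y, *cols):
    X = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_total = ols(y, t)[1]        # ~2: the correct total effect
b_blocked = ols(y, t, m)[1]   # ~1: the mediator "control" removes the indirect path
print(b_total, b_blocked)
```

The second regression is not wrong as a description of the conditional mean; it simply no longer answers the total-effect question.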
Quick check: You are studying the effect of a new management training program on employee productivity, and you control for employee satisfaction measured six months after the training. What is wrong? Satisfaction is a post-treatment variable: the training plausibly affects it, so conditioning on it blocks part of the causal pathway and can introduce collider bias.
3. Matching on Post-Treatment Variables
The mistake: Constructing matched samples using variables measured after treatment assignment. This error is the matching equivalent of controlling for post-treatment variables.
Why it is wrong: The same reasoning applies. If treatment affects the matching variables, matching on them distorts the comparison and introduces bias.
What to do instead: Match only on pre-treatment variables. If the only comparable variables are post-treatment, consider whether matching is the right approach at all.
Estimation Anti-Patterns
4. Wrong Standard Errors
The mistake: Using default (homoskedastic) standard errors, or clustering at the wrong level.
Why it is wrong: Default standard errors assume that errors are independently and identically distributed. In practice, errors are nearly always heteroskedastic and often correlated within groups (firms over time, students within schools, counties within states). Using the wrong standard errors dramatically understates uncertainty, producing artificially small p-values.
What to do instead: At minimum, use heteroskedasticity-robust standard errors (HC1 or HC3). If treatment is assigned at a group level, cluster standard errors at that level. When in doubt, cluster at the most aggregate level of treatment assignment. With very few clusters (fewer than 30-50), consider wild cluster bootstrap.
5. TWFE with Staggered Treatment
The mistake: Using a standard two-way fixed effects regression (unit and time fixed effects) when treatment is adopted at different times by different units, without recognizing that TWFE can produce biased estimates under treatment effect heterogeneity.
Why it is wrong: TWFE with staggered treatment implicitly uses already-treated units as controls and applies negative weights to some treatment effects (Goodman-Bacon, 2021). When effects vary across cohorts or over time (which they usually do), the TWFE estimate can be severely biased, even with the wrong sign.
What to do instead: Use modern staggered DiD estimators: Callaway and Sant'Anna (2021), Sun and Abraham (2021), Borusyak et al. (2024), or de Chaisemartin and D'Haultfoeuille (2020). These estimators avoid the problematic comparisons that bias TWFE.
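The bias is mechanical, not a small-sample fluke. The deterministic example below has three units (never-treated, early adopter, late adopter), no noise, and treatment effects that grow with time since adoption. The true average effect on the treated is 18/7 ≈ 2.57, but TWFE returns 27/22 ≈ 1.23:

```python
import numpy as np

T = 6
adopt = {0: None, 1: 1, 2: 4}   # unit -> first treated period (None = never treated)
rows = []
for i, t0 in adopt.items():
    for t in range(T):
        d = int(t0 is not None and t >= t0)
        tau = (t - t0 + 1) if d else 0.0   # effect grows with time since adoption
        rows.append((i, t, d, tau * d))
units, times, D, Y = map(np.array, zip(*rows))

# TWFE: regress Y on D plus unit and time dummies
X = np.column_stack([
    D.astype(float),
    np.eye(3)[units][:, 1:],    # unit fixed effects (one dropped)
    np.eye(T)[times][:, 1:],    # time fixed effects (one dropped)
    np.ones(len(Y)),
])
twfe = np.linalg.lstsq(X, Y.astype(float), rcond=None)[0][0]
true_att = Y[D == 1].mean()
print(twfe, true_att)  # ~1.23 vs ~2.57: TWFE badly understates the effect
```

The gap comes entirely from using the early adopter's still-growing effects as a "control" trend for the late adopter.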
6. Weak Instruments
The mistake: Using an instrument with a first-stage F-statistic below 10 (or below the Stock-Yogo critical values) and proceeding as if the IV estimate is reliable.
Why it is wrong: Weak instruments bias the IV estimate toward the OLS estimate in finite samples, defeating the purpose of instrumenting. They also produce unreliable standard errors, making inference invalid.
What to do instead: Report the first-stage F-statistic prominently. If it is below the Stock-Yogo thresholds, use weak-instrument-robust inference methods (Anderson-Rubin confidence sets, the tF procedure). If the instrument is truly weak, it may not be useful for credible estimation.
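A quick Monte Carlo (made-up DGP) shows the problem: with a nearly irrelevant instrument, the first-stage F hovers around 1, and the 2SLS estimates pile up near the biased OLS estimate rather than the true effect of zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sims, pi, true_beta = 200, 500, 0.02, 0.0
f_stats, iv_est, ols_est = [], [], []
for _ in range(sims):
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    x = pi * z + v                       # first stage: very weak
    u = 0.8 * v + rng.normal(size=n)     # endogeneity: corr(x, u) > 0
    y = true_beta * x + u
    # first-stage F for a single instrument = squared first-stage t-stat
    bz = (z @ x) / (z @ z)
    resid = x - bz * z
    se = np.sqrt(resid @ resid / (n - 1) / (z @ z))
    f_stats.append((bz / se) ** 2)
    iv_est.append((z @ y) / (z @ x))     # simple IV (single instrument, no intercept)
    ols_est.append((x @ y) / (x @ x))
mean_f, med_iv, mean_ols = np.mean(f_stats), np.median(iv_est), np.mean(ols_est)
print(mean_f, med_iv, mean_ols)  # F near 1; median IV near the OLS bias of ~0.8
```

Instrumenting with something this weak buys you nothing: the IV estimate inherits the OLS bias while adding enormous variance.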
7. Global Polynomials in RDD
The mistake: Fitting high-order polynomials (cubic, quartic, or higher) in the running variable across the entire support, rather than using local linear regression near the cutoff.
Why it is wrong: Global polynomials are sensitive to observations far from the cutoff, can generate spurious discontinuities, and are difficult to justify theoretically. They provide poor approximations to the true conditional expectation function near the cutoff.
What to do instead: Use local linear regression with a data-driven bandwidth (Calonico-Cattaneo-Titiunik procedure) as described in the sharp RDD method guide. Show robustness to different bandwidths. Plot the data with a scatter plot and local polynomial fit to visually assess the discontinuity.
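A minimal local linear sketch (noise-free for clarity, rectangular kernel, made-up bandwidths): fit separate lines on each side of the cutoff within a bandwidth and take the difference in intercepts at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 2_000)          # running variable, cutoff at 0
y = 0.5 * x + 2.0 * (x >= 0)           # true jump at the cutoff is 2

def local_linear_jump(x, y, h):
    """Difference in intercepts of side-specific lines fit within bandwidth h."""
    est = []
    for side in (x < 0, x >= 0):
        keep = side & (np.abs(x) <= h)
        X = np.column_stack([np.ones(keep.sum()), x[keep]])
        est.append(np.linalg.lstsq(X, y[keep], rcond=None)[0][0])
    return est[1] - est[0]

for h in (0.1, 0.25, 0.5):             # always show robustness to the bandwidth
    print(h, local_linear_jump(x, y, h))
```

Real applications would use a triangular kernel, a data-driven bandwidth, and bias-corrected inference (e.g. the rdrobust package), but the comparison logic is the same.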
Interpretation Anti-Patterns
8. Conflating Statistical and Economic Significance
The mistake: Declaring a result "significant" and treating this declaration as sufficient evidence that the effect matters. Or conversely, interpreting a non-significant result as evidence of no effect.
Why it is wrong: Statistical significance depends on sample size. With a large enough sample, trivially small effects become statistically significant. With a small sample, economically important effects can be statistically insignificant. Significance tells you whether the data can distinguish the effect from zero — not whether the effect is large enough to matter.
What to do instead: Report and interpret the effect size. Express it as a percentage of the dependent-variable mean, in standard deviation units, or relative to other known interventions. Report confidence intervals, not just p-values.
9. Treating Non-Significant Pre-Trends as Proof of Parallel Trends
The mistake: Running an event study for DiD, finding that pre-treatment coefficients are individually non-significant, and concluding that parallel trends hold.
Why it is wrong: Non-significance means you failed to reject the null of no pre-trend. It does not mean the null is true. If your pre-period is short, your sample is small, or your outcome is noisy, you may have very low power to detect meaningful pre-trends.
What to do instead: Examine the magnitude and pattern of pre-treatment coefficients, not just their significance. Report a joint F-test. Plot confidence intervals. Consider Rambachan and Roth's (2023) "honest" DiD methods, which allow for bounded violations of parallel trends.
10. Overgeneralizing a LATE
The mistake: Estimating a local average treatment effect (from IV, RDD, or a complier analysis) and interpreting it as the average treatment effect for the entire population.
Why it is wrong: A LATE applies to compliers — the subpopulation whose treatment status is changed by the instrument. This subpopulation may be very different from the overall population. An RDD estimate applies to units at the cutoff. These local estimates do not necessarily generalize.
What to do instead: Be explicit about who the estimate applies to. Discuss what is known about compliers (their characteristics, how they differ from always-takers and never-takers). If generalization is important, discuss why the local effect might or might not apply more broadly.
Reporting Anti-Patterns
11. Specification Searching (p-Hacking)
The mistake: Running many specifications and reporting only those that produce significant results. This practice includes trying different control sets, different samples, different functional forms, and different outcome definitions until something is significant.
Why it is wrong: With enough specifications, you can find significance by chance. Twenty independent tests at the 5% level will produce one "significant" result on average, even when the true effect is zero. Selective reporting inflates the published rate of false positives.
What to do instead: Pre-register your analysis plan. If that is not feasible, report a specification curve or multiverse analysis showing results across all defensible specifications. Be transparent about what you tried.
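A minimal specification-curve sketch: estimate the treatment coefficient under every subset of a set of candidate controls and report the full range, not one chosen cell. In this made-up DGP the controls are independent of treatment, so the curve is tight around the true effect of 1:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
d = rng.normal(size=n)                       # treatment
controls = rng.normal(size=(n, 3))           # candidate controls (not confounders here)
y = 1.0 * d + controls @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

estimates = []
for k in range(4):
    for subset in itertools.combinations(range(3), k):
        X = np.column_stack([np.ones(n), d] + [controls[:, j] for j in subset])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(len(estimates), min(estimates), max(estimates))  # 8 specifications, all near 1.0
```

In real applications the interesting case is when the curve is not tight; then the spread itself is the finding to report and explain.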
P-Hacking Simulator: Finding 'Significance' in Random Noise
Each simulation runs multiple regressions on completely random data — no true effect exists. Watch how often you get a 'significant' result (p < 0.05) purely by chance. Increase the number of regressions to see the problem grow worse: the more tests you run, the more false positives you find.
Try running this simulation several times with 20 regressions at the 0.05 threshold. On average, you will find 1 "significant" result out of 20 — exactly the 5% false positive rate. Now increase the number of regressions to 50. The expected number of false positives jumps to 2.5. This arithmetic is why testing many specifications and reporting only the significant ones is deeply misleading: with enough tests, you will always find something "significant," even when the true effect is exactly zero.
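The simulator's logic fits in a few lines of Python (scipy assumed installed): regress pure-noise outcomes on pure-noise regressors many times and count how often p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_tests, reps = 100, 20, 200
false_pos = 0
for _ in range(reps):
    y = rng.normal(size=n)                  # outcome: pure noise, no true effect
    for _ in range(n_tests):
        x = rng.normal(size=n)              # a fresh "specification"
        r, p = stats.pearsonr(x, y)         # equivalent to the t-test on the slope
        false_pos += p < 0.05
rate = false_pos / (reps * n_tests)
print(rate)  # ~0.05: about one in twenty specifications "works" by chance
```

Reporting only the specifications that cross the threshold converts this baseline false positive rate into a near-certainty of a "finding."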
12. Cherry-Picking Robustness Checks
The mistake: Claiming robustness by showing results from specifications that all tell a similar story while omitting the specifications that do not.
Why it is wrong: True robustness means varying the assumptions, not just the details. If all your "robustness checks" use the same data, similar models, and identical identification assumptions, they are not independent tests of your result.
What to do instead: Include genuinely different approaches — different identification strategies, different samples, different functional forms. Report the full range of estimates. If some specifications produce different results, discuss why and what this discrepancy implies.
Quick Reference: Anti-Pattern Checklist
Before submitting your paper, verify:
- Your identification relies on a design, not just controls
- No post-treatment variables are included as controls or matching variables
- Standard errors are clustered at the level of treatment assignment
- If using TWFE with staggered adoption, you have checked for negative weights or used modern estimators
- First-stage F-statistic is reported and above relevant thresholds (for IV)
- RDD uses local linear regression, not global polynomials
- Effect sizes are interpreted economically, not just statistically
- Pre-trend analysis considers power, not just significance
- The estimand (ATE, ATT, LATE) is stated and the generalization discussion matches
- All pre-specified analyses are reported, not just the significant ones
- Robustness checks vary assumptions, not just specification details