Causal Forests / Heterogeneous Treatment Effects
Estimates how treatment effects vary across individuals — who benefits most and who benefits least.
Quick Reference
- When to Use
- When you want to estimate treatment effect heterogeneity across subgroups without pre-specifying which subgroups matter — discovering who benefits most and who benefits least.
- Key Assumption
- Unconfoundedness (conditional on observables) plus honest estimation (separate training and estimation samples via sample splitting). Overlap (positivity) is also required.
- Common Mistake
- Interpreting variable importance from the forest as causal moderation — it reflects predictive importance for heterogeneity, not causal importance of those variables.
- Estimated Time
- 3 hours
One-Line Implementation
- Stata: no native package; via rcall: library(grf); cf <- causal_forest(X, Y, W)
- R (grf): causal_forest(X, Y, W, num.trees = 2000, honesty = TRUE)
- Python (EconML): CausalForestDML(model_y='auto', model_t='auto', n_estimators=2000).fit(Y, T, X=X)
Download Full Analysis Code: complete scripts with diagnostics, robustness checks, and result export.
Motivating Example
A hospital conducts a randomized trial of a new drug for heart disease. The trial shows that, on average, the drug reduces the risk of a heart attack by 5 percentage points. But the hospital administrator asks: which patients benefit most?
Perhaps the drug works well for patients with high cholesterol but provides little benefit for those with low cholesterol. Perhaps it is more effective for older patients. Perhaps there is a complex interaction between age, cholesterol, blood pressure, and diabetes status.
Traditional subgroup analysis requires you to pre-specify: "Let me check patients above vs. below age 65." But with 50 potential effect modifiers, this approach encounters the multiple testing problem directly. And if the analysis is exploratory rather than pre-specified, it is likely to produce "significant" heterogeneity even when there is none.
Causal forests solve this by using the random forest algorithm, modified to target treatment effect estimation rather than outcome prediction. They discover which covariates drive heterogeneity without requiring you to pre-specify the subgroups.
(Athey & Imbens, 2016; Wager & Athey, 2018)

A. Overview
From Prediction to Causal Estimation
A standard random forest predicts the average outcome within each leaf, much like OLS predicts conditional means. A causal forest instead estimates the treatment effect within each leaf. The forest's output is a Conditional Average Treatment Effect (CATE):

tau(x) = E[Y(1) - Y(0) | X = x]

This formula tells you: for a person with characteristics X = x, what is the expected effect of treatment?
How Causal Trees Differ from Regular Trees
Regular trees split on variables that best predict the outcome Y. Causal trees split on variables that best separate treatment effects. At each node, the split is chosen to maximize the variance of treatment effects across the two child nodes. If splitting on "age > 65" produces groups with 10% and 1% treatment effects, that separation is a good split. If splitting on "gender" produces two groups with 5% effects each, it is uninformative.
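The split criterion can be sketched in a few lines. This is an illustrative toy with made-up data and a simplified size-weighted variance criterion, not grf's actual implementation:

```python
# Toy illustration of the causal-tree splitting criterion: choose the
# split that best separates estimated treatment effects, not outcomes.
# Rows are hypothetical (covariate, treatment indicator, outcome) triples.

def tau_hat(rows):
    """Difference-in-means treatment effect within a node."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def split_score(rows, threshold):
    """Size-weighted variance of child-node treatment effects."""
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    tl, tr = tau_hat(left), tau_hat(right)
    n = len(rows)
    mean = (len(left) * tl + len(right) * tr) / n
    return (len(left) * (tl - mean) ** 2 + len(right) * (tr - mean) ** 2) / n

# By construction the effect is ~1 below age 65 and ~10 above it.
data = [(50, 0, 0), (50, 1, 1), (55, 0, 0), (55, 1, 1),
        (60, 0, 0), (60, 1, 1), (70, 0, 0), (70, 1, 10),
        (75, 0, 0), (75, 1, 10)]

best = max([55, 65], key=lambda t: split_score(data, t))
print(best)  # 65 -- the split that separates the 1 vs 10 effects
```

Splitting at 65 cleanly separates the low-effect and high-effect groups, so it scores higher than splitting at 55.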
Honest Estimation
A key innovation is honest estimation. In a standard random forest, the same data determine tree structure and estimate predictions within leaves. This dual use creates overfitting.
In an honest causal forest, data are split:
- Structure sample: Determines where to split
- Estimation sample: Estimates treatment effects within each leaf
This separation ensures CATE estimates are not biased by the tree-growing process.
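A minimal sketch of this sample splitting, using hypothetical data and a fixed split point for brevity (in grf the split is chosen on the structure sample and results are aggregated over many trees):

```python
import random

# Sketch of honest estimation (an illustration, not grf's implementation):
# one half of the data chooses the tree structure, the other half
# supplies the difference-in-means effect reported in each leaf.
random.seed(1)
# Hypothetical rows: (age, treatment, outcome); effect is larger over 65.
rows = [(a, w, float(w * (10 if a > 65 else 1)))
        for a in range(40, 90) for w in (0, 1)]
random.shuffle(rows)
structure, estimation = rows[:len(rows) // 2], rows[len(rows) // 2:]

split = 65  # in practice chosen by maximizing effect separation on `structure`

def leaf_tau(leaf):
    treated = [y for _, w, y in leaf if w == 1]
    control = [y for _, w, y in leaf if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Only estimation-sample rows contribute to the reported leaf effects.
low = leaf_tau([r for r in estimation if r[0] <= split])
high = leaf_tau([r for r in estimation if r[0] > split])
print(low, high)
```

Because the estimation sample played no role in picking the split, the leaf effects are not mechanically inflated by the search over splits.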
Common Confusions
"Is variable importance from a causal forest causal?" No. Variable importance tells you which variables best predict heterogeneity in treatment effects. It does not tell you that those variables cause the heterogeneity. Zip code might have high importance because it proxies for diet, exercise, and healthcare access.
"Do I need an experiment?" Not necessarily, but unconfoundedness is required. Causal forests work with both experimental and observational data, as long as treatment is independent of potential outcomes conditional on X. They are most credible with experimental data.
"How is this different from subgroup regressions?" Subgroup analysis requires pre-specified subgroups. Causal forests discover relevant subgroups from the data and handle complex interactions that would be difficult to specify manually. The trade-off is interpretability. For estimating an average treatment effect with ML-assisted confounding control, see double/debiased machine learning.
"Can I use causal forests for policy targeting?" Yes. If you estimate tau_hat(X_i) for each person i, you can target treatment to those with the largest predicted benefits. This application is called optimal treatment assignment or personalized policy learning.
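A hedged sketch of the simplest targeting rule, assuming hypothetical CATE estimates and a known per-unit treatment cost:

```python
# Policy targeting with estimated CATEs: treat only units whose predicted
# benefit exceeds the per-unit cost. Names and numbers are hypothetical.
tau_hat = {"ann": 0.5, "bob": 4.0, "carol": 2.5, "dan": -1.0}
cost = 2.0

treat = sorted(name for name, tau in tau_hat.items() if tau > cost)
print(treat)  # ['bob', 'carol']
```

Real applications would also account for estimation uncertainty in tau_hat and for budget constraints, but the threshold rule is the core idea.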
B. Identification
The Target Estimand

The conditional average treatment effect: tau(x) = E[Y(1) - Y(0) | X = x].

Identifying Assumptions
- Unconfoundedness: (Y(0), Y(1)) ⊥ W | X
- Overlap: 0 < P(W = 1 | X = x) < 1 for all x
- SUTVA: No interference between units
The Causal Forest Estimator
Wager and Athey (2018) show that the forest prediction averages B honest tree estimates,

tau_hat(x) = (1/B) · Σ_b tau_hat_b(x),

which can equivalently be written as a weighted average over the sample, where the weights are determined by the forest structure. Under regularity conditions:
- Consistency: tau_hat(x) converges in probability to tau(x)
- Asymptotic normality: (tau_hat(x) - tau(x)) / sigma(x) converges in distribution to N(0, 1), enabling pointwise confidence intervals
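Given a CATE estimate and its variance estimate, the pointwise interval follows directly from the normal approximation. The numbers below are hypothetical stand-ins for a forest's predictions:

```python
import math

# Pointwise 95% confidence interval for a CATE estimate, using the
# normal approximation from the asymptotic result above.
tau_hat = 3.1    # hypothetical estimated CATE at a point x
var_hat = 0.64   # hypothetical variance estimate (e.g., infinitesimal jackknife)

se = math.sqrt(var_hat)
ci = (round(tau_hat - 1.96 * se, 3), round(tau_hat + 1.96 * se, 3))
print(ci)  # (1.532, 4.668)
```

Note these intervals are pointwise, not simultaneous: they cover tau(x) at each x separately, not the whole CATE surface at once.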
C. Visual Intuition
Imagine a scatterplot of patients with age on one axis and cholesterol on another. Each point is colored by treatment effect. A causal forest draws a heat map over this space, estimating the treatment effect at every point and discovering regions where effects cluster together.
[Interactive figure: Causal Forest: Discovering Treatment Effect Heterogeneity. Adjusting the true heterogeneity pattern shows that when effects are homogeneous the forest correctly finds no heterogeneity, and when effects vary it discovers the relevant subgroups.]
Why Causal Forests?
DGP: Y = X + tau(X)·D + ε, where tau(X) = 2.0 + 1.5·X. Treatment is randomized. N = 300. True ATE = 2.018.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| OLS (constant effect) | 2.054 | 0.188 | [1.69, 2.42] | +0.037 |
| OLS + interaction | 2.032 | 0.092 | [1.85, 2.21] | +0.014 |
| Causal Forest ATE | 1.970 | 0.077 | [1.82, 2.12] | -0.048 |
| True β | 2.018 | — | — | — |
Why the difference?
The true treatment effect varies with X: tau(x) = 2.0 + 1.5·x. OLS estimates a constant ATE (2.05), missing how the effect changes across subgroups. OLS with a D*X interaction recovers the linear heterogeneity pattern (slope = 1.48 vs truth = 1.5) but assumes a specific functional form. The causal forest provides individual-level CATE estimates (RMSE = 0.71 vs true CATE) without imposing parametric structure, making it ideal for discovering which subgroups benefit most from treatment.
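The DGP above can be reproduced with a short stdlib simulation. This sketch uses differences in means (valid under randomization) rather than a forest, to show how subgroup effects diverge from the overall ATE; the sample size differs from the demo's N = 300 to reduce noise:

```python
import random

# Re-creating the DGP: Y = X + tau(X)*D + eps, tau(X) = 2.0 + 1.5*X,
# with randomized treatment D and standard normal X and eps.
random.seed(42)
data = []
for _ in range(20000):
    x = random.gauss(0, 1)
    d = random.randint(0, 1)  # randomized treatment
    y = x + (2.0 + 1.5 * x) * d + random.gauss(0, 1)
    data.append((x, d, y))

def diff_in_means(rows):
    t = [y for _, d, y in rows if d == 1]
    c = [y for _, d, y in rows if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)

ate = diff_in_means(data)                                # ~2.0 overall
tau_high = diff_in_means([r for r in data if r[0] > 0])  # ~3.2 for X > 0
tau_low = diff_in_means([r for r in data if r[0] <= 0])  # ~0.8 for X <= 0
print(round(ate, 2), round(tau_high, 2), round(tau_low, 2))
```

The overall difference in means recovers the ATE near 2.0 but hides that the effect is roughly four times larger for high-X units than low-X units, which is exactly the structure a causal forest is built to uncover.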
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: Causal forests grow many honest causal trees, each splitting to maximize heterogeneity in treatment effects, using separate estimation samples to avoid overfitting.
Algorithm: Honest Causal Forest
For each tree b = 1, ..., B:
- Draw a random subsample of size s from the n observations
- Split the subsample into a structure sample I and an estimation sample J
- Grow a tree using I: at each node, choose the split that maximizes the variance of estimated treatment effects across the child nodes
- Estimate leaf treatment effects using J
Asymptotic normality (Wager & Athey, 2018):
Under honesty, subsampling (s → ∞ with s/n → 0), and regularity conditions:

(tau_hat(x) - tau(x)) / sigma_hat(x) converges in distribution to N(0, 1)

where sigma_hat(x) uses the infinitesimal jackknife variance estimator.
E. Implementation
library(grf)
# Fit causal forest
cf <- causal_forest(
X = as.matrix(df[, covariate_cols]),
Y = df$outcome,
W = df$treatment,
num.trees = 2000,
honesty = TRUE,
seed = 42
)
# Estimate CATEs
cate <- predict(cf, estimate.variance = TRUE)
df$tau_hat <- cate$predictions
df$tau_se <- sqrt(cate$variance.estimates)
# ATE
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate[1], "(SE:", ate[2], ")\n")
# Variable importance
varimp <- variable_importance(cf)
varimp_vec <- setNames(as.vector(varimp), covariate_cols)
sort(varimp_vec, decreasing = TRUE)[1:10]
# Calibration test for heterogeneity
test_calibration(cf)
# Best linear projection
blp <- best_linear_projection(cf,
A = as.matrix(df[, c("age", "cholesterol")]))
print(blp)

F. Diagnostics
- Calibration test. test_calibration() in grf tests whether estimated CATEs predict actual treatment effect variation. A significant "differential forest prediction" coefficient suggests genuine heterogeneity.
- CATE distribution. Plot the distribution of tau_hat(X_i). Tight concentration around the ATE suggests homogeneous effects. Wide spread suggests substantial heterogeneity.
- Variable importance. Report which covariates the forest uses most for splitting. Remember: predictive importance, not causal.
- Best linear projection. Project CATEs onto key covariates for interpretable coefficients.
- Overlap check. Verify propensity scores are bounded away from 0 and 1.
- Out-of-bag predictions. The grf package uses OOB predictions by default for honest estimation.
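The overlap check can be as simple as flagging units whose estimated propensity score sits near 0 or 1 (in grf, estimated propensities are available as cf$W.hat). The scores and thresholds below are illustrative:

```python
# A minimal overlap (positivity) check: flag units whose estimated
# propensity score falls outside a trimming band. Scores are hypothetical
# stand-ins for a fitted model's output.
p_hat = [0.02, 0.15, 0.48, 0.51, 0.88, 0.99]
lo, hi = 0.05, 0.95

flagged = [p for p in p_hat if not lo <= p <= hi]
print(flagged)  # [0.02, 0.99]
```

Units in the flagged set have essentially no comparable counterparts in the other treatment arm, so their CATE estimates rest on extrapolation; common practice is to trim or report results with and without them.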
Interpreting Your Results
Significant heterogeneity detected: Report the CATE distribution, top variables by importance, and best linear projection. Discuss who benefits most and policy implications.
No significant heterogeneity: This null finding is informative. The ATE is a good summary for everyone.
Variable importance caveat: "Age has the highest variable importance" means age best predicts which people have larger or smaller effects. It does NOT mean age causes the difference. State this caveat explicitly when reporting results.
G. What Can Go Wrong
Interpreting Variable Importance as Causal Moderation
Report variable importance as predictive importance for heterogeneity, explicitly noting it does not imply causal moderation. Use best linear projection (BLP) to provide interpretable coefficients and discuss plausible causal mechanisms.
Variable importance: zip code (0.18), age (0.12), baseline cholesterol (0.10). BLP shows age coefficient of -0.05 (SE = 0.02), suggesting older patients benefit slightly less. The researcher discusses that age may proxy for comorbidities, not that age causally moderates the treatment effect.
Small Sample Spurious Heterogeneity
With n = 5,000 from an RCT, fit a causal forest and run the calibration test. If the calibration test is non-significant, conclude that the ATE is a good summary for all individuals.
ATE: 3.2 pp (SE = 0.4). Calibration test p = 0.35. CATE distribution is tightly concentrated around the ATE. Conclusion: no detectable heterogeneity — the ATE applies broadly.
Turning Off Honesty Invalidates Confidence Intervals
Set honesty = TRUE (the default in grf) so that tree structure is determined on one subsample and treatment effects are estimated on a separate subsample.
Honest causal forest: ATE = 3.2 (SE = 0.4). Pointwise 95% CIs for individual CATEs achieve 94.2% coverage in simulations. The separation of structure and estimation samples prevents overfitting.
H. Practice
You fit a causal forest to RCT data (n = 5,000). ATE is 3.2 pp (p < 0.01). Calibration test p = 0.35. Top variable importance: age (0.08), education (0.06). What should you conclude?
Causal Forests: Targeting a Financial Literacy Intervention
A fintech company runs an RCT testing whether a personalized financial literacy course (T) improves credit scores (Y) among 12,000 customers. After fitting a causal forest, the estimated CATEs range from near-zero for some customers to +35 points for others. The calibration test is significant (p = 0.001), and variable importance shows that baseline credit score (0.31), age (0.18), and income volatility (0.15) are the top predictors of CATE.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
The authors use a causal forest on an RCT (n = 20,000) of a job training program. They find substantial CATE heterogeneity: bottom quartile has -$200, top quartile has +$3,500. Top variable importance: age (0.15), education (0.12), prior earnings (0.11). They recommend targeting the program to the top two CATE quartiles.
Key Table
| Quartile | CATE | 95% CI |
|---|---|---|
| Q1 (lowest) | -$200 | [-$800, $400] |
| Q2 | $500 | [-$100, $1,100] |
| Q3 | $1,800 | [$1,000, $2,600] |
| Q4 (highest) | $3,500 | [$2,500, $4,500] |
Calibration test p-value: 0.002
Variable importance: age (0.15), education (0.12), prior_earnings (0.11)
Authors' Identification Claim
The RCT ensures unconfoundedness. The causal forest discovers that treatment effects vary substantially, with older, less-educated workers with low prior earnings benefiting most.
I. Swap-In: When to Use Something Else
- OLS with interactions: When the dimensions of heterogeneity are known a priori and can be specified parametrically — simpler and more interpretable when theory provides clear subgroup hypotheses.
- Pre-specified subgroup analysis: When the research question concerns a small number of pre-specified groups (e.g., by gender, age bracket) rather than a continuous heterogeneity surface.
- DML: When the goal is an average treatment effect with high-dimensional confounders, rather than treatment effect heterogeneity.
- Bayesian Additive Regression Trees (BART): An alternative ML-based approach to heterogeneous treatment effects with built-in uncertainty quantification and a different regularization philosophy.
J. Reviewer Checklist
Paper Library
Foundational (4)
Athey, S., & Imbens, G. W. (2016). Recursive Partitioning for Heterogeneous Causal Effects.
Athey and Imbens introduced causal trees, adapting the CART algorithm to estimate heterogeneous treatment effects with valid inference. They proposed the honest estimation approach, where one subsample is used for tree construction and another for estimation, ensuring valid confidence intervals.
Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.
Wager and Athey developed causal forests by extending random forests to estimate conditional average treatment effects. They proved pointwise consistency and asymptotic normality under regularity conditions, enabling valid confidence intervals for individualized treatment effect estimates.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests.
This paper introduced the generalized random forest (GRF) framework, which extends causal forests to a broad class of estimating equations including quantile regression, IV, and local average treatment effects. GRF provides the theoretical foundation and the widely used grf R package.
Nie, X., & Wager, S. (2021). Quasi-Oracle Estimation of Heterogeneous Treatment Effects.
Nie and Wager proposed the R-learner, a two-step approach for estimating heterogeneous treatment effects that first residualizes outcomes and treatment on covariates, then estimates the CATE by regressing outcome residuals on treatment residuals. This approach can use any machine learning method including causal forests.
Application (5)
Kunzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.
Kunzel and colleagues proposed the X-learner meta-algorithm for estimating CATEs and systematically compared it with T-learners and S-learners. The paper provides practical guidance on when different meta-learning strategies, including those based on causal forests, perform well or poorly.
Davis, J., & Heller, S. B. (2017). Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs.
Davis and Heller applied causal forests to a randomized summer jobs program for disadvantaged youth in Chicago, demonstrating how the method can identify which subpopulations benefit most from a policy intervention. This paper is an accessible applied introduction to causal forests.
Brand, J. E., Xu, J., Koch, B., & Gerber, R. (2021). Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning.
Brand and colleagues provided a practical guide to using causal trees and forests in social science research. They discussed honest estimation, variable importance for understanding which covariates drive heterogeneity, and applied the methods to study heterogeneous returns to college education.
Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal Random Forest for Causal Inference.
Oprescu, Syrgkanis, and Wu combined orthogonal moment conditions from DML with random forests, creating orthogonal random forests that are robust to estimation of nuisance components. This approach bridges the DML and causal forest literatures and is implemented in Microsoft's EconML package.
Choudhury, P., Allen, R. T., & Endres, M. G. (2021). Machine Learning for Pattern Discovery in Management Research.
Choudhury, Allen, and Endres discussed how machine learning methods including causal forests can be used for pattern discovery in management research. They provided guidance on when tree-based methods for heterogeneous treatment effects are appropriate for strategy and organizational questions.