MethodAtlas
ML + Causal · Frontier

Causal Forests / Heterogeneous Treatment Effects

Estimates how treatment effects vary across individuals — who benefits most and who benefits least.

Quick Reference

When to Use
When you want to estimate treatment effect heterogeneity across subgroups without pre-specifying which subgroups matter — discovering who benefits most and who benefits least.
Key Assumption
Unconfoundedness (conditional on observables) plus honest estimation (separate training and estimation samples via sample splitting). Overlap (positivity) is also required.
Common Mistake
Interpreting variable importance from the forest as causal moderation — it reflects which variables predict heterogeneity in treatment effects, not which variables cause that heterogeneity.
Estimated Time
3 hours

One-Line Implementation

Stata: * No native Stata package — use rcall: library(grf); cf <- causal_forest(X, Y, W)
R: causal_forest(X, Y, W, num.trees = 2000, honesty = TRUE)
Python: CausalForestDML(model_y='auto', model_t='auto', n_estimators=2000).fit(Y, T, X=X)

Download Full Analysis Code

Complete scripts with diagnostics, robustness checks, and result export.

Motivating Example

A hospital conducts a randomized trial of a new drug for heart disease. The trial shows that, on average, the drug reduces the risk of a heart attack by 5 percentage points. But the hospital administrator asks: which patients benefit most?

Perhaps the drug works well for patients with high cholesterol but provides little benefit for those with low cholesterol. Perhaps it is more effective for older patients. Perhaps there is a complex interaction between age, cholesterol, blood pressure, and diabetes status.

Traditional subgroup analysis requires you to pre-specify: "Let me check patients above vs. below age 65." But with 50 potential effect modifiers, this approach runs headlong into the multiple testing problem. And if the analysis is exploratory rather than pre-specified, it is likely to turn up "significant" heterogeneity even when there is none.
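A back-of-the-envelope calculation makes the problem concrete. Assuming 50 independent tests each at α = 0.05 (a simplification; real effect modifiers are correlated), the chance of at least one spurious "significant" subgroup is:

```python
# Probability of at least one false positive among k independent
# subgroup tests, each run at significance level alpha.
def prob_any_false_positive(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(prob_any_false_positive(50), 2))  # -> 0.92
```

So even with no true heterogeneity anywhere, unadjusted testing of 50 modifiers produces a "significant" subgroup over 90% of the time.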

Causal forests solve this by using the random forest algorithm, modified to target treatment effect estimation rather than outcome prediction. They discover which covariates drive heterogeneity without requiring you to pre-specify the subgroups.

(Athey & Imbens, 2016; Wager & Athey, 2018)

A. Overview

From Prediction to Causal Estimation

A standard random forest predicts the average outcome within each leaf, much like OLS predicts conditional means. A causal forest instead estimates the treatment effect within each leaf. The forest's output is a Conditional Average Treatment Effect (CATE):

\hat{\tau}(x) = \hat{E}[Y(1) - Y(0) \mid X = x]

This formula tells you: for a person with characteristics x, what is the expected effect of treatment?

How Causal Trees Differ from Regular Trees

Regular trees split on variables that best predict Y. Causal trees split on variables that best separate treatment effects. At each node, the split is chosen to maximize the variance of treatment effects across the two child nodes. If splitting on "age > 65" produces groups with 10% and 1% treatment effects, that separation is a good split. If splitting on "gender" produces two groups with 5% effects each, it is uninformative.
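The splitting rule can be sketched numerically. The toy below (an illustration, not the grf implementation) estimates each child's treatment effect by a difference in means and scores a candidate split with n_L n_R / (n_L + n_R) · (τ̂_L − τ̂_R)²:

```python
import numpy as np

def diff_in_means(y, w):
    # Within-node treatment effect estimate: mean(treated) - mean(control)
    return y[w == 1].mean() - y[w == 0].mean()

def split_gain(y, w, left):
    # Causal-tree criterion: n_L * n_R / (n_L + n_R) * (tau_L - tau_R)^2
    n_l, n_r = left.sum(), (~left).sum()
    tau_l = diff_in_means(y[left], w[left])
    tau_r = diff_in_means(y[~left], w[~left])
    return n_l * n_r / (n_l + n_r) * (tau_l - tau_r) ** 2

# Toy data: the true effect is 10 for age > 65 and 1 otherwise
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(40, 90, n)
w = rng.integers(0, 2, n)
y = np.where(age > 65, 10.0, 1.0) * w + rng.normal(0, 1, n)

# The split that separates effects dominates an uninformative random split
print(split_gain(y, w, age > 65) > split_gain(y, w, rng.random(n) > 0.5))
```

The informative split produces a gain orders of magnitude larger than an arbitrary split, which is exactly the signal the tree-growing algorithm chases.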

Honest Estimation

A key innovation is honest estimation. In a standard random forest, the same data determine tree structure and estimate predictions within leaves. This dual use creates overfitting.

In an honest causal forest, data are split:

  • Structure sample: Determines where to split
  • Estimation sample: Estimates treatment effects within each leaf

This separation ensures CATE estimates are not biased by the tree-growing process.
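A small Monte Carlo (an illustrative sketch, not the grf algorithm) shows why this matters. The treatment effect below is truly homogeneous, yet choosing the split that maximizes the apparent effect gap and then estimating on the same data inflates that gap; a held-out estimation sample does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_rep(n=400):
    # Homogeneous DGP: the true treatment effect is 2 for everyone
    x = rng.normal(size=n)
    w = rng.integers(0, 2, n)
    y = 2.0 * w + rng.normal(size=n)

    # Honest split of the subsample into structure and estimation halves
    idx = rng.permutation(n)
    struct, est = idx[: n // 2], idx[n // 2:]

    def tau(ix):
        return y[ix][w[ix] == 1].mean() - y[ix][w[ix] == 0].mean()

    def gap(ix, c):  # |effect difference| between children of split x <= c
        return abs(tau(ix[x[ix] <= c]) - tau(ix[x[ix] > c]))

    # Choose the threshold that looks best on the structure sample
    cands = np.quantile(x[struct], np.linspace(0.2, 0.8, 13))
    best = max(cands, key=lambda c: gap(struct, c))

    # Adaptive estimate reuses the structure sample; honest uses held-out data
    return gap(struct, best), gap(est, best)

adaptive, honest = map(np.mean, zip(*(one_rep() for _ in range(200))))
print(adaptive > honest)  # adaptive leaf estimates overstate heterogeneity
```

Averaged over replications, the adaptive gap is inflated by the selection step even though no heterogeneity exists, which is precisely the overfitting that honesty prevents.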

Common Confusions

"Is variable importance from a causal forest causal?" No. Variable importance tells you which variables best predict heterogeneity in treatment effects. It does not tell you that those variables cause the heterogeneity. Zip code might have high importance because it proxies for diet, exercise, and healthcare access.

"Do I need an experiment?" Not necessarily, but unconfoundedness is required. Causal forests work with both experimental and observational data, as long as treatment is independent of potential outcomes conditional on X. They are most credible with experimental data.

"How is this different from subgroup regressions?" Subgroup analysis requires pre-specified subgroups. Causal forests discover relevant subgroups from the data and handle complex interactions that would be difficult to specify manually. The trade-off is interpretability. For estimating an average treatment effect with ML-assisted confounding control, see double/debiased machine learning.

"Can I use causal forests for policy targeting?" Yes. If you estimate \hat{\tau}(x) for each person, you can target treatment to those with the largest predicted benefits. This application is called optimal treatment assignment or personalized policy learning.
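A minimal sketch of that targeting logic, using made-up CATE estimates and a budget of three treatment slots:

```python
import numpy as np

# Hypothetical CATE estimates for 10 individuals (illustrative numbers only)
tau_hat = np.array([0.5, -0.2, 3.1, 0.0, 1.7, 0.9, -1.0, 2.2, 0.3, 0.1])
budget = 3

# Treat the individuals with the largest predicted benefit
treat = np.argsort(tau_hat)[::-1][:budget]
print(sorted(treat.tolist()))  # -> [2, 4, 7]
```

In practice a targeting rule should also account for estimation uncertainty in \hat{\tau}(x), not just the point estimates.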

B. Identification

The Target Estimand

\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]

Identifying Assumptions

  1. Unconfoundedness: (Y(0), Y(1)) \perp\!\!\!\perp W \mid X
  2. Overlap: 0 < P(W = 1 \mid X = x) < 1 for all x
  3. SUTVA: No interference between units

The Causal Forest Estimator

Wager and Athey (2018) show that:

\hat{\tau}(x) = \sum_{i=1}^{n} \alpha_i(x) \cdot Y_i

where the weights \alpha_i(x) are determined by the forest structure. Under regularity conditions:

  • Consistency: \hat{\tau}(x) \to \tau(x)
  • Asymptotic normality: (\hat{\tau}(x) - \tau(x)) / \hat{\sigma}(x) \to N(0, 1), enabling pointwise confidence intervals

C. Visual Intuition

Imagine a scatterplot of patients with age on one axis and cholesterol on another. Each point is colored by treatment effect. A causal forest draws a heat map over this space, estimating the treatment effect at every point and discovering regions where effects cluster together.

Interactive Simulation

Causal Forest: Discovering Treatment Effect Heterogeneity

Adjust the true heterogeneity pattern. When effects are homogeneous, the forest correctly finds no heterogeneity. When effects vary, it discovers relevant subgroups.

Interactive Simulation

Why Causal Forests?

DGP: Y = X + tau(X)·D + ε, where tau(X) = 2.0 + 1.5·X. Treatment is randomized. N = 300. True ATE = 2.018.

[Figure: left panel shows the CATE distribution; right panel plots the CATE against covariate X, comparing the Causal Forest, OLS (constant), OLS + interaction, and the true CATE.]

Estimation Results

Estimator                 β̂       SE      95% CI         Bias
OLS (constant effect)     2.054    0.188   [1.69, 2.42]   +0.037
OLS + interaction         2.032    0.092   [1.85, 2.21]   +0.014 (closest to truth)
Causal Forest ATE         1.970    0.077   [1.82, 2.12]   -0.048
True β                    2.018
Simulation parameters: 300 observations; constant component of CATE = 2.0; heterogeneity slope = 1.5 (how much the treatment effect varies with X; 0 = constant).

Why the difference?

The true treatment effect varies with X: tau(x) = 2.0 + 1.5·x. OLS estimates a constant ATE (2.05), missing how the effect changes across subgroups. OLS with a D*X interaction recovers the linear heterogeneity pattern (slope = 1.48 vs truth = 1.5) but assumes a specific functional form. The causal forest provides individual-level CATE estimates (RMSE = 0.71 vs true CATE) without imposing parametric structure, making it ideal for discovering which subgroups benefit most from treatment.
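The claim about the interaction regression is easy to verify directly. A sketch of the same DGP (with a larger sample than the simulation's N = 300 so the coefficients are estimated precisely):

```python
import numpy as np

# DGP from the simulation: Y = X + (2.0 + 1.5*X)*D + noise, D randomized
rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
d = rng.integers(0, 2, n).astype(float)
y = x + (2.0 + 1.5 * x) * d + rng.normal(size=n)

# OLS with a D*X interaction: columns [1, X, D, D*X]
design = np.column_stack([np.ones(n), x, d, d * x])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
print(round(beta[2], 1), round(beta[3], 1))  # coefficients on D and D*X
```

The interaction model recovers the heterogeneity here only because the DGP is linear in X; the forest's advantage appears when \tau(x) is nonlinear or involves interactions you did not think to specify.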

D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: Causal forests grow many honest causal trees, each splitting to maximize heterogeneity in treatment effects, using separate estimation samples to avoid overfitting.

Algorithm: Honest Causal Forest

For each tree b = 1, \ldots, B:

  1. Draw a random subsample \mathcal{S}_b of size s < n
  2. Split \mathcal{S}_b into structure sample \mathcal{J}_b and estimation sample \mathcal{I}_b
  3. Grow a tree using \mathcal{J}_b: at each node, maximize:
\Delta(C_L, C_R) = \frac{n_L n_R}{n_L + n_R} \cdot (\hat{\tau}_{C_L} - \hat{\tau}_{C_R})^2
  4. Estimate leaf treatment effects using \mathcal{I}_b

Asymptotic normality (Wager & Athey, 2018):

Under honesty, subsampling (s/n \to 0, s \to \infty), and regularity conditions:

\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0, 1)

where σ^(x)\hat{\sigma}(x) uses the infinitesimal jackknife variance estimator.

E. Implementation

library(grf)

# Fit causal forest
cf <- causal_forest(
  X = as.matrix(df[, covariate_cols]),
  Y = df$outcome,
  W = df$treatment,
  num.trees = 2000,
  honesty = TRUE,
  seed = 42
)

# Estimate CATEs
cate <- predict(cf, estimate.variance = TRUE)
df$tau_hat <- cate$predictions
df$tau_se <- sqrt(cate$variance.estimates)

# ATE
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate[1], "(SE:", ate[2], ")\n")

# Variable importance
varimp <- variable_importance(cf)
varimp_vec <- setNames(as.vector(varimp), covariate_cols)
sort(varimp_vec, decreasing = TRUE)[1:10]

# Calibration test for heterogeneity
test_calibration(cf)

# Best linear projection
blp <- best_linear_projection(cf,
  A = as.matrix(df[, c("age", "cholesterol")]))
print(blp)

Requires: grf

F. Diagnostics

  1. Calibration test. test_calibration() in grf tests whether estimated CATEs predict actual treatment effect variation. A significant "differential forest prediction" coefficient suggests genuine heterogeneity.

  2. CATE distribution. Plot \hat{\tau}(x). Tight concentration around the ATE suggests homogeneous effects. Wide spread suggests substantial heterogeneity.

  3. Variable importance. Report which covariates the forest uses most for splitting. Remember: predictive importance, not causal.

  4. Best linear projection. Project CATEs onto key covariates for interpretable coefficients.

  5. Overlap check. Verify propensity scores are bounded away from 0 and 1.

  6. Out-of-bag predictions. The grf package uses OOB predictions by default for honest estimation.
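The overlap check in step 5 can be sketched as follows. The propensity estimates e_hat here are placeholders; in practice you would take them from the fitted forest (grf stores them as cf$W.hat):

```python
import numpy as np

def overlap_violation_share(e_hat, eps=0.05):
    # Share of units with estimated propensity outside [eps, 1 - eps]
    e_hat = np.asarray(e_hat)
    return float(np.mean((e_hat < eps) | (e_hat > 1 - eps)))

# Placeholder propensity scores drawn from a well-behaved distribution
rng = np.random.default_rng(0)
e_hat = rng.beta(2, 2, size=1000)
print(overlap_violation_share(e_hat) < 0.10)  # good overlap in this sketch
```

A large violation share signals that CATEs in the affected region rest on extrapolation rather than data.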

Interpreting Your Results

Significant heterogeneity detected: Report the CATE distribution, top variables by importance, and best linear projection. Discuss who benefits most and policy implications.

No significant heterogeneity: This null finding is informative. The ATE is a good summary for everyone.

Variable importance caveat: "Age has the highest variable importance" means age best predicts which people have larger or smaller effects. It does NOT mean age causes the difference. State this distinction explicitly when reporting results.

G. What Can Go Wrong

Assumption Failure Demo

Interpreting Variable Importance as Causal Moderation

Report variable importance as predictive importance for heterogeneity, explicitly noting it does not imply causal moderation. Use best linear projection (BLP) to provide interpretable coefficients and discuss plausible causal mechanisms.

Variable importance: zip code (0.18), age (0.12), baseline cholesterol (0.10). BLP shows age coefficient of -0.05 (SE = 0.02), suggesting older patients benefit slightly less. The researcher discusses that age may proxy for comorbidities, not that age causally moderates the treatment effect.

Assumption Failure Demo

Small Sample Spurious Heterogeneity

With n = 5,000 from an RCT, fit a causal forest and run the calibration test. If the calibration test is non-significant, conclude that there is no detectable heterogeneity and that the ATE is a reasonable summary, while noting that the test may be underpowered to detect modest heterogeneity.

ATE: 3.2 pp (SE = 0.4). Calibration test p = 0.35. CATE distribution is tightly concentrated around the ATE. Conclusion: no detectable heterogeneity — the ATE applies broadly.

Assumption Failure Demo

Turning Off Honesty Invalidates Confidence Intervals

Set honesty = TRUE (the default in grf) so that tree structure is determined on one subsample and treatment effects are estimated on a separate subsample.

Honest causal forest: ATE = 3.2 (SE = 0.4). Pointwise 95% CIs for individual CATEs achieve 94.2% coverage in simulations. The separation of structure and estimation samples prevents overfitting.

H. Practice

Concept Check

You fit a causal forest to RCT data (n = 5,000). ATE is 3.2 pp (p < 0.01). Calibration test p = 0.35. Top variable importance: age (0.08), education (0.06). What should you conclude?

Guided Exercise

Causal Forests: Targeting a Financial Literacy Intervention

A fintech company runs an RCT testing whether a personalized financial literacy course (T) improves credit scores (Y) among 12,000 customers. After fitting a causal forest, the estimated CATEs range from near-zero for some customers to +35 points for others. The calibration test is significant (p = 0.001), and variable importance shows that baseline credit score (0.31), age (0.18), and income volatility (0.15) are the top predictors of CATE.

What does a significant calibration test tell you?

What does 'honest splitting' mean in causal forests, and why is it important?

Variable importance shows baseline credit score = 0.31. What does this mean, and what does it NOT mean?

How would you use CATE estimates to decide which customers to target with the course, given a fixed budget?

Error Detective

Read the analysis below carefully and identify the errors.

A health services researcher uses a causal forest to study which patients benefit most from a new diabetes drug in an RCT (n = 8,000). She reports: "The causal forest reveals substantial treatment effect heterogeneity (calibration test p = 0.01). Variable importance: BMI (0.22), age (0.15), HbA1c baseline (0.14). We recommend targeting the drug to patients with BMI > 30 and HbA1c > 8%, as these patients show the largest CATEs (mean CATE = 1.8 pp vs. 0.3 pp for others)." She writes: "High BMI causally moderates the drug's effect because obese patients have more insulin resistance, which the drug specifically addresses."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

An education researcher uses a causal forest with observational data (n = 12,000) to estimate heterogeneous effects of school vouchers on test scores. She propensity-score adjusts for 20 covariates including family income, parental education, and prior test scores. She reports: "The causal forest finds that low-income students benefit most (CATE = 0.4 SD) while high-income students show zero effect (CATE = 0.01 SD). This supports targeting vouchers to low-income families." She uses honesty = FALSE because "the sample within income subgroups is small and we need all data for precision."

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors use a causal forest on an RCT (n = 20,000) of a job training program. They find substantial CATE heterogeneity: the bottom quartile has -$200, the top quartile +$3,500. Top variable importance: age (0.15), education (0.12), prior earnings (0.11). They recommend targeting the program to the top two CATE quartiles.

Key Table

Quartile       CATE      95% CI
Q1 (lowest)    -$200     [-$800, $400]
Q2             $500      [-$100, $1,100]
Q3             $1,800    [$1,000, $2,600]
Q4 (highest)   $3,500    [$2,500, $4,500]
Calibration test p-value: 0.002
Variable importance: age (0.15), education (0.12), prior_earnings (0.11)

Authors' Identification Claim

The RCT ensures unconfoundedness. The causal forest discovers that treatment effects vary substantially, with older, less-educated workers with low prior earnings benefiting most.

I. Swap-In: When to Use Something Else

  • OLS with interactions: When the dimensions of heterogeneity are known a priori and can be specified parametrically — simpler and more interpretable when theory provides clear subgroup hypotheses.
  • Pre-specified subgroup analysis: When the research question concerns a small number of pre-specified groups (e.g., by gender, age bracket) rather than a continuous heterogeneity surface.
  • DML: When the goal is an average treatment effect with high-dimensional confounders, rather than treatment effect heterogeneity.
  • Bayesian Additive Regression Trees (BART): An alternative ML-based approach to heterogeneous treatment effects with built-in uncertainty quantification and a different regularization philosophy.

J. Reviewer Checklist

Critical Reading Checklist


Paper Library

Foundational (4)

Athey, S., & Imbens, G. W. (2016). Recursive Partitioning for Heterogeneous Causal Effects.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1510489113

Athey and Imbens introduced causal trees, adapting the CART algorithm to estimate heterogeneous treatment effects with valid inference. They proposed the honest estimation approach, where one subsample is used for tree construction and another for estimation, ensuring valid confidence intervals.

Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.

Journal of the American Statistical Association. DOI: 10.1080/01621459.2017.1319839

Wager and Athey developed causal forests by extending random forests to estimate conditional average treatment effects. They proved pointwise consistency and asymptotic normality under regularity conditions, enabling valid confidence intervals for individualized treatment effect estimates.

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests.

Annals of Statistics. DOI: 10.1214/18-AOS1709

This paper introduced the generalized random forest (GRF) framework, which extends causal forests to a broad class of estimating equations including quantile regression, IV, and local average treatment effects. GRF provides the theoretical foundation and the widely used grf R package.

Nie, X., & Wager, S. (2021). Quasi-Oracle Estimation of Heterogeneous Treatment Effects.

Biometrika

Nie and Wager proposed the R-learner, a two-step approach for estimating heterogeneous treatment effects that first residualizes outcomes and treatment on covariates, then estimates the CATE by regressing outcome residuals on treatment residuals. This approach can use any machine learning method including causal forests.

Application (5)

Kunzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1804597116

Kunzel and colleagues proposed the X-learner meta-algorithm for estimating CATEs and systematically compared it with T-learners and S-learners. The paper provides practical guidance on when different meta-learning strategies, including those based on causal forests, perform well or poorly.

Davis, J., & Heller, S. B. (2017). Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs.

American Economic Review: Papers & Proceedings. DOI: 10.1257/aer.p20171000

Davis and Heller applied causal forests to a randomized summer jobs program for disadvantaged youth in Chicago, demonstrating how the method can identify which subpopulations benefit most from a policy intervention. This paper is an accessible applied introduction to causal forests.

Brand, J. E., Xu, J., Koch, B., & Gerber, R. (2021). Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning.

Sociological Methodology. DOI: 10.1177/0081175021993503

Brand and colleagues provided a practical guide to using causal trees and forests in social science research. They discussed honest estimation, variable importance for understanding which covariates drive heterogeneity, and applied the methods to study heterogeneous returns to college education.

Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal Random Forest for Causal Inference.

Proceedings of the 36th International Conference on Machine Learning

Oprescu, Syrgkanis, and Wu combined orthogonal moment conditions from DML with random forests, creating orthogonal random forests that are robust to estimation of nuisance components. This approach bridges the DML and causal forest literatures and is implemented in Microsoft's EconML package.

Choudhury, P., Allen, R. T., & Endres, M. G. (2021). Machine Learning for Pattern Discovery in Management Research.

Strategic Management Journal. DOI: 10.1002/smj.3215

Choudhury, Allen, and Endres discussed how machine learning methods including causal forests can be used for pattern discovery in management research. They provided guidance on when tree-based methods for heterogeneous treatment effects are appropriate for strategy and organizational questions.

Tags

ml-causal · heterogeneous-effects · frontier