
Causal Forests / Heterogeneous Treatment Effects

Estimates how treatment effects vary across individuals — who benefits most and who benefits least.

When to Use: When you want to estimate treatment effect heterogeneity across subgroups without pre-specifying which subgroups matter — discovering who benefits most and who benefits least.

Key Assumptions: Unconfoundedness (conditional on observables) plus honest estimation (separate training and estimation samples via sample splitting). Overlap (positivity) is also required.

Common Mistake: Interpreting variable importance from the forest as causal moderation — it reflects predictive importance for heterogeneity, not causal importance of those variables.

One-Line Implementation

R: causal_forest(X, Y, W, num.trees = 2000, honesty = TRUE)
Stata: No native Stata package; use rcall: rcall: library(grf); cf <- causal_forest(X, Y, W)
Python: CausalForestDML(model_y='auto', model_t='auto', n_estimators=2000).fit(Y, T, X=X)


Motivating Example: Who Benefits from the Drug?

A hospital conducts a randomized trial of a new drug for heart disease. The trial shows that, on average, the drug reduces the risk of a heart attack by 5 percentage points. But the hospital administrator asks: which patients benefit most?

Perhaps the drug works well for patients with high cholesterol but provides little benefit for those with low cholesterol. Perhaps it is more effective for older patients. Perhaps there is a complex interaction between age, cholesterol, blood pressure, and diabetes status.

Traditional subgroup analysis requires you to pre-specify: "Let me check patients above vs. below age 65." But with 50 potential effect modifiers, this approach encounters the multiple testing problem directly. And if the analysis is exploratory rather than pre-specified, it is likely to produce "significant" heterogeneity even when there is none.

Causal forests address this challenge by using the random forest algorithm, modified to target treatment effect estimation rather than outcome prediction (Athey & Imbens, 2016; Wager & Athey, 2018). They discover which covariates drive heterogeneity without requiring you to pre-specify the subgroups.


A. Overview

From Prediction to Causal Estimation

A standard random forest predicts the average outcome within each leaf, much like OLS predicts conditional means. A causal forest instead estimates the treatment effect within each leaf, enabling estimation of effect heterogeneity. The forest's output is a conditional average treatment effect (CATE):

$$\hat{\tau}(x) = \hat{E}[Y(1) - Y(0) \mid X = x]$$

This formula tells you: for a person with characteristics $x$, what is the expected effect of treatment?

How Causal Trees Differ from Regular Trees

Regular trees split on variables that best predict $Y$. Causal trees split on variables that best separate treatment effects. At each node, the split is chosen to maximize the variance of treatment effects across the two child nodes. If splitting on "age > 65" produces groups with 10% and 1% treatment effects, that separation is a good split. If splitting on "gender" produces two groups with 5% effects each, it is uninformative.
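A toy sketch of this splitting criterion in pure Python (not the grf implementation): each candidate child's treatment effect is a difference in means, and the split score is the sample-size-weighted squared gap between the two children's effects.

```python
# Toy sketch: score a candidate split by how well it separates treatment
# effects, following Delta(C_L, C_R) = n_L*n_R/(n_L+n_R) * (tau_L - tau_R)^2.

def tau_hat(ys, ws):
    """Difference-in-means treatment effect within one group."""
    y1 = [y for y, w in zip(ys, ws) if w == 1]
    y0 = [y for y, w in zip(ys, ws) if w == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def split_score(yL, wL, yR, wR):
    """Weighted squared difference in child treatment effects."""
    nL, nR = len(yL), len(yR)
    return nL * nR / (nL + nR) * (tau_hat(yL, wL) - tau_hat(yR, wR)) ** 2

# Left child: large effect (treated outcomes ~10 higher); right child: small effect
yL, wL = [10, 11, 0, 1], [1, 1, 0, 0]   # tau_L = 10.5 - 0.5 = 10
yR, wR = [5, 6, 5, 6], [1, 0, 1, 0]     # tau_R = 5 - 6 = -1
print(split_score(yL, wL, yR, wR))       # -> 242.0, a high score: good split
```

A split that yields identical child effects would score zero, so the tree never spends a split on it.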

Honest Estimation

A key innovation is honest estimation. In a standard random forest, the same data determine tree structure and estimate predictions within leaves. This dual use creates overfitting.

In an honest causal forest, data are split:

  • Structure sample: Determines where to split
  • Estimation sample: Estimates treatment effects within each leaf

This separation ensures CATE estimates are not biased by the tree-growing process.
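A minimal sketch of the sample split, assuming a 50/50 division (grf's default honesty fraction is 0.5); the key property is that no observation appears in both samples:

```python
# Honest sample splitting sketch: the same observation never informs both
# where to split (structure) and what effect to report (estimation).
import random

def honest_split(indices, seed=0):
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (structure sample, estimation sample)

structure, estimation = honest_split(range(100))
print(set(structure).isdisjoint(estimation))  # True: the samples do not overlap
```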

Common Confusions

"Is variable importance from a causal forest causal?" No. Variable importance tells you which variables best predict heterogeneity in treatment effects. It does not tell you that those variables cause the heterogeneity. Zip code might have high importance because it proxies for diet, exercise, and healthcare access.

"Do I need an experiment?" Not necessarily, but unconfoundedness is required. Causal forests work with both experimental and observational data, as long as treatment is independent of potential outcomes conditional on $X$. They are most credible with experimental data.

"How is this approach different from subgroup regressions?" Subgroup analysis requires pre-specified subgroups. Causal forests discover relevant subgroups from the data and handle complex interactions that would be difficult to specify manually. The trade-off is interpretability. For estimating an average treatment effect with ML-assisted confounding control, see double/debiased machine learning.

"Can I use causal forests for policy targeting?" Yes. If you estimate $\hat{\tau}(x)$ for each person, you can target treatment to those with the largest predicted benefits. This application is called optimal treatment assignment or personalized policy learning.
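A minimal sketch of budget-constrained targeting, using hypothetical CATE estimates: rank people by predicted benefit and treat the top of the ranking.

```python
# Greedy CATE-based targeting sketch: treat the `budget` people with the
# largest predicted benefit. The tau_hat values below are hypothetical.

def target_by_cate(tau_hat, budget):
    """Indices of the `budget` units with the largest predicted effects."""
    ranked = sorted(range(len(tau_hat)), key=lambda i: tau_hat[i], reverse=True)
    return sorted(ranked[:budget])

tau_hat = [0.01, 0.12, -0.02, 0.08, 0.05]
print(target_by_cate(tau_hat, 2))  # -> [1, 3]
```

In practice the ranking should account for estimation uncertainty in the CATEs; naive ranking over-selects units with noisy extreme estimates.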


B. Identification

The Target Estimand

$$\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$$

Identifying Assumptions

  1. Unconfoundedness: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$
  2. Overlap: $0 < P(D = 1 \mid X = x) < 1$ for all $x$
  3. SUTVA: No interference between units

The Causal Forest Estimator

Wager and Athey (2018) show that the causal forest estimate can be written as a weighted average of outcomes:

$$\hat{\tau}(x) = \sum_{i=1}^{n} \alpha_i(x) \cdot Y_i$$

where the weights $\alpha_i(x)$ are determined by the forest structure. Under regularity conditions:

  • Consistency: $\hat{\tau}(x) \to \tau(x)$
  • Asymptotic normality: $(\hat{\tau}(x) - \tau(x)) / \hat{\sigma}(x) \to N(0, 1)$, enabling pointwise confidence intervals

C. Visual Intuition

Imagine a scatterplot of patients with age on one axis and cholesterol on another. Each point is colored by treatment effect. A causal forest draws a heat map over this space, estimating the treatment effect at every point and discovering regions where effects cluster together.

Targeting treatment to high-CATE subgroups raises the average effect per treated individual.


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: Causal forests grow many honest causal trees, each splitting to maximize heterogeneity in treatment effects, using separate estimation samples to avoid overfitting.

Algorithm: Honest Causal Forest

For each tree $b = 1, \ldots, B$:

  1. Draw a random subsample $\mathcal{S}_b$ of size $s < n$
  2. Split $\mathcal{S}_b$ into structure sample $\mathcal{J}_b$ and estimation sample $\mathcal{I}_b$
  3. Grow a tree using $\mathcal{J}_b$: at each node, choose the split maximizing
$$\Delta(C_L, C_R) = \frac{n_L n_R}{n_L + n_R} \cdot (\hat{\tau}_{C_L} - \hat{\tau}_{C_R})^2$$
  4. Estimate leaf treatment effects using $\mathcal{I}_b$
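The steps above can be sketched in Python with single-split "stump" trees. This is a deliberate simplification: grf grows deep trees with many refinements, and the data-generating process below (true effect of 1 when x > 0.5, else 0) is invented for illustration.

```python
# Honest causal "stump" forest sketch: subsample, split the subsample into
# structure/estimation halves, pick the split on the structure half, and
# estimate child effects on the estimation half. Predictions average over trees.
import numpy as np

def tau(y, w):
    """Difference-in-means effect within a group."""
    return y[w == 1].mean() - y[w == 0].mean()

def fit_honest_stump(x, y, w, rng):
    n = len(y)
    sub = rng.choice(n, size=n // 2, replace=False)   # step 1: subsample
    J, I = sub[: n // 4], sub[n // 4:]                # step 2: honest split
    best_gain, best_c = -1.0, None
    for c in np.quantile(x[J], [0.25, 0.5, 0.75]):    # step 3: split chosen on J only
        L, R = J[x[J] <= c], J[x[J] > c]
        gain = len(L) * len(R) / len(J) * (tau(y[L], w[L]) - tau(y[R], w[R])) ** 2
        if gain > best_gain:
            best_gain, best_c = gain, c
    L, R = I[x[I] <= best_c], I[x[I] > best_c]        # step 4: effects estimated on I
    return best_c, tau(y[L], w[L]), tau(y[R], w[R])

def predict(trees, x0):
    """Average the relevant leaf effect over all trees."""
    return float(np.mean([tR if x0 > c else tL for c, tL, tR in trees]))

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n)
y = rng.normal(size=n) + w * (x > 0.5)   # true CATE: 1 if x > 0.5, else 0
trees = [fit_honest_stump(x, y, w, rng) for _ in range(200)]
print(predict(trees, 0.8), predict(trees, 0.2))  # roughly 1 and roughly 0
```

Even this crude version recovers the heterogeneity: predictions for x above 0.5 cluster near the true effect of 1, those below near 0.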

Asymptotic normality (Wager & Athey, 2018):

Under honesty, subsampling ($s/n \to 0$, $s \to \infty$), and regularity conditions:

$$\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0, 1)$$

where $\hat{\sigma}(x)$ uses the infinitesimal jackknife variance estimator.


E. Implementation

# Requires: grf
# grf: Generalized Random Forests (Athey, Tibshirani, Wager)
library(grf)

# --- Step 1: Fit the causal forest ---
# causal_forest() grows an ensemble of honest causal trees
# honesty = TRUE: separate subsamples for tree structure and leaf estimation
# This separation ensures valid confidence intervals for CATEs
cf <- causal_forest(
  X = as.matrix(df[, covariate_cols]),
  Y = df$outcome,
  W = df$treatment,
  num.trees = 2000,  # more trees = more stable estimates (diminishing returns past ~2000)
  honesty = TRUE,    # required for valid inference; do not turn off
  seed = 42
)

# --- Step 2: Estimate individual-level CATEs ---
# predict() returns out-of-bag CATE estimates for each observation
# estimate.variance = TRUE enables pointwise confidence intervals
cate <- predict(cf, estimate.variance = TRUE)
df$tau_hat <- cate$predictions          # estimated treatment effect for each unit
df$tau_se <- sqrt(cate$variance.estimates)  # SE via infinitesimal jackknife

# --- Step 3: Estimate the average treatment effect (ATE) ---
# target.sample = "all" averages CATEs across the full sample
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate[1], "(SE:", ate[2], ")\n")
# Interpretation: the population-average causal effect of treatment

# --- Step 4: Variable importance ---
# Measures which covariates the forest uses most for splitting
# NOTE: this is predictive importance for heterogeneity, NOT causal moderation
varimp <- variable_importance(cf)
varimp_vec <- setNames(as.vector(varimp), covariate_cols)
sort(varimp_vec, decreasing = TRUE)[1:10]

# --- Step 5: Calibration test for heterogeneity ---
# Tests whether estimated CATEs predict actual treatment effect variation
# A significant "differential forest prediction" suggests genuine heterogeneity
test_calibration(cf)

# --- Step 6: Best linear projection ---
# Projects CATEs onto specific covariates for interpretable coefficients
# Use this (not variable importance) to test specific moderation hypotheses
blp <- best_linear_projection(cf,
  A = as.matrix(df[, c("age", "cholesterol")]))
print(blp)
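For intuition about what the best linear projection in Step 6 reports, here is a crude pure-Python sketch: the OLS slope of estimated CATEs on one covariate. Note that grf's best_linear_projection uses doubly robust scores rather than raw CATE estimates, so this is illustration only; the age and CATE values below are hypothetical.

```python
# Sketch of a "best linear projection"-style summary: OLS slope of CATE
# estimates on a covariate. (grf regresses doubly robust scores, not raw
# CATEs; this simplified version conveys only the interpretation.)

def ols_slope(x, tau):
    """Slope from a simple one-covariate OLS of tau on x."""
    mx, mt = sum(x) / len(x), sum(tau) / len(tau)
    cov = sum((a - mx) * (b - mt) for a, b in zip(x, tau))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

age = [40, 50, 60, 70]               # hypothetical covariate values
tau_hat = [0.10, 0.08, 0.06, 0.04]   # hypothetical CATEs declining in age
print(ols_slope(age, tau_hat))       # -0.002: each extra year predicts a smaller effect
```

The slope gives an interpretable per-unit summary of heterogeneity, which variable importance cannot provide.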

F. Diagnostics

  1. Calibration test. test_calibration() in grf tests whether estimated CATEs predict actual treatment effect variation. A significant "differential forest prediction" coefficient suggests genuine heterogeneity.

  2. CATE distribution. Plot $\hat{\tau}(x)$. Tight concentration around the ATE suggests homogeneous effects. Wide spread suggests substantial heterogeneity.

  3. Variable importance. Report which covariates the forest uses most for splitting. Remember: predictive importance, not causal.

  4. Best linear projection. Project CATEs onto key covariates for interpretable coefficients.

  5. Overlap check. Verify propensity scores are bounded away from 0 and 1.

  6. Out-of-bag predictions. The grf package uses OOB predictions by default for honest estimation.
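The overlap check in item 5 can be sketched as a simple screen on estimated propensity scores; the scores and the 0.05 threshold below are hypothetical choices for illustration.

```python
# Minimal overlap (positivity) check: flag units whose estimated propensity
# score sits within eps of 0 or 1, where identification becomes fragile.

def overlap_violations(pscores, eps=0.05):
    """Indices of units with propensity scores too close to 0 or 1."""
    return [i for i, p in enumerate(pscores) if p < eps or p > 1 - eps]

ps = [0.40, 0.01, 0.55, 0.98, 0.30]
print(overlap_violations(ps))  # -> [1, 3]
```

Flagged units either almost never or almost always receive treatment given their covariates; trimming or re-examining them is standard practice.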

Interpreting Your Results

Significant heterogeneity detected: Report the CATE distribution, top variables by importance, and best linear projection. Discuss who benefits most and policy implications.

No significant heterogeneity: This null finding is informative. The ATE is a good summary for everyone.

Variable importance caveat: "Age has the highest variable importance" means age best predicts which people have larger or smaller effects. It does NOT mean age causes the difference. State this caveat explicitly whenever you report variable importance.


G. What Can Go Wrong


Interpreting Variable Importance as Causal Moderation

Report variable importance as predictive importance for heterogeneity, explicitly noting it does not imply causal moderation. Use best linear projection (BLP) to provide interpretable coefficients and discuss plausible causal mechanisms.

Variable importance: zip code (0.18), age (0.12), baseline cholesterol (0.10). BLP shows age coefficient of -0.05 (SE = 0.02), suggesting older patients benefit slightly less. The researcher discusses that age may proxy for comorbidities, not that age causally moderates the treatment effect.


Small Sample Spurious Heterogeneity

With n = 5,000 from an RCT, fit a causal forest and run the calibration test. If the calibration test is non-significant, conclude that the ATE is a good summary for all individuals.

ATE: 3.2 pp (SE = 0.4). Calibration test p = 0.35. CATE distribution is tightly concentrated around the ATE. Conclusion: no detectable heterogeneity — the ATE applies broadly.


Turning Off Honesty Invalidates Confidence Intervals

Set honesty = TRUE (the default in grf) so that tree structure is determined on one subsample and treatment effects are estimated on a separate subsample.

Honest causal forest: ATE = 3.2 (SE = 0.4). Pointwise 95% CIs for individual CATEs achieve 94.2% coverage in simulations. The separation of structure and estimation samples prevents overfitting.


H. Practice

Concept Check

You fit a causal forest to RCT data (n = 5,000). ATE is 3.2 pp (p < 0.01). Calibration test p = 0.35. Top variable importance: age (0.08), education (0.06). What should you conclude?

Concept Check

Why does the causal forest algorithm use 'honesty' (separate subsamples for building the tree structure and estimating leaf effects)?

Guided Exercise

Causal Forests: Targeting a Financial Literacy Intervention

A fintech company runs an RCT testing whether a personalized financial literacy course (T) improves credit scores (Y) among 12,000 customers. After fitting a causal forest, the estimated CATEs range from near-zero for some customers to +35 points for others. The calibration test is significant (p = 0.001), and variable importance shows that baseline credit score (0.31), age (0.18), and income volatility (0.15) are the top predictors of CATE.

What does a significant calibration test tell you?

What does 'honest splitting' mean in causal forests, and why is it important?

Variable importance shows baseline credit score = 0.31. What does this mean, and what does it NOT mean?

How would you use CATE estimates to decide which customers to target with the course, given a fixed budget?

Error Detective

Read the analysis below carefully and identify the errors.

A health services researcher uses a causal forest to study which patients benefit most from a new diabetes drug in an RCT (n = 8,000). She reports: "The causal forest reveals substantial treatment effect heterogeneity (calibration test p = 0.01). Variable importance: BMI (0.22), age (0.15), HbA1c baseline (0.14). We recommend targeting the drug to patients with BMI > 30 and HbA1c > 8%, as these patients show the largest CATEs (mean CATE = 1.8 pp vs. 0.3 pp for others)." She writes: "High BMI causally moderates the drug's effect because obese patients have more insulin resistance, which the drug specifically addresses."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

An education researcher uses a causal forest with observational data (n = 12,000) to estimate heterogeneous effects of school vouchers on test scores. She propensity-score adjusts for 20 covariates including family income, parental education, and prior test scores. She reports: "The causal forest finds that low-income students benefit most (CATE = 0.4 SD) while high-income students show zero effect (CATE = 0.01 SD). This finding supports targeting vouchers to low-income families." She uses honesty = FALSE because "the sample within income subgroups is small and we need all data for precision."

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors use a causal forest on an RCT (n = 20,000) of a job training program. They find substantial CATE heterogeneity: bottom quartile has -$200, top quartile has +$3,500. Top variable importance: age (0.15), education (0.12), prior earnings (0.11). They recommend targeting the program to the top two CATE quartiles.

Key Table

Quartile       CATE      95% CI
Q1 (lowest)    -$200     [-$800, $400]
Q2             $500      [-$100, $1,100]
Q3             $1,800    [$1,000, $2,600]
Q4 (highest)   $3,500    [$2,500, $4,500]

Calibration test p-value: 0.002
Variable importance: age (0.15), education (0.12), prior_earnings (0.11)

Authors' Identification Claim

The RCT ensures unconfoundedness. The causal forest discovers that treatment effects vary substantially, with older, less-educated workers with low prior earnings benefiting most.


I. Swap-In: When to Use Something Else

  • OLS with interactions: When the dimensions of heterogeneity are known a priori and can be specified parametrically — simpler and more interpretable when theory provides clear subgroup hypotheses.
  • Pre-specified subgroup analysis: When the research question concerns a small number of pre-specified groups (e.g., by gender, age bracket) rather than a continuous heterogeneity surface.
  • DML: When the goal is an average treatment effect with high-dimensional confounders, rather than treatment effect heterogeneity.
  • Bayesian Additive Regression Trees (BART): An alternative ML-based approach to heterogeneous treatment effects with built-in uncertainty quantification and a different regularization philosophy.

J. Reviewer Checklist

Critical Reading Checklist


Paper Library

Foundational (6)

Athey, S., & Imbens, G. W. (2016). Recursive Partitioning for Heterogeneous Causal Effects.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1510489113

Athey and Imbens introduce causal trees, adapting the CART algorithm to estimate heterogeneous treatment effects with valid inference. They propose the honest estimation approach, where one subsample is used for tree construction and another for estimation, ensuring valid confidence intervals.

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests.

Annals of Statistics. DOI: 10.1214/18-AOS1709

Athey, Tibshirani, and Wager introduce the generalized random forest (GRF) framework, which extends causal forests to a broad class of estimating equations including quantile regression, IV, and local average treatment effects. GRF provides the theoretical foundation and the widely used grf R package.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1804597116

Künzel and colleagues propose the X-learner meta-algorithm for estimating CATEs and systematically compare it with T-learners and S-learners using random forests and BART as base learners. The paper provides practical guidance on when different meta-learning strategies perform well or poorly.

Nie, X., & Wager, S. (2021). Quasi-Oracle Estimation of Heterogeneous Treatment Effects. Biometrika.

Nie and Wager propose the R-learner, a two-step approach for estimating heterogeneous treatment effects that first residualizes outcomes and treatment on covariates, then estimates the CATE by regressing outcome residuals on treatment residuals. This approach can use any machine learning method including causal forests.

Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal Random Forest for Causal Inference.

Proceedings of the 36th International Conference on Machine Learning

Oprescu, Syrgkanis, and Wu propose orthogonal random forests, which combine Neyman-orthogonal moments with generalized random forests to reduce sensitivity to nuisance-estimation error. The paper provides theoretical results and shows how the method can be used for heterogeneous-effect estimation with discrete or continuous treatments.

Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.

Journal of the American Statistical Association. DOI: 10.1080/01621459.2017.1319839

Wager and Athey develop causal forests by extending random forests to estimate conditional average treatment effects. They prove pointwise consistency and asymptotic normality under regularity conditions, enabling valid confidence intervals for individualized treatment effect estimates.

Application (2)

Brand, J. E., Xu, J., Koch, B., & Geraldo, P. (2021). Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning.

Sociological Methodology. DOI: 10.1177/0081175021993503

Brand and colleagues provide a practical guide to using causal trees and forests in social science research. They discuss honest estimation, variable importance for understanding which covariates drive heterogeneity, and apply the methods to study heterogeneous returns to college education.

Davis, J., & Heller, S. B. (2017). Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs.

American Economic Review. DOI: 10.1257/aer.p20171000

Davis and Heller apply causal forests to a randomized summer jobs program for disadvantaged youth in Chicago, exploring how useful predicted treatment effect heterogeneity is in practice. They find the method can identify heterogeneity for some outcomes that standard interaction methods miss, while highlighting limitations of the approach.

Tags

ml-causal · heterogeneous-effects · frontier