
Causal Forests / Heterogeneous Treatment Effects

Estimates how treatment effects vary across individuals — who benefits most and who benefits least.

When to Use: When you want to estimate treatment effect heterogeneity across subgroups without pre-specifying which subgroups matter — discovering who benefits most and who benefits least.

Key Assumptions: Unconfoundedness (conditional on observables) plus honest estimation (separate training and estimation samples via sample splitting). Overlap (positivity) is also required.

Common Mistake: Interpreting variable importance from the forest as causal moderation — it reflects predictive importance for heterogeneity, not causal importance of those variables.

One-Line Implementation

R: causal_forest(X, Y, W, num.trees = 2000, honesty = TRUE)
Stata: No native Stata package; use rcall: rcall: library(grf); cf <- causal_forest(X, Y, W)
Python: CausalForestDML(model_y='auto', model_t='auto', n_estimators=2000).fit(Y, T, X=X)


Motivating Example: Who Benefits from the Drug?

A hospital conducts a randomized trial of a new drug for heart disease. The trial shows that, on average, the drug reduces the risk of a heart attack by 5 percentage points. But the hospital administrator asks: which patients benefit most?

Perhaps the drug works well for patients with high cholesterol but provides little benefit for those with low cholesterol. Perhaps it is more effective for older patients. Perhaps there is a complex interaction between age, cholesterol, blood pressure, and diabetes status.

Traditional subgroup analysis requires you to pre-specify: "Let me check patients above vs. below age 65." But with 50 potential effect modifiers, this approach encounters the multiple testing problem directly. And if the analysis is exploratory rather than pre-specified, it is likely to produce "significant" heterogeneity even when there is none.

Causal forests address this challenge by using the random forest algorithm, modified to target treatment effect estimation rather than outcome prediction (Athey & Imbens, 2016; Wager & Athey, 2018). They discover which covariates drive heterogeneity without requiring you to pre-specify the subgroups.


A. Overview

From Prediction to Causal Estimation

A standard random forest predicts the average outcome within each leaf, much like OLS predicts conditional means. A causal forest instead estimates the treatment effect within each leaf, enabling estimation of effect heterogeneity. The forest's output is a conditional average treatment effect (CATE):

$$\hat{\tau}(x) = \hat{E}[Y(1) - Y(0) \mid X = x]$$

This formula tells you: for a person with characteristics $x$, what is the expected effect of treatment?

How Causal Trees Differ from Regular Trees

Regular trees split on variables that best predict $Y$. Causal trees split on variables that best separate treatment effects. At each node, the split is chosen to maximize the variance of treatment effects across the two child nodes. If splitting on "age > 65" produces groups with 10% and 1% treatment effects, that separation is a good split. If splitting on "gender" produces two groups with 5% effects each, it is uninformative.
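A toy sketch of this splitting criterion in pure Python (not the grf implementation): each candidate child's treatment effect is a difference in means, and the split score is the sample-size-weighted squared gap between the two children's effects.

```python
# Toy sketch: score a candidate split by how well it separates treatment
# effects, following Delta(C_L, C_R) = n_L*n_R/(n_L+n_R) * (tau_L - tau_R)^2.

def tau_hat(ys, ws):
    """Difference-in-means treatment effect within one group."""
    y1 = [y for y, w in zip(ys, ws) if w == 1]
    y0 = [y for y, w in zip(ys, ws) if w == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def split_score(yL, wL, yR, wR):
    """Weighted squared difference in child treatment effects."""
    nL, nR = len(yL), len(yR)
    return nL * nR / (nL + nR) * (tau_hat(yL, wL) - tau_hat(yR, wR)) ** 2

# Left child: large effect (treated outcomes ~10 higher); right child: small effect
yL, wL = [10, 11, 0, 1], [1, 1, 0, 0]   # tau_L = 10.5 - 0.5 = 10
yR, wR = [5, 6, 5, 6], [1, 0, 1, 0]     # tau_R = 5 - 6 = -1
print(split_score(yL, wL, yR, wR))       # -> 242.0, a high score: good split
```

A split that yields identical child effects would score zero, so the tree never spends a split on it.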

Honest Estimation

A key innovation is honest estimation. In a standard random forest, the same data determine tree structure and estimate predictions within leaves. This dual use creates overfitting.

In an honest causal forest, data are split:

  • Structure sample: Determines where to split
  • Estimation sample: Estimates treatment effects within each leaf

This separation ensures CATE estimates are not biased by the tree-growing process.
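A minimal sketch of the sample split, assuming a 50/50 division (grf's default honesty fraction is 0.5); the key property is that no observation appears in both samples:

```python
# Honest sample splitting sketch: the same observation never informs both
# where to split (structure) and what effect to report (estimation).
import random

def honest_split(indices, seed=0):
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (structure sample, estimation sample)

structure, estimation = honest_split(range(100))
print(set(structure).isdisjoint(estimation))  # True: the samples do not overlap
```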

Common Confusions

"Is variable importance from a causal forest causal?" No. Variable importance tells you which variables best predict heterogeneity in treatment effects. It does not tell you that those variables cause the heterogeneity. Zip code might have high importance because it proxies for diet, exercise, and healthcare access.

"Do I need an experiment?" Not necessarily, but unconfoundedness is required. Causal forests work with both experimental and observational data, as long as treatment is independent of potential outcomes conditional on $X$. They are most credible with experimental data.

"How is this approach different from subgroup regressions?" Subgroup analysis requires pre-specified subgroups. Causal forests discover relevant subgroups from the data and handle complex interactions that would be difficult to specify manually. The trade-off is interpretability. For estimating an average treatment effect with ML-assisted confounding control, see double/debiased machine learning.

"Can I use causal forests for policy targeting?" Yes. If you estimate $\hat{\tau}(x)$ for each person, you can target treatment to those with the largest predicted benefits. This application is called optimal treatment assignment or personalized policy learning.
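A minimal sketch of budget-constrained targeting, using hypothetical CATE estimates: rank people by predicted benefit and treat the top of the ranking.

```python
# Greedy CATE-based targeting sketch: treat the `budget` people with the
# largest predicted benefit. The tau_hat values below are hypothetical.

def target_by_cate(tau_hat, budget):
    """Indices of the `budget` units with the largest predicted effects."""
    ranked = sorted(range(len(tau_hat)), key=lambda i: tau_hat[i], reverse=True)
    return sorted(ranked[:budget])

tau_hat = [0.01, 0.12, -0.02, 0.08, 0.05]
print(target_by_cate(tau_hat, 2))  # -> [1, 3]
```

In practice the ranking should account for estimation uncertainty in the CATEs; naive ranking over-selects units with noisy extreme estimates.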


B. Identification

The Target Estimand

$$\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$$

Identifying Assumptions

  1. Unconfoundedness: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$
  2. Overlap: $0 < P(D = 1 \mid X = x) < 1$ for all $x$
  3. SUTVA: No interference between units

The Causal Forest Estimator

Wager and Athey (2018) show that the causal forest estimate can be written as a weighted average of outcomes:

$$\hat{\tau}(x) = \sum_{i=1}^{n} \alpha_i(x) \cdot Y_i$$

where the weights $\alpha_i(x)$ are determined by the forest structure. Under regularity conditions:

  • Consistency: $\hat{\tau}(x) \to \tau(x)$
  • Asymptotic normality: $(\hat{\tau}(x) - \tau(x)) / \hat{\sigma}(x) \to N(0, 1)$, enabling pointwise confidence intervals

C. Visual Intuition

Imagine a scatterplot of patients with age on one axis and cholesterol on another. Each point is colored by treatment effect. A causal forest draws a heat map over this space, estimating the treatment effect at every point and discovering regions where effects cluster together.

Targeting treatment to high-CATE subgroups raises the average effect per treated individual.


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: Causal forests grow many honest causal trees, each splitting to maximize heterogeneity in treatment effects, using separate estimation samples to avoid overfitting.

Algorithm: Honest Causal Forest

For each tree $b = 1, \ldots, B$:

  1. Draw a random subsample $\mathcal{S}_b$ of size $s < n$
  2. Split $\mathcal{S}_b$ into structure sample $\mathcal{J}_b$ and estimation sample $\mathcal{I}_b$
  3. Grow a tree using $\mathcal{J}_b$: at each node, choose the split maximizing
$$\Delta(C_L, C_R) = \frac{n_L n_R}{n_L + n_R} \cdot (\hat{\tau}_{C_L} - \hat{\tau}_{C_R})^2$$
  4. Estimate leaf treatment effects using $\mathcal{I}_b$
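The steps above can be sketched in Python with single-split "stump" trees. This is a deliberate simplification: grf grows deep trees with many refinements, and the data-generating process below (true effect of 1 when x > 0.5, else 0) is invented for illustration.

```python
# Honest causal "stump" forest sketch: subsample, split the subsample into
# structure/estimation halves, pick the split on the structure half, and
# estimate child effects on the estimation half. Predictions average over trees.
import numpy as np

def tau(y, w):
    """Difference-in-means effect within a group."""
    return y[w == 1].mean() - y[w == 0].mean()

def fit_honest_stump(x, y, w, rng):
    n = len(y)
    sub = rng.choice(n, size=n // 2, replace=False)   # step 1: subsample
    J, I = sub[: n // 4], sub[n // 4:]                # step 2: honest split
    best_gain, best_c = -1.0, None
    for c in np.quantile(x[J], [0.25, 0.5, 0.75]):    # step 3: split chosen on J only
        L, R = J[x[J] <= c], J[x[J] > c]
        gain = len(L) * len(R) / len(J) * (tau(y[L], w[L]) - tau(y[R], w[R])) ** 2
        if gain > best_gain:
            best_gain, best_c = gain, c
    L, R = I[x[I] <= best_c], I[x[I] > best_c]        # step 4: effects estimated on I
    return best_c, tau(y[L], w[L]), tau(y[R], w[R])

def predict(trees, x0):
    """Average the relevant leaf effect over all trees."""
    return float(np.mean([tR if x0 > c else tL for c, tL, tR in trees]))

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n)
y = rng.normal(size=n) + w * (x > 0.5)   # true CATE: 1 if x > 0.5, else 0
trees = [fit_honest_stump(x, y, w, rng) for _ in range(200)]
print(predict(trees, 0.8), predict(trees, 0.2))  # roughly 1 and roughly 0
```

Even this crude version recovers the heterogeneity: predictions for x above 0.5 cluster near the true effect of 1, those below near 0.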

Asymptotic normality (Wager & Athey, 2018):

Under honesty, subsampling ($s/n \to 0$, $s \to \infty$), and regularity conditions:

$$\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0, 1)$$

where $\hat{\sigma}(x)$ uses the infinitesimal jackknife variance estimator.


E. Implementation

# Requires: grf
# grf: Generalized Random Forests (Athey, Tibshirani, Wager)
library(grf)

# --- Step 1: Fit the causal forest ---
# causal_forest() grows an ensemble of honest causal trees
# honesty = TRUE: separate subsamples for tree structure and leaf estimation
# This separation ensures valid confidence intervals for CATEs
cf <- causal_forest(
  X = as.matrix(df[, covariate_cols]),
  Y = df$outcome,
  W = df$treatment,
  num.trees = 2000,  # more trees = more stable estimates (diminishing returns past ~2000)
  honesty = TRUE,    # required for valid inference; do not turn off
  seed = 42
)

# --- Step 2: Estimate individual-level CATEs ---
# predict() returns out-of-bag CATE estimates for each observation
# estimate.variance = TRUE enables pointwise confidence intervals
cate <- predict(cf, estimate.variance = TRUE)
df$tau_hat <- cate$predictions          # estimated treatment effect for each unit
df$tau_se <- sqrt(cate$variance.estimates)  # SE via infinitesimal jackknife

# --- Step 3: Estimate the average treatment effect (ATE) ---
# target.sample = "all" averages CATEs across the full sample
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate[1], "(SE:", ate[2], ")\n")
# Interpretation: the population-average causal effect of treatment

# --- Step 4: Variable importance ---
# Measures which covariates the forest uses most for splitting
# NOTE: this is predictive importance for heterogeneity, NOT causal moderation
varimp <- variable_importance(cf)
varimp_vec <- setNames(as.vector(varimp), covariate_cols)
sort(varimp_vec, decreasing = TRUE)[1:10]

# --- Step 5: Calibration test for heterogeneity ---
# Tests whether estimated CATEs predict actual treatment effect variation
# A significant "differential forest prediction" suggests genuine heterogeneity
test_calibration(cf)

# --- Step 6: Best linear projection ---
# Projects CATEs onto specific covariates for interpretable coefficients
# Use this (not variable importance) to test specific moderation hypotheses
blp <- best_linear_projection(cf,
  A = as.matrix(df[, c("age", "cholesterol")]))
print(blp)
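For intuition about what the best linear projection in Step 6 reports, here is a crude pure-Python sketch: the OLS slope of estimated CATEs on one covariate. Note that grf's best_linear_projection uses doubly robust scores rather than raw CATE estimates, so this is illustration only; the age and CATE values below are hypothetical.

```python
# Sketch of a "best linear projection"-style summary: OLS slope of CATE
# estimates on a covariate. (grf regresses doubly robust scores, not raw
# CATEs; this simplified version conveys only the interpretation.)

def ols_slope(x, tau):
    """Slope from a simple one-covariate OLS of tau on x."""
    mx, mt = sum(x) / len(x), sum(tau) / len(tau)
    cov = sum((a - mx) * (b - mt) for a, b in zip(x, tau))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

age = [40, 50, 60, 70]               # hypothetical covariate values
tau_hat = [0.10, 0.08, 0.06, 0.04]   # hypothetical CATEs declining in age
print(ols_slope(age, tau_hat))       # -0.002: each extra year predicts a smaller effect
```

The slope gives an interpretable per-unit summary of heterogeneity, which variable importance cannot provide.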

F. Diagnostics

  1. Calibration test. test_calibration() in grf tests whether estimated CATEs predict actual treatment effect variation. A significant "differential forest prediction" coefficient suggests genuine heterogeneity.

  2. CATE distribution. Plot $\hat{\tau}(x)$. Tight concentration around the ATE suggests homogeneous effects. Wide spread suggests substantial heterogeneity.

  3. Variable importance. Report which covariates the forest uses most for splitting. Remember: predictive importance, not causal.

  4. Best linear projection. Project CATEs onto key covariates for interpretable coefficients.

  5. Overlap check. Verify propensity scores are bounded away from 0 and 1.

  6. Out-of-bag predictions. The grf package uses OOB predictions by default for honest estimation.
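The overlap check in item 5 can be sketched as a simple screen on estimated propensity scores; the scores and the 0.05 threshold below are hypothetical choices for illustration.

```python
# Minimal overlap (positivity) check: flag units whose estimated propensity
# score sits within eps of 0 or 1, where identification becomes fragile.

def overlap_violations(pscores, eps=0.05):
    """Indices of units with propensity scores too close to 0 or 1."""
    return [i for i, p in enumerate(pscores) if p < eps or p > 1 - eps]

ps = [0.40, 0.01, 0.55, 0.98, 0.30]
print(overlap_violations(ps))  # -> [1, 3]
```

Flagged units either almost never or almost always receive treatment given their covariates; trimming or re-examining them is standard practice.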

Interpreting Your Results

Significant heterogeneity detected: Report the CATE distribution, top variables by importance, and best linear projection. Discuss who benefits most and policy implications.

No significant heterogeneity: This null finding is informative. The ATE is a good summary for everyone.

Variable importance caveat: "Age has the highest variable importance" means age best predicts which people have larger or smaller effects. It does NOT mean age causes the difference. State this caveat explicitly whenever you report variable importance.


G. What Can Go Wrong


Interpreting Variable Importance as Causal Moderation

Report variable importance as predictive importance for heterogeneity, explicitly noting it does not imply causal moderation. Use best linear projection (BLP) to provide interpretable coefficients and discuss plausible causal mechanisms.

Variable importance: zip code (0.18), age (0.12), baseline cholesterol (0.10). BLP shows age coefficient of -0.05 (SE = 0.02), suggesting older patients benefit slightly less. The researcher discusses that age may proxy for comorbidities, not that age causally moderates the treatment effect.


Small Sample Spurious Heterogeneity

With n = 5,000 from an RCT, fit a causal forest and run the calibration test. If the calibration test is non-significant, conclude that the ATE is a good summary for all individuals.

ATE: 3.2 pp (SE = 0.4). Calibration test p = 0.35. CATE distribution is tightly concentrated around the ATE. Conclusion: no detectable heterogeneity — the ATE applies broadly.


Turning Off Honesty Invalidates Confidence Intervals

Set honesty = TRUE (the default in grf) so that tree structure is determined on one subsample and treatment effects are estimated on a separate subsample.

Honest causal forest: ATE = 3.2 (SE = 0.4). Pointwise 95% CIs for individual CATEs achieve 94.2% coverage in simulations. The separation of structure and estimation samples prevents overfitting.


H. Practice

Concept Check

You fit a causal forest to RCT data (n = 5,000). ATE is 3.2 pp (p < 0.01). Calibration test p = 0.35. Top variable importance: age (0.08), education (0.06). What should you conclude?

Concept Check

Why does the causal forest algorithm use 'honesty' (separate subsamples for building the tree structure and estimating leaf effects)?

Guided Exercise

Causal Forests: Targeting a Financial Literacy Intervention

A fintech company runs an RCT testing whether a personalized financial literacy course (T) improves credit scores (Y) among 12,000 customers. After fitting a causal forest, the estimated CATEs range from near-zero for some customers to +35 points for others. The calibration test is significant (p = 0.001), and variable importance shows that baseline credit score (0.31), age (0.18), and income volatility (0.15) are the top predictors of CATE.

What does a significant calibration test tell you?

What does 'honest splitting' mean in causal forests, and why is it important?

Variable importance shows baseline credit score = 0.31. What does this mean, and what does it NOT mean?

How would you use CATE estimates to decide which customers to target with the course, given a fixed budget?

Error Detective

Read the analysis below carefully and identify the errors.

A health services researcher uses a causal forest to study which patients benefit most from a new diabetes drug in an RCT (n = 8,000). She reports: "The causal forest reveals substantial treatment effect heterogeneity (calibration test p = 0.01). Variable importance: BMI (0.22), age (0.15), HbA1c baseline (0.14). We recommend targeting the drug to patients with BMI > 30 and HbA1c > 8%, as these patients show the largest CATEs (mean CATE = 1.8 pp vs. 0.3 pp for others)." She writes: "High BMI causally moderates the drug's effect because obese patients have more insulin resistance, which the drug specifically addresses."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

An education researcher uses a causal forest with observational data (n = 12,000) to estimate heterogeneous effects of school vouchers on test scores. She propensity-score adjusts for 20 covariates including family income, parental education, and prior test scores. She reports: "The causal forest finds that low-income students benefit most (CATE = 0.4 SD) while high-income students show zero effect (CATE = 0.01 SD). This finding supports targeting vouchers to low-income families." She uses honesty = FALSE because "the sample within income subgroups is small and we need all data for precision."

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors use a causal forest on an RCT (n = 20,000) of a job training program. They find substantial CATE heterogeneity: bottom quartile has -$200, top quartile has +$3,500. Top variable importance: age (0.15), education (0.12), prior earnings (0.11). They recommend targeting the program to the top two CATE quartiles.

Key Table

Quartile       CATE      95% CI
Q1 (lowest)    -$200     [-$800, $400]
Q2             $500      [-$100, $1,100]
Q3             $1,800    [$1,000, $2,600]
Q4 (highest)   $3,500    [$2,500, $4,500]

Calibration test p-value: 0.002
Variable importance: age (0.15), education (0.12), prior_earnings (0.11)

Authors' Identification Claim

The RCT ensures unconfoundedness. The causal forest discovers that treatment effects vary substantially, with older, less-educated workers with low prior earnings benefiting most.


I. Swap-In: When to Use Something Else

  • OLS with interactions: When the dimensions of heterogeneity are known a priori and can be specified parametrically — simpler and more interpretable when theory provides clear subgroup hypotheses.
  • Pre-specified subgroup analysis: When the research question concerns a small number of pre-specified groups (e.g., by gender, age bracket) rather than a continuous heterogeneity surface.
  • DML: When the goal is an average treatment effect with high-dimensional confounders, rather than treatment effect heterogeneity.
  • Bayesian Additive Regression Trees (BART): An alternative ML-based approach to heterogeneous treatment effects with built-in uncertainty quantification and a different regularization philosophy.

J. Reviewer Checklist

Critical Reading Checklist


Paper Library

Foundational (6)

Athey, S., & Imbens, G. W. (2016). Recursive Partitioning for Heterogeneous Causal Effects.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1510489113

Athey and Imbens introduce causal trees, adapting the CART algorithm to estimate heterogeneous treatment effects with valid inference. They propose the honest estimation approach, where one subsample is used for tree construction and another for estimation, ensuring valid confidence intervals.

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests.

Annals of Statistics. DOI: 10.1214/18-AOS1709

Athey, Tibshirani, and Wager introduce the generalized random forest (GRF) framework, which extends causal forests to a broad class of estimating equations including quantile regression, IV, and local average treatment effects. GRF provides the theoretical foundation and the widely used grf R package.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.

Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.1804597116

Künzel and colleagues propose the X-learner meta-algorithm for estimating CATEs and systematically compare it with T-learners and S-learners using random forests and BART as base learners. The paper provides practical guidance on when different meta-learning strategies perform well or poorly.

Nie, X., & Wager, S. (2021). Quasi-Oracle Estimation of Heterogeneous Treatment Effects. Biometrika.

Nie and Wager propose the R-learner, a two-step approach for estimating heterogeneous treatment effects that first residualizes outcomes and treatment on covariates, then estimates the CATE by regressing outcome residuals on treatment residuals. This approach can use any machine learning method including causal forests.

Oprescu, M., Syrgkanis, V., & Wu, Z. S. (2019). Orthogonal Random Forest for Causal Inference.

Proceedings of the 36th International Conference on Machine Learning

Oprescu, Syrgkanis, and Wu propose orthogonal random forests, which combine Neyman-orthogonal moments with generalized random forests to reduce sensitivity to nuisance-estimation error. The paper provides theoretical results and shows how the method can be used for heterogeneous-effect estimation with discrete or continuous treatments.

Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.

Journal of the American Statistical Association. DOI: 10.1080/01621459.2017.1319839

Wager and Athey develop causal forests by extending random forests to estimate conditional average treatment effects. They prove pointwise consistency and asymptotic normality under regularity conditions, enabling valid confidence intervals for individualized treatment effect estimates.

Application (2)

Brand, J. E., Xu, J., Koch, B., & Geraldo, P. (2021). Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning.

Sociological Methodology. DOI: 10.1177/0081175021993503

Brand and colleagues provide a practical guide to using causal trees and forests in social science research. They discuss honest estimation, variable importance for understanding which covariates drive heterogeneity, and apply the methods to study heterogeneous returns to college education.

Davis, J., & Heller, S. B. (2017). Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs.

American Economic Review. DOI: 10.1257/aer.p20171000

Davis and Heller apply causal forests to a randomized summer jobs program for disadvantaged youth in Chicago, exploring how useful predicted treatment effect heterogeneity is in practice. They find the method can identify heterogeneity for some outcomes that standard interaction methods miss, while highlighting limitations of the approach.

Tags

ml-causal · heterogeneous-effects · frontier