Tutorial · 120 minutes

Lab: Causal Forests for Heterogeneous Treatment Effects

Estimate individualized treatment effects via the Wager-Athey causal forest: move beyond ATE to identify who benefits most and design targeting policies.

Method: Causal Forests / Heterogeneous Treatment Effects
Languages: Python, R, Stata
Dataset: Simulated RCT with heterogeneous treatment effects

Overview

In this lab you will analyze a simulated randomized controlled trial where the treatment effect varies substantially across individuals. Rather than estimating a single average treatment effect (ATE), you will use causal forests to estimate the conditional average treatment effect (CATE) as a function of covariates, identify which variables drive treatment effect heterogeneity, and design an optimal treatment targeting policy.

What you will learn:

How to estimate heterogeneous treatment effects with causal forests
How to assess variable importance for treatment effect heterogeneity
How to evaluate CATE estimates using calibration tests and RATE curves
How to design and evaluate optimal targeting policies
How to compare causal forests with simple subgroup analysis

Prerequisites: Familiarity with random forests and basic causal inference. Completion of the OLS and DML tutorial labs is recommended.

Step 1: Simulate an RCT with Heterogeneous Effects

We create an experiment where the treatment effect depends on baseline risk, age, and education.

# First-time setup: install.packages("grf")
library(grf)

set.seed(42)
n <- 4000

# Generate covariates: demographics and health characteristics
age <- runif(n, 25, 65)
income <- rlnorm(n, 10.5, 0.6)
educ <- pmin(pmax(rnorm(n, 14, 3), 8), 22)             # Clip education to [8, 22]
health_score <- pmin(pmax(rnorm(n, 50, 15), 0), 100)   # Clip health to [0, 100]
female <- rbinom(n, 1, 0.5)
risk <- rnorm(n)  # Baseline risk factor

# Treatment assignment: randomized (RCT with 50% probability)
W <- rbinom(n, 1, 0.5)

# True CATE: effect varies by risk, age, and education
tau_true <- 4.5 + 3 * (risk > 0) - 0.1 * (age - 40) + 2 * (educ > 16)
# Baseline potential outcome under control (includes idiosyncratic noise)
mu0 <- 50 + 0.5 * age + 0.3 * health_score - 2 * risk + rnorm(n, 0, 5)
# Observed outcome given the realized treatment assignment
Y <- mu0 + W * tau_true

# Covariate matrix for the causal forest
X <- data.frame(age, income, educ, health_score, female, risk)

# Summary statistics for the outcome, treatment, and covariates
print(round(t(sapply(data.frame(Y, W, X), function(v)
  c(Mean = mean(v), SD = sd(v), Min = min(v), Max = max(v)))), 1))

cat("True ATE:", mean(tau_true), "\n")
cat("CATE range:", range(tau_true), "\n")
cat("Treatment rate:", mean(W), "\n")

Requires: grf

Expected output:

Variable	Mean	Std Dev	Min	Max
Y	~90.5	~9.2	~59	~122
W	0.50	0.50	0	1
age	45.0	11.5	25.0	65.0
income	44,300	30,100	5,800	245,000
educ	14.0	2.8	8.0	22.0
health_score	50.1	~15.0	0.0	100.0

True ATE: ~6.00
True CATE range: [~2.0, ~11.0]
Treatment rate: ~50%
Covariates that drive heterogeneity: age, risk, educ
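
The true ATE and CATE range can be cross-checked analytically from the DGP, without simulating. A minimal sketch in base R, assuming the covariate distributions used in the code (age uniform on [25, 65], risk standard normal, educ drawn from a normal with mean 14 and SD 3; clipping to [8, 22] does not change the educ > 16 indicator). The variable names are illustrative only.

```r
# Analytic check of the true ATE and CATE range implied by the DGP
p_high_risk <- 0.5                                               # P(risk > 0), standard normal
mean_age    <- (25 + 65) / 2                                     # mean of Uniform(25, 65)
p_high_educ <- pnorm(16, mean = 14, sd = 3, lower.tail = FALSE)  # P(educ > 16)

ate_analytic <- 4.5 + 3 * p_high_risk - 0.1 * (mean_age - 40) + 2 * p_high_educ

# Extremes: oldest, low-risk, educ <= 16 vs. youngest, high-risk, educ > 16
cate_min <- 4.5 + 0 - 0.1 * (65 - 40) + 0
cate_max <- 4.5 + 3 - 0.1 * (25 - 40) + 2

cat("Analytic ATE:", round(ate_analytic, 2), "\n")
cat("Analytic CATE range:", cate_min, "to", cate_max, "\n")
```

The sample values printed by the simulation should sit close to these population values, with small differences due to sampling noise.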

Step 2: Estimate the Average Treatment Effect

First, confirm the overall ATE before looking at heterogeneity.

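# Simple difference in means (unbiased in an RCT), with its Welch standard error
ate_simple <- mean(Y[W == 1]) - mean(Y[W == 0])
se_simple <- sqrt(var(Y[W == 1]) / sum(W == 1) + var(Y[W == 0]) / sum(W == 0))
cat("Diff in means:", ate_simple, " SE:", se_simple, "\n")

# Covariate-adjusted OLS (more precise due to reduced residual variance)
ols <- lm(Y ~ W + age + income + educ + health_score + female + risk)
ols_w <- summary(ols)$coefficients["W", ]
cat("OLS-adjusted:", ols_w["Estimate"], " SE:", ols_w["Std. Error"], "\n")
cat("True ATE:", mean(tau_true), "\n")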

Expected output:

Estimator	ATE Estimate	SE	True ATE
Difference in means	~6.00	~0.28	~6.00
OLS-adjusted	~6.00	~0.16	~6.00

Difference in means: ~6.00
OLS-adjusted ATE:    ~6.00 (SE: ~0.16)
True ATE:            ~6.00

The covariate-adjusted estimator has a much smaller standard error because adjusting for strong predictors of the outcome (age, health_score, risk) reduces residual variance.
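
The precision gain can be checked in isolation. A minimal sketch, assuming a simplified version of the Step 1 DGP (a constant treatment effect of 6 and only the covariates that enter the baseline outcome). It computes the Welch standard error for the difference in means and the OLS standard error on W.

```r
set.seed(1)
n <- 4000
age <- runif(n, 25, 65)
health_score <- pmin(pmax(rnorm(n, 50, 15), 0), 100)
risk <- rnorm(n)
W <- rbinom(n, 1, 0.5)
# Simplified outcome: constant treatment effect of 6 (an assumption for this sketch)
Y <- 50 + 0.5 * age + 0.3 * health_score - 2 * risk + rnorm(n, 0, 5) + 6 * W

# Unadjusted: Welch standard error of the difference in means
se_dm <- sqrt(var(Y[W == 1]) / sum(W == 1) + var(Y[W == 0]) / sum(W == 0))

# Adjusted: OLS standard error on W after controlling for outcome predictors
fit <- lm(Y ~ W + age + health_score + risk)
se_ols <- summary(fit)$coefficients["W", "Std. Error"]

cat("SE, difference in means:", round(se_dm, 3), "\n")
cat("SE, OLS-adjusted:       ", round(se_ols, 3), "\n")
```

The adjusted standard error is smaller because the covariates absorb outcome variance that the unadjusted comparison leaves in the residual.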

Step 3: Estimate CATEs with a Causal Forest

# Fit causal forest using the grf package
X_mat <- as.matrix(X)

cf <- causal_forest(X_mat, Y, W,
                    num.trees = 2000,    # Number of trees in the forest
                    min.node.size = 5,   # Target minimum number of observations per leaf
                    seed = 42)

# Extract out-of-bag CATE predictions for each observation
tau_hat <- predict(cf)$predictions

# Evaluate accuracy by comparing estimated vs. true CATEs (only possible in simulation)
cat("Correlation:", cor(tau_true, tau_hat), "\n")
cat("RMSE:", sqrt(mean((tau_true - tau_hat)^2)), "\n")
cat("Estimated ATE:", mean(tau_hat), "\n")
cat("True ATE:", mean(tau_true), "\n")

# Forest-based ATE with a standard error (based on doubly robust AIPW scores)
ate_cf <- average_treatment_effect(cf)
cat("Forest ATE:", ate_cf["estimate"], " SE:", ate_cf["std.err"], "\n")

Requires: grf

Expected output:

Correlation(true CATE, estimated CATE): ~0.85
RMSE:                                   ~1.2
Estimated ATE (mean of CATEs):          ~6.00
True ATE:                                ~6.00

Summary	Estimated CATE	True CATE
Mean	~6.00	~6.00
Std Dev	~2.5	~2.1
Min	~1.5	~2.0
Max	~10.5	~11.0

The correlation of ~0.85 indicates the causal forest successfully identifies who benefits more vs. less from treatment, even though individual-level estimates are noisy.
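
To see how noisy the individual-level estimates are, grf can return a variance estimate alongside each prediction. A minimal sketch, assuming a smaller simulated sample and a simplified CATE so that it runs quickly; `cf_small` and the other names are illustrative, and the 1.96 multiplier assumes approximate normality of the estimates.

```r
library(grf)

set.seed(1)
n <- 1000
X <- cbind(age = runif(n, 25, 65), risk = rnorm(n))
W <- rbinom(n, 1, 0.5)
tau <- 4.5 + 3 * (X[, "risk"] > 0) - 0.1 * (X[, "age"] - 40)   # simplified true CATE
Y <- 50 + 0.5 * X[, "age"] - 2 * X[, "risk"] + W * tau + rnorm(n, 0, 5)

cf_small <- causal_forest(X, Y, W, num.trees = 500, seed = 1)

# Out-of-bag predictions with per-observation variance estimates
pred <- predict(cf_small, estimate.variance = TRUE)
se <- sqrt(pred$variance.estimates)
ci_lower <- pred$predictions - 1.96 * se
ci_upper <- pred$predictions + 1.96 * se

cat("Mean 95% CI width:", mean(ci_upper - ci_lower), "\n")
cat("SD of point estimates:", sd(pred$predictions), "\n")
```

Comparing the typical interval width with the spread of the point estimates gives a sense of how much of the apparent individual variation is estimation noise.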

Concept Check

A causal forest produces CATE estimates for each individual. How should you interpret these individual-level predictions?

Each individual's CATE is estimated precisely and can be trusted at the individual level.
The CATEs are useful for ranking individuals and identifying broad patterns of heterogeneity, but individual-level estimates have wide confidence intervals and should not be over-interpreted.
The CATEs should only be used to compute the ATE by averaging.
The CATEs are biased because forests are a black box.

Step 4: Variable Importance and Heterogeneity Drivers

# Variable importance: depth-weighted share of splits made on each covariate
vi <- variable_importance(cf)
rownames(vi) <- colnames(X)
vi_sorted <- sort(vi[, 1], decreasing = TRUE)
barplot(vi_sorted, horiz = TRUE, main = "Variable Importance for Heterogeneity",
        xlab = "Importance")

# Compare estimated vs. true CATEs by risk subgroup
cat("\nCATEs by risk group:\n")
cat("Low risk:", mean(tau_hat[risk <= 0]), "(true:", mean(tau_true[risk <= 0]), ")\n")
cat("High risk:", mean(tau_hat[risk > 0]), "(true:", mean(tau_true[risk > 0]), ")\n")

# Compare estimated vs. true CATEs by age group
age_group <- cut(age, breaks = c(25, 35, 45, 55, 65), include.lowest = TRUE)
print(cbind(estimated = tapply(tau_hat, age_group, mean),
            true = tapply(tau_true, age_group, mean)))

Expected output: CATEs by age group

Age Group	Estimated CATE	True CATE
25–35	~7.5	~7.5
35–45	~6.5	~6.5
45–55	~5.5	~5.5
55–65	~4.5	~4.5

CATEs decline with age, consistent with the DGP: tau = 4.5 + 3*(risk > 0) - 0.1*(age - 40) + 2*(educ > 16).
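
For comparison with the forest, a simple subgroup analysis can be run as an interaction regression. A minimal sketch, assuming a simplified DGP in which the effect depends only on the risk indicator (so the true gap between high- and low-risk groups is 3); `high_risk` and `sub` are illustrative names.

```r
set.seed(1)
n <- 4000
risk <- rnorm(n)
W <- rbinom(n, 1, 0.5)
high_risk <- as.integer(risk > 0)
# Simplified outcome: effect is 4.5 for low risk and 7.5 for high risk
Y <- 50 - 2 * risk + W * (4.5 + 3 * high_risk) + rnorm(n, 0, 5)

# Interaction model: the W:high_risk coefficient is the subgroup difference in effects
sub <- lm(Y ~ W * high_risk + risk)
print(round(coef(summary(sub)), 3))

effect_low  <- coef(sub)["W"]
effect_high <- coef(sub)["W"] + coef(sub)["W:high_risk"]
cat("Low-risk effect:", effect_low, " High-risk effect:", effect_high, "\n")
```

The subgroup model only finds heterogeneity along the dimension chosen in advance, whereas the forest searches over all covariates.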

Step 5: Calibration and the RATE Curve

Evaluate whether the CATE estimates actually predict treatment effect heterogeneity.

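# Calibration test: checks whether predicted CATEs track actual effect heterogeneity
calib <- test_calibration(cf)
print(calib)

# RATE: rank units by estimated CATE and test whether the ranking finds larger effects.
# The ranking should come from data not used to fit the evaluation forest, so split the sample.
train <- sample(n, n / 2)
cf_train <- causal_forest(X_mat[train, ], Y[train], W[train], seed = 42)
cf_eval <- causal_forest(X_mat[-train, ], Y[-train], W[-train], seed = 42)
priorities <- predict(cf_train, X_mat[-train, ])$predictions
rate <- rank_average_treatment_effect(cf_eval, priorities)
plot(rate, main = "TOC Curve")
cat("AUTOC:", rate$estimate, " SE:", rate$std.err, "\n")

# Best linear projection: projects CATE onto covariates for interpretability
blp <- best_linear_projection(cf, X_mat)
print(blp)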

Expected output: CATE calibration by quintile

Quintile	Predicted CATE	Observed Effect	True CATE
Q1 (lowest)	~2.8	~3.0	~3.0
Q2	~4.8	~5.0	~4.8
Q3	~6.0	~5.8	~6.0
Q4	~7.2	~7.0	~7.2
Q5 (highest)	~9.0	~9.2	~9.0

Calibration is good: predicted CATEs closely track both the observed and true effects across quintiles, confirming the forest identifies genuine heterogeneity.
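
A quintile table of this kind can be built by hand in an RCT: sort units by predicted CATE, then take the difference in mean outcomes between treated and control units within each quintile. A minimal sketch, assuming noisy stand-in predictions so that the block runs without grf; `tau_hat_demo` is hypothetical and is not forest output.

```r
set.seed(1)
n <- 4000
risk <- rnorm(n)
age <- runif(n, 25, 65)
W <- rbinom(n, 1, 0.5)
tau <- 4.5 + 3 * (risk > 0) - 0.1 * (age - 40)          # simplified true CATE
Y <- 50 + 0.5 * age - 2 * risk + W * tau + rnorm(n, 0, 5)
tau_hat_demo <- tau + rnorm(n, 0, 1)   # hypothetical stand-in for forest predictions

# Assign each unit to a quintile of the predicted CATE
quintile <- cut(tau_hat_demo,
                breaks = quantile(tau_hat_demo, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE, labels = paste0("Q", 1:5))

# Within each quintile: mean prediction, observed difference in means, true mean CATE
calib_table <- t(sapply(split(seq_len(n), quintile), function(idx) {
  c(predicted = mean(tau_hat_demo[idx]),
    observed  = mean(Y[idx][W[idx] == 1]) - mean(Y[idx][W[idx] == 0]),
    true      = mean(tau[idx]))
}))
print(round(calib_table, 2))
```

With real forest output, replace `tau_hat_demo` with the out-of-bag predictions; the `true` column is only available in a simulation.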

RATE (Rank-Weighted Average Treatment Effect) curve

The RATE is summarized from the Targeting Operator Characteristic (TOC) curve, which tracks the average treatment effect among the top q fraction of individuals (ranked by predicted CATE) as the fraction treated grows. Expected pattern:

Blue line (causal forest ranking): Starts high (~9.0 for the top 10%) and declines toward the ATE (~6.0) as more individuals are included, reflecting diminishing returns to expanding treatment. The estimated curve need not be exactly monotone.
Red dashed line (ATE): Horizontal at ~6.0, representing the effect of treating everyone.
Gray line (random ranking): Fluctuates around the ATE, showing no targeting ability.
The area between the forest ranking curve and the ATE line is the AUTOC (area under the TOC), measuring targeting quality. A larger AUTOC indicates that the forest ranks individuals by treatment benefit more successfully. Note that grf's plot() draws the TOC as the top-q average effect minus the ATE, so its curve starts near ~3.0 and falls to 0 at q = 1.
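
To make the definition concrete, the TOC and AUTOC can be computed by hand when the true effects are known. A minimal sketch in base R, assuming simulated true CATEs and a noisy hypothetical priority score (`priority_demo`); with real data the true effects are unobserved, and grf estimates these quantities from doubly robust scores instead.

```r
set.seed(1)
n <- 4000
tau <- 4.5 + 3 * (rnorm(n) > 0) - 0.1 * (runif(n, 25, 65) - 40)  # simplified true CATEs
priority_demo <- tau + rnorm(n, 0, 1)    # hypothetical ranking score

# TOC(q): average effect among the top-q fraction by priority, minus the overall ATE
ord <- order(priority_demo, decreasing = TRUE)
q_grid <- seq(0.05, 1, by = 0.05)
toc <- sapply(q_grid, function(q) {
  top <- ord[seq_len(ceiling(q * n))]
  mean(tau[top]) - mean(tau)
})

# AUTOC: area under the TOC curve, approximated by averaging over the q grid
autoc <- mean(toc)

plot(q_grid, toc, type = "l", xlab = "Fraction treated (q)",
     ylab = "TOC(q)", main = "TOC computed by hand")
abline(h = 0, lty = 2)
cat("AUTOC (grid approximation):", autoc, "\n")
```

A ranking with no information about the effects would give a TOC that hovers around zero and an AUTOC near zero.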

Step 6: Optimal Targeting Policy

Use CATE estimates to design a policy that treats only those who benefit most.

# Policy: treat the top 50% by estimated CATE
threshold <- median(tau_hat)
policy <- as.integer(tau_hat >= threshold)

# Evaluate policy using true CATEs (only possible in simulation)
cat("Average effect, targeted:", mean(tau_true[policy == 1]), "\n")
cat("Average effect, not targeted:", mean(tau_true[policy == 0]), "\n")
cat("Average effect, treating all:", mean(tau_true), "\n")

# Share of the true top 50% (oracle ranking) that the forest policy selects
oracle <- as.integer(tau_true >= median(tau_true))
cat("Fraction correctly identified:", mean(policy[oracle == 1]), "\n")

# Compare forest-based targeting with a simple rule (risk > 0)
simple_policy <- as.integer(risk > 0)
cat("\nSimple rule benefit:", mean(tau_true[simple_policy == 1]), "\n")
cat("Forest targeting benefit:", mean(tau_true[policy == 1]), "\n")

Expected output:

Policy	Avg. Effect for Targeted	Avg. Effect for Not Targeted	Gain from Targeting
Treat top 50% (forest)	~7.8	~4.2	~1.8 vs. treating all
Simple rule (risk > 0)	~7.5	~4.5	~1.5 vs. treating all

=== Optimal Targeting Policy (treat top 50%) ===
Average effect for those targeted:      ~7.8
Average effect for those NOT targeted:  ~4.2
Average effect if treating everyone:    ~6.0

Gain from targeting: ~1.8 per treated individual
Fraction correctly identified (true top 50%): ~80%

Simple rule (risk > 0) benefit:         ~7.5
Forest-based targeting benefit:         ~7.8

The forest-based targeting outperforms the simple rule because it combines information from risk, age, and education simultaneously, while the simple rule uses only one variable.
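
Outside a simulation the true CATEs are unavailable, so a policy has to be evaluated from observed outcomes. One standard option in an RCT is inverse-propensity weighting with the known assignment probability. A minimal sketch, assuming a constant 50% assignment probability and a simplified DGP; `policy_value_ipw` is a hypothetical helper, not a grf function.

```r
set.seed(1)
n <- 4000
risk <- rnorm(n)
age <- runif(n, 25, 65)
W <- rbinom(n, 1, 0.5)
tau <- 4.5 + 3 * (risk > 0) - 0.1 * (age - 40)   # simplified true CATE
mu0 <- 50 + 0.5 * age - 2 * risk                 # mean outcome under control
Y <- mu0 + W * tau + rnorm(n, 0, 5)

# Hypothetical helper: normalized IPW estimate of the mean outcome under a 0/1 policy.
# With a constant propensity of 0.5 the weights cancel, leaving the mean outcome
# among units whose randomized assignment happens to match the policy.
policy_value_ipw <- function(Y, W, policy) {
  agree <- W == policy
  sum(Y[agree]) / sum(agree)
}

simple_policy <- as.integer(risk > 0)
v_target <- policy_value_ipw(Y, W, simple_policy)
v_none   <- policy_value_ipw(Y, W, rep(0L, n))
v_true   <- mean(mu0 + simple_policy * tau)      # oracle value, simulation only

cat("Estimated value, targeted policy:", v_target, "\n")
cat("Estimated value, treat no one:   ", v_none, "\n")
cat("Oracle value, targeted policy:   ", v_true, "\n")
```

If the policy itself was learned from the same data (for example from forest predictions), this estimate tends to be optimistic, so a held-out sample is the safer choice for evaluation.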

Concept Check

You estimate CATEs using a causal forest on observational (non-experimental) data and find that older individuals have smaller treatment effects. A colleague suggests this finding could be driven by differential selection into treatment rather than true heterogeneity. How can you address this concern?

This concern is unwarranted because causal forests automatically adjust for confounding.
Validate the heterogeneity estimates using a separate experiment or a different identification strategy, or at minimum check that results are robust to alternative specifications.
Increase the number of trees in the forest.
Drop older individuals from the sample.

Exercises

Compare ML methods. Estimate CATEs using a causal forest, a T-learner (separate random forests for treated and control), and a linear interaction model. Which has the best RMSE for the true CATEs?
Out-of-sample validation. Split the data 50/50. Train the causal forest on the first half and evaluate CATE predictions on the second half using the calibration test.
Budget-constrained targeting. Suppose you can only treat 25% of the population. Design the optimal targeting policy and compute the expected average effect.
Nonlinear heterogeneity. Modify the DGP so that the treatment effect has a U-shape in age. Does the causal forest capture this pattern?

Expected output

If your code runs correctly, expect to see:

Average treatment effect (ATE): Around 5.0–7.0 (true ATE: approximately 6.0, depending on the sample distribution of risk, age, and educ)
CATE range: Predicted CATEs spanning roughly 2 to 11, reflecting true heterogeneity (true range: approximately 2.0 to 11.0)
Variable importance: Risk, age, and education ranked as the top variables driving heterogeneity; income, health_score, and female ranked lower
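Calibration test: Both coefficients should be close to 1. A "mean.forest.prediction" coefficient near 1 indicates that the average prediction matches the ATE; a "differential.forest.prediction" coefficient near 1, and significantly above 0, indicates that the forest captures real heterogeneity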
Subgroup analysis: Higher CATEs for younger individuals (age < 40), high-risk individuals (risk > 0), and highly educated (educ > 16)
RATE curve: The Targeting Operator Characteristic (TOC) curve plotted by grf should lie above zero (equivalently, the average effect among the top-ranked individuals should lie above the ATE), indicating better-than-random targeting
Treatment rate: Approximately 50% (randomized)
Sample size: 4,000 observations

Summary

In this lab you learned:

Causal forests estimate the conditional average treatment effect (CATE) as a function of covariates, moving beyond the ATE
The method uses honest splitting (separate samples for determining splits and estimating effects) to produce valid confidence intervals
Variable importance reveals which covariates drive treatment effect heterogeneity
Calibration tests and RATE curves assess whether estimated CATEs are predictive of actual treatment effect heterogeneity
Optimal targeting policies assign treatment to individuals with the highest predicted CATEs, potentially improving welfare
CATE estimates from observational data should be validated before informing policy, as heterogeneity may reflect differential selection rather than true effect variation
