Replication Lab: Causal Forests and Heterogeneous Treatment Effects
Replicate the key simulation results from Wager and Athey (2018) on causal forests. Simulate data with heterogeneous treatment effects, estimate ATE with OLS, estimate CATE with causal forests, assess variable importance, and conduct calibration tests.
Overview
In this replication lab, you will explore the methodology from two foundational papers on causal forests:
Athey, Susan, and Guido Imbens. 2016. "Recursive Partitioning for Heterogeneous Causal Effects." Proceedings of the National Academy of Sciences 113(27): 7353–7360.
Wager, Stefan, and Susan Athey. 2018. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." Journal of the American Statistical Association 113(523): 1228–1242.
Causal forests extend random forests from prediction to causal inference. Instead of predicting Y, a causal forest estimates the conditional average treatment effect (CATE): tau(x) = E[Y(1) - Y(0) | X = x]. The method splits the covariate space to maximize treatment effect heterogeneity rather than outcome prediction accuracy. Combined with honesty (using separate subsamples to choose splits and to estimate effects within leaves), causal forests produce consistent and pointwise asymptotically normal CATE estimates.
Why the Wager and Athey (2018) paper matters: It provided the first theoretically grounded method for estimating heterogeneous treatment effects using tree-based methods, complete with valid confidence intervals. The causal forest has become one of the most widely used tools for CATE estimation in economics.
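Because honesty is central to the method, a toy sketch may help fix ideas before the lab: one half of the sample chooses a single split point, and the other half estimates the within-leaf treatment effects on fresh data. This is a conceptual base-R illustration only, not the grf algorithm; all variable names are illustrative.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)                                 # one covariate
w <- rbinom(n, 1, 0.5)                        # randomized treatment
y <- x + ifelse(x > 0, 2, 0) * w + rnorm(n)   # true effect: 2 if x > 0, else 0

# Honest split: half A chooses the split, half B estimates the leaf effects
a <- sample(n, n / 2)
b <- setdiff(seq_len(n), a)
leaf_effect <- function(idx) mean(y[idx][w[idx] == 1]) - mean(y[idx][w[idx] == 0])

# On half A, pick the candidate split that maximizes effect heterogeneity
cands <- quantile(x[a], seq(0.1, 0.9, 0.1))
gap <- sapply(cands, function(s) abs(leaf_effect(a[x[a] > s]) - leaf_effect(a[x[a] <= s])))
split <- cands[which.max(gap)]

# On half B, estimate the two leaf effects with data not used for splitting
tau_left  <- leaf_effect(b[x[b] <= split])
tau_right <- leaf_effect(b[x[b] > split])
cat("split at", round(split, 2),
    "| tau_left =", round(tau_left, 2),
    "| tau_right =", round(tau_right, 2), "\n")
```

Without honesty, the same observations would both pick the split and estimate the effects, biasing the leaf estimates toward exaggerated heterogeneity; grf automates this sample splitting across many subsampled trees.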
What you will do:
- Simulate data with known treatment effect heterogeneity
- Estimate the ATE using OLS (ignoring heterogeneity)
- Estimate the CATE function using a causal forest
- Assess variable importance to identify effect modifiers
- Conduct calibration tests comparing estimated and true CATEs
Step 1: Simulate Data with Heterogeneous Treatment Effects
The DGP features a randomized experiment (treatment is independent of covariates) with treatment effect heterogeneity driven by X1 and X2.
library(grf)
library(data.table)
set.seed(2018)
# DGP parameters: 4000 observations, 10 covariates
n <- 4000; p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
# Randomized treatment with 50% probability
W <- rbinom(n, 1, 0.5)
# True CATE depends only on X1 and X2
tau_true <- 1 + 2 * X[,1] + X[,2]
# Baseline outcome is nonlinear in X (creates a challenging estimation problem)
mu_0 <- 2 * X[,1] + X[,2]^2 + 0.5 * X[,3] * X[,4] + sin(X[,5])
# Observed outcome = baseline + treatment effect * treatment + noise
Y <- mu_0 + tau_true * W + rnorm(n)
true_ate <- mean(tau_true)
cat("n =", n, ", p =", p, "\n")
cat("True ATE:", round(true_ate, 3), "\n")
cat("CATE range:", round(range(tau_true), 2), "\n")
Expected output:
n = 4000 , p = 10
True ATE: ~1.00 (tau(x) = 1 + 2*X1 + X2, so E[tau(X)] = 1)
CATE range: roughly -6.5 to 8.5
CATE sd: sqrt(5) ≈ 2.24
Treatment rate: ~50%
Step 2: Estimate ATE with OLS (Ignoring Heterogeneity)
Before using causal forests, start with a simple OLS estimate that captures the average effect but misses heterogeneity.
# Simple difference in means (unbiased in an RCT)
ols_simple <- lm(Y ~ W)
cat("Difference in means:", coef(ols_simple)["W"], "\n")
# OLS with all X controls (reduces residual variance, improves precision)
ols_ctrl <- lm(Y ~ W + X)
cat("OLS + controls:", coef(ols_ctrl)["W"], "\n")
cat("True ATE:", true_ate, "\n")
Expected output:
| Model | ATE | SE |
|---|---|---|
| Difference in means | ~1.00 | ~0.07 |
| OLS + controls | ~1.00 | ~0.05 |
| True ATE | ~1.00 | --- |
In a randomized experiment, why would you want to estimate the CATE function tau(x) rather than just the ATE?
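One way to see the answer: even a perfectly estimated ATE can mask large, policy-relevant differences across subgroups. A minimal base-R sketch with a simplified version of the DGP (names illustrative):

```r
set.seed(7)
n <- 10000
x1 <- rnorm(n)
w <- rbinom(n, 1, 0.5)
y <- (1 + 2 * x1) * w + rnorm(n)   # tau(x) = 1 + 2*x1, so ATE = 1

# The overall difference in means recovers the ATE...
ate_hat <- mean(y[w == 1]) - mean(y[w == 0])
# ...but the effect differs sharply across subgroups
hi <- x1 > 0
tau_hi <- mean(y[hi & w == 1]) - mean(y[hi & w == 0])
tau_lo <- mean(y[!hi & w == 1]) - mean(y[!hi & w == 0])
cat("ATE:", round(ate_hat, 2),
    "| tau(x1 > 0):", round(tau_hi, 2),
    "| tau(x1 <= 0):", round(tau_lo, 2), "\n")
```

Here treatment helps units with x1 > 0 far more than units with x1 <= 0 (true subgroup effects of about 2.6 and -0.6), so targeting treatment by x1 would beat treating everyone even though the ATE is positive.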
Step 3: Estimate CATE with Causal Forest
The causal forest estimates tau(x) = E[Y(1) - Y(0) | X = x] by splitting the covariate space to maximize treatment effect heterogeneity.
# Fit causal forest using grf (honest splitting for valid inference)
cf <- causal_forest(X, Y, W,
num.trees = 2000, # Number of trees
min.node.size = 5, # Minimum leaf size
seed = 42)
# Out-of-bag CATE predictions for each observation
tau_hat <- predict(cf)$predictions
# Forest-based ATE with valid confidence interval
ate_cf <- average_treatment_effect(cf)
cat("=== Causal Forest ===\n")
cat("ATE:", ate_cf["estimate"], "(SE:", ate_cf["std.err"], ")\n")
cat("True ATE:", true_ate, "\n")
# Evaluate CATE accuracy against the known true effects
mse <- mean((tau_hat - tau_true)^2)
corr <- cor(tau_hat, tau_true)
cat("CATE MSE:", round(mse, 4), "\n")
cat("CATE correlation:", round(corr, 4), "\n")
Expected output:
| Metric | Value |
|---|---|
| ATE (causal forest) | ~1.00 |
| ATE 95% CI | [0.87, 1.13] |
| True ATE | 1.00 |
| CATE MSE | ~0.45 |
| CATE correlation | ~0.92 |
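Beyond point estimates, grf reports pointwise variances from the infinitesimal jackknife via predict(..., estimate.variance = TRUE); these yield the confidence intervals that Wager and Athey (2018) justify. A self-contained sketch on a smaller simulation (requires the grf package; the coverage figure is indicative, not exact):

```r
library(grf)
set.seed(42)
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
tau <- 1 + 2 * X[, 1]                       # true CATE
Y <- X[, 1] + tau * W + rnorm(n)

cf_small <- causal_forest(X, Y, W, num.trees = 1000, seed = 42)
pred <- predict(cf_small, estimate.variance = TRUE)   # OOB predictions + variances
se_hat <- sqrt(pred$variance.estimates)               # infinitesimal-jackknife SEs

# Empirical coverage of the true CATE by pointwise 95% intervals
covered <- tau >= pred$predictions - 1.96 * se_hat &
           tau <= pred$predictions + 1.96 * se_hat
cat("Pointwise 95% CI coverage:", round(mean(covered), 3), "\n")
```

In finite samples coverage can fall somewhat short of the nominal 95%, which is exactly what the coverage-study extension exercise below investigates systematically.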
Step 4: Variable Importance and Calibration
# Variable importance: how much each covariate drives splitting
vi <- variable_importance(cf)
cat("=== Variable Importance ===\n")
vi_df <- data.frame(Variable = colnames(X), Importance = vi)
vi_df <- vi_df[order(-vi_df$Importance), ] # Sort by importance
print(vi_df)
# Omnibus calibration test: "mean.forest.prediction" near 1 means the average
# prediction is correct; "differential.forest.prediction" near 1 means the
# forest captures the underlying heterogeneity
cal_test <- test_calibration(cf)
cat("\n=== Calibration ===\n")
print(cal_test)
# Because tau_true is known in simulation, also regress truth on predictions
# (ideal: intercept 0, slope 1)
cal_lm <- lm(tau_true ~ tau_hat)
print(summary(cal_lm)$coefficients)
Expected output — Variable importance:
| Covariate | Importance | Corr with true tau | Note |
|---|---|---|---|
| X1 | ~0.85 | 0.89 | True effect modifier |
| X2 | ~0.42 | 0.45 | True effect modifier |
| X3–X5 | ~0.01–0.02 | 0.00 | Affect Y, but not tau |
| X6–X10 | ~0.01 | 0.00 | Pure noise |
Calibration (regression of tau_true on tau_hat):
Intercept: ~0.05 (ideal: 0)
Slope: ~0.95 (ideal: 1)
R-squared: ~0.85
The causal forest correctly identifies X1 and X2 as treatment effect modifiers, even though X3–X5 also affect Y. Why does the causal forest distinguish between outcome predictors and effect modifiers?
Step 5: Sorted-Group Analysis and Final Comparison
# Sorted-group analysis: bin observations by estimated CATE quintile
tau_q <- cut(tau_hat, breaks = quantile(tau_hat, seq(0,1,0.2)),
include.lowest = TRUE, labels = 1:5)
cat("=== CATE by Quintile ===\n")
# Compare predicted vs. true mean CATE within each quintile
for (q in 1:5) {
  idx <- tau_q == q
  # Within-quintile difference in means (valid here because W is randomized)
  dim_q <- mean(Y[idx & W == 1]) - mean(Y[idx & W == 0])
  cat("Q", q, ": hat =", round(mean(tau_hat[idx]), 3),
      ", true =", round(mean(tau_true[idx]), 3),
      ", diff-in-means =", round(dim_q, 3), "\n")
}
cat("\nATE (causal forest):", round(ate_cf["estimate"], 3), "\n")
cat("True ATE:", round(true_ate, 3), "\n")
Expected output — Sorted-group analysis:
| Quintile | Mean tau_hat | Mean tau_true | Diff-in-means |
|---|---|---|---|
| 1 (lowest) | -2.30 | -2.15 | -2.10 |
| 2 | -0.40 | -0.35 | -0.50 |
| 3 | 0.95 | 1.00 | 0.90 |
| 4 | 2.30 | 2.40 | 2.55 |
| 5 (highest) | 4.50 | 4.20 | 4.30 |
Wager and Athey (2018) prove that causal forests are pointwise consistent and asymptotically normal. What feature makes pointwise confidence intervals valid?
Summary
The replication of Wager and Athey (2018) demonstrates:
- Causal forests recover the CATE function. Estimated effects correlate highly (r > 0.9) with the true treatment effect function.
- Variable importance identifies true effect modifiers. The causal forest correctly distinguishes between covariates that drive heterogeneity and outcome-only predictors.
- Good calibration. Units predicted to have high effects truly do have high effects.
- Valid inference. Pointwise confidence intervals are enabled by honesty and the infinitesimal jackknife.
- Complementary to ATE. The causal forest recovers the ATE and additionally reveals the full distribution of treatment effects.
Extension Exercises
- Binary treatment effect. Set tau(x) as a step function: tau = 3 if X1 > 0, tau = -1 otherwise. How does the causal forest handle discontinuities?
- Observational study. Make treatment depend on covariates. Compare the causal forest with and without propensity score adjustment.
- Best linear projection. Use the best_linear_projection function in grf to recover a linear approximation of tau(x). Compare with the true CATE coefficients.
- Policy learning. Construct an optimal treatment assignment rule from the estimated CATE. Calculate the value of targeting the top 50% vs. treating everyone.
- Larger p. Increase covariates to p = 50 or 200 while keeping 2 true effect modifiers. How does variable selection degrade?
- Coverage study. Run 200 Monte Carlo simulations and check pointwise CI coverage. Does coverage approach 95%?
- BART comparison. Fit Bayesian Additive Regression Trees (BART) and compare CATE estimates. Discuss relative advantages.
- Real data application. Apply the causal forest to an RCT dataset and identify subgroups with the largest and smallest treatment effects.