Lab: Double/Debiased Machine Learning
Estimate causal effects in the presence of many confounders using double/debiased machine learning. Learn why naive ML prediction fails for causal inference, how cross-fitting solves the overfitting problem, and how to implement the Chernozhukov et al. (2018) DML estimator.
Overview
In this lab you will estimate the causal effect of a binary treatment on an outcome when there are many potential confounders. You will see why simply plugging machine learning predictions into a regression produces biased estimates, and how the DML framework of Chernozhukov et al. (2018) solves this problem using cross-fitting and Neyman orthogonality.
What you will learn:
- Why naive ML-based adjustment produces biased causal estimates (regularization bias)
- How cross-fitting prevents overfitting from contaminating causal inference
- How to implement the partially linear model (PLR) and interactive regression model (IRM) using DML
- How to compare DML with OLS and naive ML approaches
- How to conduct sensitivity analysis for unobserved confounding
Prerequisites: Familiarity with OLS regression and basic machine learning concepts (random forests, cross-validation). Understanding of the potential outcomes framework is helpful.
Step 1: Simulate High-Dimensional Data
We create an observational dataset with 20 covariates, where treatment depends on a subset of them and the outcome depends on a partially overlapping subset.
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(ranger)
set.seed(42)
n <- 3000; p <- 20
X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("X", 1:p)
logit_e <- 0.5 * X[,1] - 0.3 * X[,2]^2 + 0.4 * X[,3] * X[,4] + 0.2 * X[,5]
e_true <- plogis(logit_e)
D <- rbinom(n, 1, e_true)
g0 <- 2 * X[,1] + X[,2]^2 - 1.5 * X[,3] + 0.5 * X[,4] * X[,5] +
0.8 * sin(X[,6]) - 0.6 * X[,7] + 0.3 * X[,8]
Y <- 2.0 * D + g0 + rnorm(n)
df <- as.data.frame(X)
df$D <- D; df$Y <- Y
cat("True ATE: 2.0\n")
cat("Naive diff:", mean(Y[D == 1]) - mean(Y[D == 0]), "\n")Expected output:
| Statistic | Value |
|---|---|
| Sample size | 3,000 |
| Number of covariates | 20 |
| Treatment rate | ~45–55% |
| True ATE | 2.000 |
| Naive difference in means | ~2.5–3.5 (biased upward) |
Sample data preview (first 5 rows):
| X1 | X2 | X3 | X4 | X5 | ... | D | Y |
|---|---|---|---|---|---|---|---|
| 0.50 | -0.14 | 0.65 | 1.52 | -0.23 | ... | 1 | 5.82 |
| -0.14 | 0.77 | -0.46 | -0.19 | 0.31 | ... | 0 | 1.03 |
| 0.65 | -0.47 | 1.01 | 0.39 | -0.81 | ... | 1 | 4.21 |
| 1.52 | 0.54 | -0.32 | 0.15 | 0.68 | ... | 1 | 8.37 |
| -0.23 | -1.15 | 0.23 | -0.78 | 0.42 | ... | 0 | -2.15 |
Summary statistics:
| Variable | Mean (D=1) | Mean (D=0) | Difference |
|---|---|---|---|
| Y | ~4.5 | ~1.5 | ~3.0 (biased) |
| X1 | ~0.25 | ~-0.20 | Imbalanced |
| X2 | ~-0.05 | ~0.05 | Slight imbalance |
The naive difference in means exceeds the true ATE of 2.0 because treated units tend to have higher values of X1 (which directly increases Y through g0).
Step 2: Why Naive ML Fails
A natural but flawed approach: use ML to predict Y from (D, X), then read off the implied effect of D (for OLS, the coefficient on D; for a random forest, the average difference in predictions at D = 1 versus D = 0).
# OLS with linear controls
ols <- lm(Y ~ D + ., data = df)
cat("OLS:", coef(ols)["D"], "\n")
# Lasso with linear terms (glmnet assumed installed); the penalty
# shrinks the coefficient on D along with everything else
library(glmnet)
x_mat <- as.matrix(df[, setdiff(names(df), "Y")])
cv_lasso <- cv.glmnet(x_mat, df$Y)
cat("Lasso:", coef(cv_lasso, s = "lambda.min")["D", ], "\n")
# Naive RF: predict Y from (D, X), then average the prediction difference
rf_naive <- ranger(Y ~ ., data = df, num.trees = 200)
df_d1 <- df; df_d1$D <- 1
df_d0 <- df; df_d0$D <- 0
naive_ml <- mean(predict(rf_naive, df_d1)$predictions -
                 predict(rf_naive, df_d0)$predictions)
cat("Naive RF:", naive_ml, "\n")
cat("True: 2.0\n")
Expected output:
| Approach | Estimate | True ATE | Bias |
|---|---|---|---|
| OLS (linear controls) | ~1.8–2.2 | 2.0 | Small (misses nonlinear g0) |
| Lasso (linear) | ~1.7–2.1 | 2.0 | Some regularization bias |
| Naive RF "effect" | ~1.5–1.9 | 2.0 | Attenuated toward zero |
OLS with linear controls can perform reasonably well here but misses the nonlinear terms in g0 (the squared X2 term, the X4*X5 interaction, and sin(X6)). The naive RF approach suffers from regularization bias: the implied treatment effect is shrunk toward zero because the random forest does not distinguish the causal variable D from the confounders X, and regularizes its contribution just like any other predictor's.
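Regularization bias can be isolated in a toy example. The sketch below uses ridge regression (chosen for transparency, not the lab's random forest) on a hand-rolled two-variable DGP: the penalty shrinks the coefficient on D toward zero exactly as it shrinks any other coefficient. All names are illustrative.

```r
# Toy illustration of regularization bias: ridge shrinks the D coefficient
set.seed(7)
n  <- 2000
X1 <- rnorm(n)
D  <- rbinom(n, 1, plogis(X1))   # treatment confounded by X1
Y  <- 2 * D + 2 * X1 + rnorm(n)  # true effect of D is 2

Z  <- cbind(D = D, X1 = X1)
Yc <- Y - mean(Y)                # center so no intercept is needed
Zc <- scale(Z, scale = FALSE)
ridge_coef <- function(lambda)   # closed-form ridge: (Z'Z + lambda I)^-1 Z'y
  solve(crossprod(Zc) + lambda * diag(2), crossprod(Zc, Yc))

ridge_coef(0)["D", ]     # lambda = 0 is OLS: close to 2
ridge_coef(500)["D", ]   # heavy penalty: shrunk well below 2
```

The random forest in the lab has no explicit penalty term, but its implicit regularization (tree depth limits, averaging) attenuates the treatment "coefficient" in the same way.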
You use a random forest to predict Y from (D, X) and compute the average prediction difference between D = 1 and D = 0. This prediction-based approach gives a biased estimate of the ATE. What explains the bias?
Step 3: The DML Estimator with Cross-Fitting
DML uses two key ideas: (1) Neyman orthogonality to remove sensitivity to nuisance parameter estimation, and (2) cross-fitting to prevent overfitting.
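Before letting the package do the work, it helps to see the mechanics by hand. The sketch below implements cross-fitted partialling-out on a small toy DGP (separate from the lab data); for brevity, quadratic `lm` fits stand in for the ML nuisance learners, and all names are illustrative.

```r
# Hand-rolled DML-PLR: cross-fitted partialling-out on a toy DGP
set.seed(42)
n  <- 3000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.5 * X1))       # confounded treatment
Y  <- 2 * D + X1 + X2^2 + rnorm(n)         # true effect = 2
dat <- data.frame(Y, D, X1, X2, X3)

K <- 5
folds <- sample(rep(1:K, length.out = n))  # random fold assignment
res_Y <- numeric(n); res_D <- numeric(n)

for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  # nuisance fits use training folds only (quadratic lm as a stand-in for ML)
  fit_Y <- lm(Y ~ poly(X1, 2) + poly(X2, 2) + poly(X3, 2), data = train)
  fit_D <- lm(D ~ poly(X1, 2) + poly(X2, 2) + poly(X3, 2), data = train)
  # out-of-fold residuals: predictions never touch the data they were fit on
  res_Y[folds == k] <- test$Y - predict(fit_Y, newdata = test)
  res_D[folds == k] <- test$D - predict(fit_D, newdata = test)
}

# final stage: regress outcome residuals on treatment residuals
theta_hat <- sum(res_D * res_Y) / sum(res_D^2)
theta_hat   # should be close to the true effect of 2
```

This is essentially what DoubleMLPLR automates below, adding proper standard errors and support for repeated splits.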
# DML using the DoubleML package
# Prepare data object
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D",
x_cols = paste0("X", 1:p))
# Choose ML learners
ml_l <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
# Partially Linear Model
dml_plr <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5)
dml_plr$fit()
print(dml_plr)
cat("\nDML estimate:", dml_plr$coef, "\n")
cat("SE:", dml_plr$se, "\n")
cat("True: 2.0\n")Expected output:
| DML Partially Linear Model | Value |
|---|---|
| beta_hat (ATE) | ~1.85–2.15 |
| Standard error | ~0.06–0.10 |
| 95% CI lower | ~1.75 |
| 95% CI upper | ~2.25 |
| True ATE | 2.000 |
| Covers true value? | Yes |
The DML estimate should be close to the true ATE of 2.0, with a valid confidence interval that covers the true value. Unlike the naive approaches, DML correctly handles the nonlinear confounding through cross-fitted ML residualization.
Step 4: Compare DML with OLS
# Multiple DML repetitions
dml_plr_multi <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5, n_rep = 50)
dml_plr_multi$fit()
cat("DML (median over 50 reps):", median(dml_plr_multi$all_coef), "\n")
cat("OLS:", coef(ols)["D"], "\n")
cat("True: 2.0\n")
# Confidence intervals
dml_plr_multi$confint()
| Estimator | Estimate | SD across splits |
|---|---|---|
| DML (median over 50 reps) | ~2.00 | ~0.06 |
| OLS (linear controls) | ~1.9–2.1 | — |
| True ATE | 2.000 | — |
Step 5: Interactive Regression Model (IRM)
The IRM variant is useful when the treatment effect may be heterogeneous. It uses an AIPW-style score function.
# IRM variant in DoubleML
ml_g_irm <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m_irm <- lrn("classif.ranger", num.trees = 200, max.depth = 10)
dml_irm <- DoubleMLIRM$new(dml_data, ml_g_irm, ml_m_irm, n_folds = 5)
dml_irm$fit()
print(dml_irm)
cat("IRM estimate:", dml_irm$coef, "\n")
cat("PLR estimate:", dml_plr$coef, "\n")Expected output:
| DML-IRM (AIPW) | Value |
|---|---|
| ATE estimate | ~1.85–2.15 |
| Standard error | ~0.06–0.10 |
| 95% CI lower | ~1.75 |
| 95% CI upper | ~2.25 |
| True ATE | 2.000 |
| Comparison of Estimators | Estimate |
|---|---|
| Naive difference | ~2.5–3.5 (biased) |
| OLS (linear controls) | ~1.8–2.2 |
| Naive RF | ~1.5–1.9 (attenuated) |
| DML-PLR | ~1.9–2.1 |
| DML-IRM (AIPW) | ~1.9–2.1 |
| True ATE | 2.000 |
The IRM estimate uses an AIPW-style score function, which is doubly robust: it is consistent if either the outcome model or the propensity score model is correctly specified. In this DGP with a constant treatment effect, the PLR and IRM estimates should be similar.
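To make the "AIPW-style score" concrete, the sketch below writes it out and averages it on a fresh toy DGP, plugging in the true nuisance functions where the IRM would use cross-fitted ML predictions. All names are illustrative.

```r
# AIPW score for the ATE, with oracle (true) nuisance functions:
#   psi_i = g1(X_i) - g0(X_i) + D_i (Y_i - g1(X_i)) / m(X_i)
#                             - (1 - D_i) (Y_i - g0(X_i)) / (1 - m(X_i))
set.seed(1)
n  <- 5000
X  <- rnorm(n)
m  <- plogis(X)            # true propensity score P(D = 1 | X)
D  <- rbinom(n, 1, m)
g0 <- X^2                  # true E[Y | D = 0, X]
g1 <- X^2 + 2              # true E[Y | D = 1, X]; true ATE = 2
Y  <- ifelse(D == 1, g1, g0) + rnorm(n)

psi <- g1 - g0 + D * (Y - g1) / m - (1 - D) * (Y - g0) / (1 - m)
ate_hat <- mean(psi)       # ATE estimate: sample mean of the score
se_hat  <- sd(psi) / sqrt(n)
c(ate_hat, se_hat)         # ate_hat should be near 2
```

The double robustness is visible in the score: if g0 and g1 are correct, the weighted residual terms have mean zero; if instead m is correct, the residual terms cancel the error in g1 - g0 on average.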
DML uses cross-fitting (sample splitting). Why not just split the data in half, estimate nuisance functions on one half, and estimate the causal parameter on the other?
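One way to explore this question numerically: a single half-split avoids overfitting bias but discards half the sample in the final stage, while 2-fold cross-fitting swaps the roles of the halves so every observation contributes. A base-R sketch on a toy DGP (quadratic `lm` fits stand in for ML; all names illustrative):

```r
# Half-split vs 2-fold cross-fit: same bias protection, different precision
set.seed(3)
n  <- 4000
X1 <- rnorm(n); X2 <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.5 * X1))
Y  <- 2 * D + X1 + X2^2 + rnorm(n)         # true effect = 2
dat  <- data.frame(Y, D, X1, X2)
half <- 1:(n / 2); rest <- (n / 2 + 1):n

resid_on <- function(train, test) {        # out-of-sample residualization
  fY <- lm(Y ~ poly(X1, 2) + poly(X2, 2), data = dat[train, ])
  fD <- lm(D ~ poly(X1, 2) + poly(X2, 2), data = dat[train, ])
  list(rY = dat$Y[test] - predict(fY, dat[test, ]),
       rD = dat$D[test] - predict(fD, dat[test, ]))
}
theta_se <- function(r) {                  # final stage + sandwich SE
  th <- sum(r$rD * r$rY) / sum(r$rD^2)
  se <- sqrt(sum((r$rD * (r$rY - th * r$rD))^2)) / sum(r$rD^2)
  c(theta = th, se = se)
}

# (a) half-split: nuisances on first half, final stage on second half only
a <- theta_se(resid_on(half, rest))
# (b) 2-fold cross-fit: swap roles so all n observations enter the final stage
r1 <- resid_on(half, rest); r2 <- resid_on(rest, half)
b <- theta_se(list(rY = c(r1$rY, r2$rY), rD = c(r1$rD, r2$rD)))
rbind(half_split = a, cross_fit = b)  # cross-fit SE roughly 1/sqrt(2) smaller
```

Both estimates are centered on the truth; the difference is efficiency, which is why DML cross-fits rather than discarding the nuisance-estimation folds.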
Exercises
- Try different ML learners. Replace random forests with gradient boosting, LASSO, or neural networks for the nuisance functions. How sensitive is the DML estimate to the choice of learner?
- Vary the number of confounders. Increase p to 100 (with only 8 truly relevant) and add irrelevant noise variables. Does DML still work?
- Group ATE. Estimate group-specific treatment effects (e.g., for above-median vs. below-median X1) using DML. Compare with a linear interaction model.
- Sensitivity analysis. Implement the Chernozhukov et al. (2022) sensitivity analysis framework to assess how robust the DML estimate is to unobserved confounding.
Summary
In this lab you learned:
- Naive ML adjustment for confounders produces biased causal estimates due to regularization bias and overfitting
- DML overcomes these problems using Neyman-orthogonal score functions and K-fold cross-fitting
- The partially linear model (PLR) assumes a constant additive treatment effect; the interactive regression model (IRM) allows for heterogeneous effects
- Cross-fitting ensures all observations are used efficiently while preventing overfitting bias
- Multiple random splits (repetitions) reduce sensitivity to any particular fold assignment
- DML is a general framework that works with any ML method (random forests, LASSO, neural networks, boosting) for nuisance estimation