Lab: Double/Debiased Machine Learning
Estimate causal effects in the presence of many confounders using double/debiased machine learning. Learn why naive ML prediction fails for causal inference, how cross-fitting solves the overfitting problem, and how to implement the Chernozhukov et al. (2018) DML estimator.
Overview
In this lab you will estimate the causal effect of a binary treatment on an outcome when there are many potential confounders. You will see why simply plugging machine learning predictions into a regression produces biased estimates, and how the DML framework of Chernozhukov et al. (2018) solves this problem using cross-fitting and Neyman orthogonality.
What you will learn:
- Why naive ML-based adjustment produces biased causal estimates (regularization bias)
- How cross-fitting prevents overfitting from contaminating causal inference
- How to implement the partially linear model (PLR) and interactive regression model (IRM) using DML
- How to compare DML with OLS and naive ML approaches
- How to conduct sensitivity analysis for unobserved confounding
Prerequisites: Familiarity with OLS regression and basic machine learning concepts (random forests, cross-validation). Understanding of the potential outcomes framework is helpful.
Step 1: Simulate High-Dimensional Data
We create an observational dataset with 20 covariates, where treatment depends on a subset of them and the outcome depends on a partially overlapping subset.
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(ranger)
set.seed(42)
n <- 3000; p <- 20
X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("X", 1:p)
logit_e <- 0.5 * X[,1] - 0.3 * X[,2]^2 + 0.4 * X[,3] * X[,4] + 0.2 * X[,5]
e_true <- plogis(logit_e)
D <- rbinom(n, 1, e_true)
g0 <- 2 * X[,1] + X[,2]^2 - 1.5 * X[,3] + 0.5 * X[,4] * X[,5] +
0.8 * sin(X[,6]) - 0.6 * X[,7] + 0.3 * X[,8]
Y <- 2.0 * D + g0 + rnorm(n)
df <- as.data.frame(X)
df$D <- D; df$Y <- Y
cat("True ATE: 2.0\n")
cat("Naive diff:", mean(Y[D == 1]) - mean(Y[D == 0]), "\n")Expected output:
| Statistic | Value |
|---|---|
| Sample size | 3,000 |
| Number of covariates | 20 |
| Treatment rate | ~45–55% |
| True ATE | 2.000 |
| Naive difference in means | ~2.5–3.5 (biased upward) |
Sample data preview (first 5 rows):
| X1 | X2 | X3 | X4 | X5 | ... | D | Y |
|---|---|---|---|---|---|---|---|
| 0.50 | -0.14 | 0.65 | 1.52 | -0.23 | ... | 1 | 5.82 |
| -0.14 | 0.77 | -0.46 | -0.19 | 0.31 | ... | 0 | 1.03 |
| 0.65 | -0.47 | 1.01 | 0.39 | -0.81 | ... | 1 | 4.21 |
| 1.52 | 0.54 | -0.32 | 0.15 | 0.68 | ... | 1 | 8.37 |
| -0.23 | -1.15 | 0.23 | -0.78 | 0.42 | ... | 0 | -2.15 |
Summary statistics:
| Variable | Mean (D=1) | Mean (D=0) | Difference |
|---|---|---|---|
| Y | ~4.5 | ~1.5 | ~3.0 (biased) |
| X1 | ~0.25 | ~-0.20 | Imbalanced |
| X2 | ~-0.05 | ~0.05 | Slight imbalance |
The naive difference in means exceeds the true ATE of 2.0 because treated units tend to have higher values of X1 (which directly increases Y through g0).
Step 2: Why Naive ML Fails
A natural but flawed approach: use ML to predict Y from (D, X), then read off the implied effect of D (for OLS, the coefficient on D; for a random forest, the average difference in predictions at D = 1 versus D = 0).
# OLS with linear controls
ols <- lm(Y ~ D + ., data = df)
cat("OLS:", coef(ols)["D"], "\n")
# Lasso with linear terms (glmnet assumed installed); the penalty
# shrinks the coefficient on D along with everything else
library(glmnet)
x_mat <- as.matrix(df[, setdiff(names(df), "Y")])
cv_lasso <- cv.glmnet(x_mat, df$Y)
cat("Lasso:", coef(cv_lasso, s = "lambda.min")["D", ], "\n")
# Naive RF: predict Y from (D, X), then average the prediction difference
rf_naive <- ranger(Y ~ ., data = df, num.trees = 200)
df_d1 <- df; df_d1$D <- 1
df_d0 <- df; df_d0$D <- 0
naive_ml <- mean(predict(rf_naive, df_d1)$predictions -
                 predict(rf_naive, df_d0)$predictions)
cat("Naive RF:", naive_ml, "\n")
cat("True: 2.0\n")
Expected output:
| Approach | Estimate | True ATE | Bias |
|---|---|---|---|
| OLS (linear controls) | ~1.8–2.2 | 2.0 | Small (misses nonlinear g0) |
| Lasso (linear) | ~1.7–2.1 | 2.0 | Some regularization bias |
| Naive RF "effect" | ~1.5–1.9 | 2.0 | Attenuated toward zero |
OLS with linear controls can perform reasonably well here but misses the nonlinear terms in g0 (the squared X2 term, the X4*X5 interaction, and sin(X6)). The naive RF approach suffers from regularization bias: the implied treatment effect is shrunk toward zero because the random forest does not distinguish the causal variable D from the confounders X, and regularizes its contribution just like any other predictor's.
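Regularization bias can be isolated in a toy example. The sketch below uses ridge regression (chosen for transparency, not the lab's random forest) on a hand-rolled two-variable DGP: the penalty shrinks the coefficient on D toward zero exactly as it shrinks any other coefficient. All names are illustrative.

```r
# Toy illustration of regularization bias: ridge shrinks the D coefficient
set.seed(7)
n  <- 2000
X1 <- rnorm(n)
D  <- rbinom(n, 1, plogis(X1))   # treatment confounded by X1
Y  <- 2 * D + 2 * X1 + rnorm(n)  # true effect of D is 2

Z  <- cbind(D = D, X1 = X1)
Yc <- Y - mean(Y)                # center so no intercept is needed
Zc <- scale(Z, scale = FALSE)
ridge_coef <- function(lambda)   # closed-form ridge: (Z'Z + lambda I)^-1 Z'y
  solve(crossprod(Zc) + lambda * diag(2), crossprod(Zc, Yc))

ridge_coef(0)["D", ]     # lambda = 0 is OLS: close to 2
ridge_coef(500)["D", ]   # heavy penalty: shrunk well below 2
```

The random forest in the lab has no explicit penalty term, but its implicit regularization (tree depth limits, averaging) attenuates the treatment "coefficient" in the same way.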
You use a random forest to predict Y from (D, X) and compute the average prediction difference between D = 1 and D = 0. This prediction-based approach gives a biased estimate of the ATE. What explains the bias?
Step 3: The DML Estimator with Cross-Fitting
DML uses two key ideas: (1) Neyman orthogonality to remove sensitivity to nuisance parameter estimation, and (2) cross-fitting to prevent overfitting.
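Before letting the package do the work, it helps to see the mechanics by hand. The sketch below implements cross-fitted partialling-out on a small toy DGP (separate from the lab data); for brevity, quadratic `lm` fits stand in for the ML nuisance learners, and all names are illustrative.

```r
# Hand-rolled DML-PLR: cross-fitted partialling-out on a toy DGP
set.seed(42)
n  <- 3000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.5 * X1))       # confounded treatment
Y  <- 2 * D + X1 + X2^2 + rnorm(n)         # true effect = 2
dat <- data.frame(Y, D, X1, X2, X3)

K <- 5
folds <- sample(rep(1:K, length.out = n))  # random fold assignment
res_Y <- numeric(n); res_D <- numeric(n)

for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  # nuisance fits use training folds only (quadratic lm as a stand-in for ML)
  fit_Y <- lm(Y ~ poly(X1, 2) + poly(X2, 2) + poly(X3, 2), data = train)
  fit_D <- lm(D ~ poly(X1, 2) + poly(X2, 2) + poly(X3, 2), data = train)
  # out-of-fold residuals: predictions never touch the data they were fit on
  res_Y[folds == k] <- test$Y - predict(fit_Y, newdata = test)
  res_D[folds == k] <- test$D - predict(fit_D, newdata = test)
}

# final stage: regress outcome residuals on treatment residuals
theta_hat <- sum(res_D * res_Y) / sum(res_D^2)
theta_hat   # should be close to the true effect of 2
```

This is essentially what DoubleMLPLR automates below, adding proper standard errors and support for repeated splits.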
# DML using the DoubleML package
# Prepare data object
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D",
x_cols = paste0("X", 1:p))
# Choose ML learners
ml_l <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
# Partially Linear Model
dml_plr <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5)
dml_plr$fit()
print(dml_plr)
cat("\nDML estimate:", dml_plr$coef, "\n")
cat("SE:", dml_plr$se, "\n")
cat("True: 2.0\n")Expected output:
| DML Partially Linear Model | Value |
|---|---|
| beta_hat (ATE) | ~1.85–2.15 |
| Standard error | ~0.06–0.10 |
| 95% CI lower | ~1.75 |
| 95% CI upper | ~2.25 |
| True ATE | 2.000 |
| Covers true value? | Yes |
The DML estimate should be close to the true ATE of 2.0, with a valid confidence interval that covers the true value. Unlike the naive approaches, DML correctly handles the nonlinear confounding through cross-fitted ML residualization.
Step 4: Compare DML with OLS
# Multiple DML repetitions
dml_plr_multi <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5, n_rep = 50)
dml_plr_multi$fit()
cat("DML (median over 50 reps):", median(dml_plr_multi$all_coef), "\n")
cat("OLS:", coef(ols)["D"], "\n")
cat("True: 2.0\n")
# Confidence intervals
dml_plr_multi$confint()
| Estimator | Estimate | SD across splits |
|---|---|---|
| DML (median over 50 reps) | ~2.00 | ~0.06 |
| OLS (linear controls) | ~1.9–2.1 | — |
| True ATE | 2.000 | — |
Step 5: Interactive Regression Model (IRM)
The IRM variant is useful when the treatment effect may be heterogeneous. It uses an AIPW-style score function.
# IRM variant in DoubleML
ml_g_irm <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m_irm <- lrn("classif.ranger", num.trees = 200, max.depth = 10)
dml_irm <- DoubleMLIRM$new(dml_data, ml_g_irm, ml_m_irm, n_folds = 5)
dml_irm$fit()
print(dml_irm)
cat("IRM estimate:", dml_irm$coef, "\n")
cat("PLR estimate:", dml_plr$coef, "\n")Expected output:
| DML-IRM (AIPW) | Value |
|---|---|
| ATE estimate | ~1.85–2.15 |
| Standard error | ~0.06–0.10 |
| 95% CI lower | ~1.75 |
| 95% CI upper | ~2.25 |
| True ATE | 2.000 |
| Comparison of Estimators | Estimate |
|---|---|
| Naive difference | ~2.5–3.5 (biased) |
| OLS (linear controls) | ~1.8–2.2 |
| Naive RF | ~1.5–1.9 (attenuated) |
| DML-PLR | ~1.9–2.1 |
| DML-IRM (AIPW) | ~1.9–2.1 |
| True ATE | 2.000 |
The IRM estimate uses an AIPW-style score function, which is doubly robust: it is consistent if either the outcome model or the propensity score model is correctly specified. In this DGP with a constant treatment effect, the PLR and IRM estimates should be similar.
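To make the "AIPW-style score" concrete, the sketch below writes it out and averages it on a fresh toy DGP, plugging in the true nuisance functions where the IRM would use cross-fitted ML predictions. All names are illustrative.

```r
# AIPW score for the ATE, with oracle (true) nuisance functions:
#   psi_i = g1(X_i) - g0(X_i) + D_i (Y_i - g1(X_i)) / m(X_i)
#                             - (1 - D_i) (Y_i - g0(X_i)) / (1 - m(X_i))
set.seed(1)
n  <- 5000
X  <- rnorm(n)
m  <- plogis(X)            # true propensity score P(D = 1 | X)
D  <- rbinom(n, 1, m)
g0 <- X^2                  # true E[Y | D = 0, X]
g1 <- X^2 + 2              # true E[Y | D = 1, X]; true ATE = 2
Y  <- ifelse(D == 1, g1, g0) + rnorm(n)

psi <- g1 - g0 + D * (Y - g1) / m - (1 - D) * (Y - g0) / (1 - m)
ate_hat <- mean(psi)       # ATE estimate: sample mean of the score
se_hat  <- sd(psi) / sqrt(n)
c(ate_hat, se_hat)         # ate_hat should be near 2
```

The double robustness is visible in the score: if g0 and g1 are correct, the weighted residual terms have mean zero; if instead m is correct, the residual terms cancel the error in g1 - g0 on average.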
DML uses cross-fitting (sample splitting). Why not just split the data in half, estimate nuisance functions on one half, and estimate the causal parameter on the other?
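One way to explore this question numerically: a single half-split avoids overfitting bias but discards half the sample in the final stage, while 2-fold cross-fitting swaps the roles of the halves so every observation contributes. A base-R sketch on a toy DGP (quadratic `lm` fits stand in for ML; all names illustrative):

```r
# Half-split vs 2-fold cross-fit: same bias protection, different precision
set.seed(3)
n  <- 4000
X1 <- rnorm(n); X2 <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.5 * X1))
Y  <- 2 * D + X1 + X2^2 + rnorm(n)         # true effect = 2
dat  <- data.frame(Y, D, X1, X2)
half <- 1:(n / 2); rest <- (n / 2 + 1):n

resid_on <- function(train, test) {        # out-of-sample residualization
  fY <- lm(Y ~ poly(X1, 2) + poly(X2, 2), data = dat[train, ])
  fD <- lm(D ~ poly(X1, 2) + poly(X2, 2), data = dat[train, ])
  list(rY = dat$Y[test] - predict(fY, dat[test, ]),
       rD = dat$D[test] - predict(fD, dat[test, ]))
}
theta_se <- function(r) {                  # final stage + sandwich SE
  th <- sum(r$rD * r$rY) / sum(r$rD^2)
  se <- sqrt(sum((r$rD * (r$rY - th * r$rD))^2)) / sum(r$rD^2)
  c(theta = th, se = se)
}

# (a) half-split: nuisances on first half, final stage on second half only
a <- theta_se(resid_on(half, rest))
# (b) 2-fold cross-fit: swap roles so all n observations enter the final stage
r1 <- resid_on(half, rest); r2 <- resid_on(rest, half)
b <- theta_se(list(rY = c(r1$rY, r2$rY), rD = c(r1$rD, r2$rD)))
rbind(half_split = a, cross_fit = b)  # cross-fit SE roughly 1/sqrt(2) smaller
```

Both estimates are centered on the truth; the difference is efficiency, which is why DML cross-fits rather than discarding the nuisance-estimation folds.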
Exercises
- Try different ML learners. Replace random forests with gradient boosting, LASSO, or neural networks for the nuisance functions. How sensitive is the DML estimate to the choice of learner?
- Vary the number of confounders. Increase p to 100 (with only 8 truly relevant) and add irrelevant noise variables. Does DML still work?
- Group ATE. Estimate group-specific treatment effects (e.g., for above-median vs. below-median X1) using DML. Compare with a linear interaction model.
- Sensitivity analysis. Implement the Chernozhukov et al. (2022) sensitivity analysis framework to assess how robust the DML estimate is to unobserved confounding.
Summary
In this lab you learned:
- Naive ML adjustment for confounders produces biased causal estimates due to regularization bias and overfitting
- DML overcomes these problems using Neyman-orthogonal score functions and K-fold cross-fitting
- The partially linear model (PLR) assumes a constant additive treatment effect; the interactive regression model (IRM) allows for heterogeneous effects
- Cross-fitting ensures all observations are used efficiently while preventing overfitting bias
- Multiple random splits (repetitions) reduce sensitivity to any particular fold assignment
- DML is a general framework that works with any ML method (random forests, LASSO, neural networks, boosting) for nuisance estimation