Replication Lab: Double/Debiased Machine Learning
Replicate the key results from Chernozhukov et al. (2018) on double/debiased machine learning. Simulate a high-dimensional data-generating process, show the failure of naive ML estimation, implement the partial linear DML estimator with cross-fitting, and compare estimators.
Overview
In this replication lab, you will explore the core methodology from one of the most important papers in modern causal inference:
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal 21(1): C1–C68.
The central problem addressed by DML: when you use machine learning (lasso, random forests, etc.) to control for high-dimensional confounders, naive "plug-in" estimators are biased because regularization introduces systematic errors. The DML framework solves the bias problem through two key innovations: (1) Neyman orthogonalization (the "double" in double ML), which makes the estimator insensitive to small errors in nuisance parameter estimation, and (2) cross-fitting, which avoids overfitting bias when the same data are used to estimate nuisance parameters and the target parameter.
Why the Chernozhukov et al. paper matters: It provided a rigorous, general framework for combining machine learning with causal inference, enabling researchers to use flexible ML methods while retaining valid statistical inference about treatment effects.
What you will do:
- Simulate a high-dimensional partially linear model
- Estimate the treatment effect using naive OLS (omitting confounders)
- Estimate using "naive ML" (plug-in lasso without orthogonalization)
- Implement the DML estimator with cross-fitting
- Run a Monte Carlo study demonstrating root-n consistency and correct coverage
- Compare all estimators to the true treatment effect and the oracle
Step 1: Simulate a High-Dimensional Partially Linear Model
The partially linear model separates the treatment variable D from the high-dimensional controls X. The outcome Y depends on D (linearly) and on X (nonlinearly through g). The treatment D also depends on X (through m), creating confounding.
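In symbols, the data-generating process simulated below is the partially linear model:

```latex
Y = D\,\theta_0 + g_0(X) + U, \qquad \mathbb{E}[U \mid D, X] = 0,
```
```latex
D = m_0(X) + V, \qquad \mathbb{E}[V \mid X] = 0,
```

where theta_0 = 0.5 is the target parameter and g_0, m_0 are the (unknown, possibly nonlinear) nuisance functions coded below.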
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(glmnet)
library(ranger)
set.seed(2018)
n <- 2000; p <- 100; theta_true <- 0.5
# AR(1) correlated covariates
X <- matrix(rnorm(n * p), n, p)
rho <- 0.5
for (j in 2:p) X[, j] <- rho * X[, j-1] + sqrt(1 - rho^2) * X[, j]
colnames(X) <- paste0("X", 1:p)
# Nuisance functions
g_X <- X[,1]^2 + sin(pi * X[,2]) + 2 * X[,3] * X[,4] + X[,5]^3 / 5
m_X <- 0.5 * X[,1] + 0.3 * X[,2]^2 - 0.2 * X[,3] + 0.4 * abs(X[,4])
U <- rnorm(n); V <- rnorm(n)
D <- m_X + V
Y <- theta_true * D + g_X + U
df <- as.data.frame(X)
df$D <- D; df$Y <- Y
cat("n =", n, ", p =", p, ", true theta =", theta_true, "\n")Expected output:
Sample size: n = 2000
Number of covariates: p = 100
True treatment effect: theta = 0.5
Outcome Y: mean = 0.54, sd = 2.15
Var(g(X)): ~3.5
Var(m(X)): ~0.8
Step 2: Naive OLS and Naive ML (Biased Approaches)
First, estimate the treatment effect with naive approaches that illustrate the problems DML solves.
# OLS no controls
ols_nc <- lm(Y ~ D, data = df)
cat("OLS (no controls):", round(coef(ols_nc)["D"], 4), "\n")
# OLS all controls
fml <- as.formula(paste("Y ~ D +", paste0("X", 1:p, collapse = "+")))
ols_all <- lm(fml, data = df)
cat("OLS (all controls):", round(coef(ols_all)["D"], 4), "\n")
# Naive ML: lasso for g only, no orthogonalization
X_mat <- as.matrix(df[, paste0("X", 1:p)])
cv_g <- cv.glmnet(X_mat, Y, alpha = 1)
g_hat <- predict(cv_g, X_mat, s = "lambda.min")
Y_resid <- Y - g_hat
naive_ml <- lm(as.numeric(Y_resid) ~ D)
cat("Naive ML (lasso for g):", round(coef(naive_ml)["D"], 4), "\n")
cat("True theta:", theta_true, "\n")Expected output — Naive estimators:
| Estimator | Estimate | Bias |
|---|---|---|
| OLS, no controls | ~0.72 | +0.22 |
| OLS, all 100 controls | ~0.51 | +0.01 |
| Naive ML (lasso for g only) | ~0.62 | +0.12 |
| Naive RF (no cross-fitting) | ~0.30 | -0.20 |
| True theta | 0.50 | --- |
OLS without controls suffers from omitted variable bias. The naive ML approach — using lasso to partial out X from Y but not from D — is biased because the regularization error in estimating g is correlated with D through the confounding function m. The naive RF approach with in-sample prediction is attenuated because in-sample overfitting makes residuals too small.
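The naive RF rows in the tables are not produced by the Step 2 code. Here is a self-contained sketch of that in-sample approach; it re-simulates a smaller illustrative DGP rather than reusing the lab's objects, so the sizes and coefficients here are assumptions, not the lab's:

```r
library(ranger)

set.seed(42)
# Small illustrative DGP: D confounded by X through m(X)
n <- 1000; p <- 20; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
D <- 0.5 * X[, 1] + rnorm(n)
Y <- theta0 * D + X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)
Xdf <- as.data.frame(X)

# Fit both nuisances with random forests, then predict IN-SAMPLE:
# the fitted values absorb noise and part of the theta0 * D signal,
# so both residuals are too small and theta is attenuated toward zero
rf_g <- ranger(y = Y, x = Xdf, num.trees = 200)
rf_m <- ranger(y = D, x = Xdf, num.trees = 200)
Y_r <- Y - predict(rf_g, data = Xdf)$predictions
D_r <- D - predict(rf_m, data = Xdf)$predictions
theta_naive_rf <- sum(D_r * Y_r) / sum(D_r^2)
cat("Naive RF (in-sample), true theta =", theta0, ":", round(theta_naive_rf, 3), "\n")
```

Swapping the in-sample `predict()` calls for out-of-fold predictions is exactly the cross-fitting fix introduced in Step 3.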
Why does the 'naive ML' approach (lasso for g only, without orthogonalization) produce a biased estimate of theta, even though lasso is a good estimator of g(X)?
Step 3: DML with Cross-Fitting
The DML estimator uses two key innovations: (1) orthogonalization — residualize both Y and D against X, and (2) cross-fitting — split the sample to avoid overfitting when using the same data for nuisance estimation and inference.
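Writing \ell_0(X) = E[Y | X] and m_0(X) = E[D | X], the orthogonalization step is partialling out: theta_0 solves the Neyman-orthogonal moment condition

```latex
\mathbb{E}\!\left[\big(Y - \ell_0(X) - \theta_0\,(D - m_0(X))\big)\,\big(D - m_0(X)\big)\right] = 0,
\qquad
\theta_0 = \frac{\mathbb{E}\big[(Y - \ell_0(X))\,(D - m_0(X))\big]}{\mathbb{E}\big[(D - m_0(X))^2\big]}.
```

This is why `DoubleMLPLR` below takes two learners: `ml_l` for E[Y | X] and `ml_m` for E[D | X].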
# DML using the DoubleML package
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D",
x_cols = paste0("X", 1:p))
# DML with Lasso
lasso_learner <- lrn("regr.cv_glmnet", s = "lambda.min")
dml_lasso <- DoubleMLPLR$new(dml_data,
ml_l = lasso_learner$clone(),
ml_m = lasso_learner$clone(),
n_folds = 5)
dml_lasso$fit()
cat("=== DML (Lasso) ===\n")
print(dml_lasso$summary())
# DML with Random Forest
rf_learner <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
dml_rf <- DoubleMLPLR$new(dml_data,
ml_l = rf_learner$clone(),
ml_m = rf_learner$clone(),
n_folds = 5)
dml_rf$fit()
cat("\n=== DML (Random Forest) ===\n")
print(dml_rf$summary())
cat("True theta:", theta_true, "\n")Expected output — DML estimates:
| Estimator | theta | SE | 95% CI |
|---|---|---|---|
| DML (Lasso) | ~0.50 | ~0.03 | [0.44, 0.56] |
| DML (Random Forest) | ~0.51 | ~0.03 | [0.45, 0.57] |
| True theta | 0.50 | --- | --- |
Both DML estimators recover the true treatment effect with minimal bias. The standard errors are valid for inference (the 95% CI covers the true value), and the estimates are robust to the choice of ML method.
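To see what `DoubleMLPLR` computes under the hood, here is a minimal hand-rolled version of the cross-fitted partialling-out estimator. It runs on a small self-contained DGP (the sizes and linear confounding are illustrative assumptions, chosen so lasso is a reasonable learner for both nuisances); the fold logic, the final ratio, and the sandwich-style standard error are the substance:

```r
library(glmnet)

set.seed(2018)
# Small illustrative DGP with linear confounding
n <- 1000; p <- 50; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
D <- 0.5 * X[, 1] + rnorm(n)
Y <- theta0 * D + X[, 1] + 0.5 * X[, 2] + rnorm(n)

# Cross-fitting: nuisances are fit on K-1 folds, predicted on the held-out fold
K <- 5
folds <- sample(rep(1:K, length.out = n))
Y_r <- numeric(n); D_r <- numeric(n)
for (k in 1:K) {
  tr <- folds != k; te <- folds == k
  fit_l <- cv.glmnet(X[tr, ], Y[tr])   # learns l_0(X) = E[Y | X]
  fit_m <- cv.glmnet(X[tr, ], D[tr])   # learns m_0(X) = E[D | X]
  Y_r[te] <- Y[te] - predict(fit_l, X[te, ], s = "lambda.min")
  D_r[te] <- D[te] - predict(fit_m, X[te, ], s = "lambda.min")
}

# Partialling-out estimator and its sandwich standard error
theta_hat <- sum(D_r * Y_r) / sum(D_r^2)
se_hat <- sqrt(mean((Y_r - theta_hat * D_r)^2 * D_r^2) / n) / mean(D_r^2)
cat("theta_hat =", round(theta_hat, 3), " SE =", round(se_hat, 3), "\n")
```

With n = 1000, the standard error is roughly 1/sqrt(n), about 0.03, so the estimate lands close to theta0 = 0.5.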
Step 4: The Role of Cross-Fitting
Cross-fitting is essential for DML. Without cross-fitting, overfitting in the nuisance estimation contaminates the treatment effect estimate.
# DML without cross-fitting
g_hat_full <- predict(cv.glmnet(X_mat, Y), X_mat, s = "lambda.min")
m_hat_full <- predict(cv.glmnet(X_mat, D), X_mat, s = "lambda.min")
Y_r <- Y - g_hat_full
D_r <- D - m_hat_full
theta_no_cf <- sum(D_r * Y_r) / sum(D_r * D_r)
cat("=== Cross-Fitting Effect ===\n")
cat("With cross-fitting:", round(dml_lasso$coef, 4), "\n")
cat("Without cross-fitting:", round(theta_no_cf, 4), "\n")
cat("True theta:", theta_true, "\n")
# Vary K
for (K_test in c(2, 3, 5, 10)) {
dml_k <- DoubleMLPLR$new(dml_data,
ml_l = lasso_learner$clone(),
ml_m = lasso_learner$clone(),
n_folds = K_test)
dml_k$fit()
cat("K =", K_test, ": theta =", round(dml_k$coef, 4), "\n")
}
Expected output — Cross-fitting comparison:
| Method | theta | Bias |
|---|---|---|
| DML with cross-fitting (K=5) | ~0.502 | +0.002 |
| DML without cross-fitting | ~0.485 | -0.015 |
| True theta | 0.500 | --- |
Sensitivity to number of folds:
| K | theta | Bias |
|---|---|---|
| 2 | ~0.498 | -0.002 |
| 3 | ~0.501 | +0.001 |
| 5 | ~0.502 | +0.002 |
| 10 | ~0.503 | +0.003 |
What is the purpose of cross-fitting in DML, and why is it necessary even when using Neyman-orthogonal moment conditions?
Step 5: Full Comparison and Oracle Benchmark
Compare all estimators, including the oracle that knows the true g(X) and m(X).
# Oracle
Y_oracle <- Y - g_X
D_oracle <- D - m_X
theta_oracle <- sum(D_oracle * Y_oracle) / sum(D_oracle^2)
cat("=== Final Comparison ===\n")
cat("No controls:", round(coef(ols_nc)["D"], 4), "\n")
cat("OLS (100 linear):", round(coef(ols_all)["D"], 4), "\n")
cat("Naive ML:", round(coef(naive_ml)["D"], 4), "\n")
cat("DML (Lasso):", round(dml_lasso$coef, 4), "\n")
cat("DML (RF):", round(dml_rf$coef, 4), "\n")
cat("Oracle:", round(theta_oracle, 4), "\n")
cat("True:", theta_true, "\n")Expected output — Full comparison:
| Estimator | Estimate | SE | Bias | 95% CI Covers? |
|---|---|---|---|---|
| OLS, no controls | ~0.72 | ~0.04 | +0.22 | No |
| OLS, all 100 controls | ~0.51 | ~0.03 | +0.01 | Yes |
| Naive ML (lasso for g) | ~0.62 | --- | +0.12 | No |
| Naive RF (no cross-fitting) | ~0.30 | --- | -0.20 | No |
| DML, Lasso (K=5) | ~0.50 | ~0.03 | +0.00 | Yes |
| DML, Random Forest (K=5) | ~0.51 | ~0.03 | +0.01 | Yes |
| Oracle (true g, true m) | ~0.50 | ~0.03 | +0.00 | Yes |
| True theta | 0.50 | — | — | — |
Chernozhukov et al. (2018) require that nuisance estimators converge at a rate faster than n^{-1/4}. Why is the n^{-1/4} rate important?
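The overview also promises a Monte Carlo check of coverage. Here is a compact, scaled-down sketch (far fewer observations, covariates, and replications than the lab's DGP, purely for speed; all sizes are illustrative assumptions):

```r
library(glmnet)

set.seed(123)
# Scaled-down Monte Carlo: check that the cross-fitted 95% CI
# covers theta0 close to 95% of the time
R <- 50; n <- 200; p <- 20; theta0 <- 0.5
covered <- logical(R)
for (r in 1:R) {
  X <- matrix(rnorm(n * p), n, p)
  D <- 0.5 * X[, 1] + rnorm(n)
  Y <- theta0 * D + X[, 1] + 0.5 * X[, 2] + rnorm(n)
  # 2-fold cross-fitting with lasso nuisances
  folds <- sample(rep(1:2, length.out = n))
  Y_r <- numeric(n); D_r <- numeric(n)
  for (k in 1:2) {
    tr <- folds != k; te <- folds == k
    Y_r[te] <- Y[te] - predict(cv.glmnet(X[tr, ], Y[tr]), X[te, ], s = "lambda.min")
    D_r[te] <- D[te] - predict(cv.glmnet(X[tr, ], D[tr]), X[te, ], s = "lambda.min")
  }
  th <- sum(D_r * Y_r) / sum(D_r^2)
  se <- sqrt(mean((Y_r - th * D_r)^2 * D_r^2) / n) / mean(D_r^2)
  covered[r] <- abs(th - theta0) <= 1.96 * se
}
cat("Empirical coverage of the 95% CI:", mean(covered), "\n")
```

Extension exercise 4 asks for the full-scale version of this experiment (200 repetitions of the lab's DGP).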
Summary
The replication of Chernozhukov et al. (2018) demonstrates:
- Regularization bias is real. Naive plug-in ML estimators produce biased estimates of treatment effects, even with good prediction performance.
- Neyman orthogonalization solves the bias. By residualizing both Y and D against X (the "double" in DML), the estimator becomes insensitive to first-order errors in nuisance estimation.
- Cross-fitting prevents overfitting contamination. Using separate sample splits for nuisance estimation and inference eliminates the need for restrictive Donsker conditions.
- DML is robust to the ML method. Whether you use lasso, random forests, or other learners, the DML estimate converges to the true theta, a key practical advantage.
- Valid inference. DML provides asymptotically normal estimates with standard errors suitable for confidence intervals and hypothesis tests.
Extension Exercises
- Interactive model. Extend the DGP to the interactive model Y = g(D, X) + U, where the treatment effect varies with X. Use the interactive DML model (IRM) instead of the partially linear model.
- Increase dimensionality. Set p = 500 (with n = 2000). OLS with all controls will fail entirely. Compare lasso DML with random forest DML in the high-dimensional regime.
- Misspecification. Use a linear learner (OLS or ridge) in the DML framework when the true g(X) is nonlinear. How much does functional-form misspecification matter for the DML estimate?
- Multiple repetitions. Run the DML procedure 200 times with different random seeds and plot the distribution of theta_hat. Verify that the distribution is approximately normal and that the 95% CI has correct coverage.
- Partially linear IV. Extend the DGP to include an instrument Z and implement the DML-IV estimator (the PLIV model from the DoubleML package). Compare with standard 2SLS.
- Neural network learner. Replace lasso with a neural network as the ML learner in DML. Does the DML estimate remain unbiased? How do computation time and variance compare?
- Sensitivity to sparsity. Vary the number of active confounders (currently 5) from 1 to 50. At what point does lasso DML start to degrade relative to random forest DML?
- Honest confidence intervals. Implement the DML confidence interval and compare with the bootstrap confidence interval. Verify that both achieve approximately 95% coverage.