MethodAtlas
Replication · 120 minutes

Replication Lab: Double/Debiased Machine Learning

Replicate the key results from Chernozhukov et al. (2018) on double/debiased machine learning. Simulate a high-dimensional data-generating process, show the failure of naive ML estimation, implement the partially linear DML estimator with cross-fitting, and compare estimators.

Overview

In this replication lab, you will explore the core methodology from one of the most important papers in modern causal inference:

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal 21(1): C1–C68.

The central problem addressed by DML: when you use machine learning (lasso, random forests, etc.) to control for high-dimensional confounders, naive "plug-in" estimators are biased because regularization introduces systematic errors. The DML framework solves the bias problem through two key innovations: (1) Neyman orthogonalization (the "double" in double ML), which makes the estimator insensitive to small errors in nuisance parameter estimation, and (2) cross-fitting, which avoids overfitting bias when the same data are used to estimate nuisance parameters and the target parameter.
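In symbols, for the partially linear setup used throughout this lab (writing W = (Y, D, X) for the data and eta = (l, m) for the nuisances):

```latex
% Partially linear model:
Y = \theta_0 D + g(X) + U, \qquad \mathbb{E}[U \mid X, D] = 0,
D = m(X) + V,              \qquad \mathbb{E}[V \mid X] = 0.

% Neyman-orthogonal (partialling-out) score, with \ell(X) = \mathbb{E}[Y \mid X]:
\psi(W; \theta, \eta) = \bigl(Y - \ell(X) - \theta\,(D - m(X))\bigr)\bigl(D - m(X)\bigr).
```

The derivative of E[psi] with respect to small perturbations of (l, m) is zero at the truth, which is exactly the insensitivity property described above. The naive score (Y - g(X) - theta*D) * D lacks this property: its derivative with respect to g is -E[h(X) m(X)], which is nonzero whenever D is confounded.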

Why the Chernozhukov et al. paper matters: It provided a rigorous, general framework for combining machine learning with causal inference, enabling researchers to use flexible ML methods while retaining valid statistical inference about treatment effects.

What you will do:

  • Simulate a high-dimensional partially linear model
  • Estimate the treatment effect using naive OLS (omitting confounders)
  • Estimate using "naive ML" (plug-in lasso without orthogonalization)
  • Implement the DML estimator with cross-fitting
  • Run a Monte Carlo study demonstrating root-n consistency and correct coverage
  • Compare all estimators to the true treatment effect and the oracle

Step 1: Simulate a High-Dimensional Partially Linear Model

The partially linear model separates the treatment variable D from the high-dimensional controls X. The outcome Y depends on D (linearly) and on X (nonlinearly through g). The treatment D also depends on X (through m), creating confounding.

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(glmnet)
library(ranger)

set.seed(2018)

n <- 2000; p <- 100; theta_true <- 0.5

# AR(1) correlated covariates
X <- matrix(rnorm(n * p), n, p)
rho <- 0.5
for (j in 2:p) X[, j] <- rho * X[, j-1] + sqrt(1 - rho^2) * X[, j]
colnames(X) <- paste0("X", 1:p)

# Nuisance functions
g_X <- X[,1]^2 + sin(pi * X[,2]) + 2 * X[,3] * X[,4] + X[,5]^3 / 5
m_X <- 0.5 * X[,1] + 0.3 * X[,2]^2 - 0.2 * X[,3] + 0.4 * abs(X[,4])

U <- rnorm(n); V <- rnorm(n)
D <- m_X + V
Y <- theta_true * D + g_X + U

df <- as.data.frame(X)
df$D <- D; df$Y <- Y

cat("Sample size: n =", n, "\n")
cat("Number of covariates: p =", p, "\n")
cat("True treatment effect: theta =", theta_true, "\n\n")
cat("Outcome Y: mean =", round(mean(Y), 2), ", sd =", round(sd(Y), 2), "\n")
cat("Var(g(X)):", round(var(g_X), 2), "\n")
cat("Var(m(X)):", round(var(m_X), 2), "\n")

Expected output:

Sample size: n = 2000
Number of covariates: p = 100
True treatment effect: theta = 0.5

Outcome Y: mean = 0.54, sd = 2.15
Var(g(X)): ~3.5
Var(m(X)): ~0.8

Step 2: Naive OLS and Naive ML (Biased Approaches)

First, estimate the treatment effect with naive approaches that illustrate the problems DML solves.

# OLS no controls
ols_nc <- lm(Y ~ D, data = df)
cat("OLS (no controls):", round(coef(ols_nc)["D"], 4), "\n")

# OLS all controls
fml <- as.formula(paste("Y ~ D +", paste0("X", 1:p, collapse = "+")))
ols_all <- lm(fml, data = df)
cat("OLS (all controls):", round(coef(ols_all)["D"], 4), "\n")

# Naive ML: lasso for g only, no orthogonalization
X_mat <- as.matrix(df[, paste0("X", 1:p)])
cv_g <- cv.glmnet(X_mat, Y, alpha = 1)
g_hat <- as.numeric(predict(cv_g, X_mat, s = "lambda.min"))
Y_resid <- Y - g_hat
naive_ml <- lm(Y_resid ~ D)
cat("Naive ML (lasso for g):", round(coef(naive_ml)["D"], 4), "\n")

# Naive RF: residualize Y and D with in-sample random forest predictions
# (no cross-fitting, so overfitting attenuates the estimate)
rf_g <- ranger(y = Y, x = df[, paste0("X", 1:p)], num.trees = 200)
rf_m <- ranger(y = D, x = df[, paste0("X", 1:p)], num.trees = 200)
Y_rf <- Y - predict(rf_g, df[, paste0("X", 1:p)])$predictions
D_rf <- D - predict(rf_m, df[, paste0("X", 1:p)])$predictions
cat("Naive RF (in-sample):", round(sum(D_rf * Y_rf) / sum(D_rf^2), 4), "\n")
cat("True theta:", theta_true, "\n")

Expected output — Naive estimators:

Estimator                      Estimate   Bias
OLS, no controls               ~0.72      +0.22
OLS, all 100 controls          ~0.51      +0.01
Naive ML (lasso for g only)    ~0.62      +0.12
Naive RF (no cross-fitting)    ~0.30      -0.20
True theta                     0.50       ---

OLS without controls suffers from omitted variable bias. The naive ML approach — using lasso to partial out X from Y but not from D — is biased because the regularization error in estimating g is correlated with D through the confounding function m. The naive RF approach with in-sample prediction is attenuated because in-sample overfitting makes residuals too small.
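The mechanics of this bias can be isolated in a deliberately simple, self-contained sketch (a hypothetical toy DGP, not the Step 1 data): because a shrunken g_hat under-fits g, the leftover error g - g_hat is correlated with D, and the residual-on-D regression absorbs roughly cov(D, g - g_hat)/Var(D) as bias.

```r
# Toy illustration of regularization bias (hypothetical DGP, not Step 1 data):
# shrinking g_hat toward zero leaves an error g - g_hat correlated with d.
set.seed(1)
n <- 5000
x <- rnorm(n)
g <- 2 * x                      # true outcome nuisance g(x)
d <- x + rnorm(n)               # treatment confounded through x
y <- 0.5 * d + g + rnorm(n)     # true theta = 0.5
g_hat <- 1.5 * x                # a shrunken ("regularized") estimate of g

y_res <- y - g_hat              # residualize Y only, as in naive ML
theta_naive <- unname(coef(lm(y_res ~ d))["d"])
bias_pred <- cov(d, g - g_hat) / var(d)   # predicted bias term
c(naive_bias = theta_naive - 0.5, predicted = bias_pred)
```

The two printed numbers agree up to sampling noise in U, confirming that the bias is not a finite-sample accident but a systematic consequence of regularizing g without orthogonalizing D.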

Concept Check

Why does the 'naive ML' approach (lasso for g only, without orthogonalization) produce a biased estimate of theta, even though lasso is a good estimator of g(X)?


Step 3: DML with Cross-Fitting

The DML estimator uses two key innovations: (1) orthogonalization — residualize both Y and D against X, and (2) cross-fitting — split the sample to avoid overfitting when using the same data for nuisance estimation and inference.

# DML using the DoubleML package (DoubleMLData expects a data.table backend)
library(data.table)
dml_data <- DoubleMLData$new(as.data.table(df), y_col = "Y", d_cols = "D",
                             x_cols = paste0("X", 1:p))

# DML with Lasso
lasso_learner <- lrn("regr.cv_glmnet", s = "lambda.min")
dml_lasso <- DoubleMLPLR$new(dml_data,
                            ml_l = lasso_learner$clone(),
                            ml_m = lasso_learner$clone(),
                            n_folds = 5)
dml_lasso$fit()
cat("=== DML (Lasso) ===\n")
print(dml_lasso$summary())

# DML with Random Forest
rf_learner <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
dml_rf <- DoubleMLPLR$new(dml_data,
                         ml_l = rf_learner$clone(),
                         ml_m = rf_learner$clone(),
                         n_folds = 5)
dml_rf$fit()
cat("\n=== DML (Random Forest) ===\n")
print(dml_rf$summary())
cat("True theta:", theta_true, "\n")

Expected output — DML estimates:

Estimator              theta    SE      95% CI
DML (Lasso)            ~0.50    ~0.03   [0.44, 0.56]
DML (Random Forest)    ~0.51    ~0.03   [0.45, 0.57]
True theta             0.50     ---     ---

Both DML estimators recover the true treatment effect with minimal bias. The standard errors are valid for inference (the 95% CI covers the true value), and the estimates are robust to the choice of ML method.
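To make the algorithm transparent, here is a from-scratch sketch of the same cross-fitting loop. For self-containedness it uses a low-dimensional toy DGP with OLS nuisance fits (hypothetical, not the Step 1 data); in the lab itself you would pass X_mat, Y, and D and swap the lm() calls for cv.glmnet().

```r
# From-scratch DML-PLR with K-fold cross-fitting (toy DGP, OLS nuisances).
set.seed(123)
n <- 2000; K <- 5
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- x1 + 0.5 * x2 + rnorm(n)            # treatment, confounded by x1, x2
y  <- 0.5 * d + x1^2 + x2 + rnorm(n)      # outcome, true theta = 0.5
dat <- data.frame(y, d, x1, x2)

folds <- sample(rep(1:K, length.out = n))  # random fold assignment
y_res <- d_res <- numeric(n)
for (k in 1:K) {
  te <- folds == k
  # 1. Fit nuisances l(X) = E[Y|X] and m(X) = E[D|X] on the other K-1 folds
  fit_l <- lm(y ~ poly(x1, 2) + x2, data = dat[!te, ])
  fit_m <- lm(d ~ x1 + x2, data = dat[!te, ])
  # 2. Residualize on the held-out fold only (this is the cross-fitting)
  y_res[te] <- dat$y[te] - predict(fit_l, dat[te, ])
  d_res[te] <- dat$d[te] - predict(fit_m, dat[te, ])
}
# 3. Final-stage estimate and plug-in SE from the orthogonal score
theta_hat <- sum(d_res * y_res) / sum(d_res^2)
se_hat <- sqrt(mean((y_res - theta_hat * d_res)^2 * d_res^2) /
                 (mean(d_res^2)^2 * n))
round(c(theta = theta_hat, se = se_hat), 3)
```

Every observation is residualized exactly once, always by models fit on the folds it does not belong to; the final stage is then an ordinary no-intercept regression of the outcome residuals on the treatment residuals.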


Step 4: The Role of Cross-Fitting

Cross-fitting is essential for DML. Without cross-fitting, overfitting in the nuisance estimation contaminates the treatment effect estimate.

# DML without cross-fitting
g_hat_full <- predict(cv.glmnet(X_mat, Y), X_mat, s = "lambda.min")
m_hat_full <- predict(cv.glmnet(X_mat, D), X_mat, s = "lambda.min")

Y_r <- Y - g_hat_full
D_r <- D - m_hat_full
theta_no_cf <- sum(D_r * Y_r) / sum(D_r * D_r)

cat("=== Cross-Fitting Effect ===\n")
cat("With cross-fitting:", round(dml_lasso$coef, 4), "\n")
cat("Without cross-fitting:", round(theta_no_cf, 4), "\n")
cat("True theta:", theta_true, "\n")

# Vary K
for (K_test in c(2, 3, 5, 10)) {
  dml_k <- DoubleMLPLR$new(dml_data,
                           ml_l = lasso_learner$clone(),
                           ml_m = lasso_learner$clone(),
                           n_folds = K_test)
  dml_k$fit()
  cat("K =", K_test, ": theta =", round(dml_k$coef, 4), "\n")
}

Expected output — Cross-fitting comparison:

Method                          theta     Bias
DML with cross-fitting (K=5)    ~0.502    +0.002
DML without cross-fitting       ~0.485    -0.015
True theta                      0.500     ---

Sensitivity to number of folds:

K     theta     Bias
2     ~0.498    -0.002
3     ~0.501    +0.001
5     ~0.502    +0.002
10    ~0.503    +0.003

Concept Check

What is the purpose of cross-fitting in DML, and why is it necessary even when using Neyman-orthogonal moment conditions?


Step 5: Full Comparison and Oracle Benchmark

Compare all estimators, including the oracle that knows the true g(X) and m(X).

# Oracle
Y_oracle <- Y - g_X
D_oracle <- D - m_X
theta_oracle <- sum(D_oracle * Y_oracle) / sum(D_oracle^2)

cat("=== Final Comparison ===\n")
cat("No controls:", round(coef(ols_nc)["D"], 4), "\n")
cat("OLS (100 linear):", round(coef(ols_all)["D"], 4), "\n")
cat("Naive ML:", round(coef(naive_ml)["D"], 4), "\n")
cat("DML (Lasso):", round(dml_lasso$coef, 4), "\n")
cat("DML (RF):", round(dml_rf$coef, 4), "\n")
cat("Oracle:", round(theta_oracle, 4), "\n")
cat("True:", theta_true, "\n")

Expected output — Full comparison:

Estimator                      Estimate   SE      Bias     95% CI Covers?
OLS, no controls               ~0.72      ~0.04   +0.22    No
OLS, all 100 controls          ~0.51      ~0.03   +0.01    Yes
Naive ML (lasso for g)         ~0.62      ---     +0.12    No
Naive RF (no cross-fitting)    ~0.30      ---     -0.20    No
DML, Lasso (K=5)               ~0.50      ~0.03   +0.00    Yes
DML, Random Forest (K=5)       ~0.51      ~0.03   +0.01    Yes
Oracle (true g, true m)        ~0.50      ~0.03   +0.00    Yes
True theta                     0.50       ---     ---      ---

Concept Check

Chernozhukov et al. (2018) require that nuisance estimators converge at a rate faster than n^{-1/4}. Why is the n^{-1/4} rate important?


Summary

The replication of Chernozhukov et al. (2018) demonstrates:

  1. Regularization bias is real. Naive plug-in ML estimators produce biased estimates of treatment effects, even with good prediction performance.

  2. Neyman orthogonalization solves the bias. By residualizing both Y and D against X (the "double" in DML), the estimator becomes insensitive to first-order errors in nuisance estimation.

  3. Cross-fitting prevents overfitting contamination. Using separate sample splits for nuisance estimation and inference eliminates the need for restrictive Donsker conditions.

  4. DML is robust to the ML method. Whether you use lasso, random forests, or other learners, the DML estimate converges to the true theta — a key practical advantage.

  5. Valid inference. DML provides asymptotically normal estimates with standard errors suitable for confidence intervals and hypothesis tests.


Extension Exercises

  1. Interactive model. Extend the DGP to the interactive model Y = g(D, X) + U where the treatment effect varies with X. Use the interactive DML model (IRM) instead of the partial linear model.

  2. Increase dimensionality. Set p = 500 (with n = 2000). OLS with all controls will fail entirely. Compare lasso DML with random forest DML in the high-dimensional regime.

  3. Misspecification. Use a linear learner (OLS or ridge) in the DML framework when the true g(X) is nonlinear. How much does functional form misspecification matter for the DML estimate?

  4. Multiple repetitions. Run the DML procedure 200 times with different random seeds and plot the distribution of theta_hat. Verify that the distribution is approximately normal and that the 95% CI has correct coverage.

  5. Partially linear IV. Extend the DGP to include an instrument Z and implement the DML-IV estimator (PLIV model from the DoubleML package). Compare with standard 2SLS.

  6. Neural network learner. Replace lasso with a neural network as the ML learner in DML. Does the DML estimate remain unbiased? How do computation time and variance compare?

  7. Sensitivity to sparsity. Vary the number of active confounders (currently 5) from 1 to 50. At what point does lasso DML start to degrade relative to random forest DML?

  8. Honest confidence intervals. Implement the DML confidence interval and compare with the bootstrap confidence interval. Verify that both achieve approximately 95% coverage.