Replication Lab: Double/Debiased Machine Learning
Replicate the key results from Chernozhukov et al. (2018) on double/debiased machine learning. Simulate a high-dimensional data-generating process, show the failure of naive ML estimation, implement the partial linear DML estimator with cross-fitting, and compare estimators.
Overview
In this replication lab, you will explore the core methodology from one of the most important papers in modern causal inference:
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal 21(1): C1–C68.
The central problem addressed by DML: when you use machine learning (lasso, random forests, etc.) to control for high-dimensional confounders, naive "plug-in" estimators are biased because regularization introduces systematic errors. The DML framework solves the bias problem through two key innovations: (1) Neyman orthogonalization (the "double" in double ML), which makes the estimator insensitive to small errors in nuisance parameter estimation, and (2) cross-fitting, which avoids overfitting bias when the same data are used to estimate nuisance parameters and the target parameter.
Why the Chernozhukov et al. paper matters: It provided a rigorous, general framework for combining machine learning with causal inference, enabling researchers to use flexible ML methods while retaining valid statistical inference about treatment effects.
What you will do:
- Simulate a high-dimensional partially linear model
- Estimate the treatment effect using naive OLS (omitting confounders)
- Estimate using "naive ML" (plug-in lasso without orthogonalization)
- Implement the DML estimator with cross-fitting
- Run a Monte Carlo study demonstrating root-n consistency and correct coverage
- Compare all estimators to the true treatment effect and the oracle
Step 1: Simulate a High-Dimensional Partially Linear Model
The partially linear model separates the treatment variable D from the high-dimensional controls X. The outcome Y depends on D (linearly) and on X (nonlinearly through g). The treatment D also depends on X (through m), creating confounding.
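In symbols, the data-generating process simulated below is the partially linear model:

```latex
Y = D\,\theta_0 + g_0(X) + U, \qquad \mathbb{E}[U \mid D, X] = 0,
```
```latex
D = m_0(X) + V, \qquad \mathbb{E}[V \mid X] = 0,
```

where theta_0 = 0.5 is the target parameter and g_0, m_0 are the (unknown, possibly nonlinear) nuisance functions coded below.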
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(glmnet)
library(ranger)
set.seed(2018)
n <- 2000; p <- 100; theta_true <- 0.5
# AR(1) correlated covariates
X <- matrix(rnorm(n * p), n, p)
rho <- 0.5
for (j in 2:p) X[, j] <- rho * X[, j-1] + sqrt(1 - rho^2) * X[, j]
colnames(X) <- paste0("X", 1:p)
# Nuisance functions
g_X <- X[,1]^2 + sin(pi * X[,2]) + 2 * X[,3] * X[,4] + X[,5]^3 / 5
m_X <- 0.5 * X[,1] + 0.3 * X[,2]^2 - 0.2 * X[,3] + 0.4 * abs(X[,4])
U <- rnorm(n); V <- rnorm(n)
D <- m_X + V
Y <- theta_true * D + g_X + U
df <- as.data.frame(X)
df$D <- D; df$Y <- Y
cat("n =", n, ", p =", p, ", true theta =", theta_true, "\n")Expected output:
Sample size: n = 2000
Number of covariates: p = 100
True treatment effect: theta = 0.5
Outcome Y: mean = 0.54, sd = 2.15
Var(g(X)): ~3.5
Var(m(X)): ~0.8
Step 2: Naive OLS and Naive ML (Biased Approaches)
First, estimate the treatment effect with naive approaches that illustrate the problems DML solves.
# OLS no controls
ols_nc <- lm(Y ~ D, data = df)
cat("OLS (no controls):", round(coef(ols_nc)["D"], 4), "\n")
# OLS all controls
fml <- as.formula(paste("Y ~ D +", paste0("X", 1:p, collapse = "+")))
ols_all <- lm(fml, data = df)
cat("OLS (all controls):", round(coef(ols_all)["D"], 4), "\n")
# Naive ML: lasso for g only, no orthogonalization
X_mat <- as.matrix(df[, paste0("X", 1:p)])
cv_g <- cv.glmnet(X_mat, Y, alpha = 1)
g_hat <- predict(cv_g, X_mat, s = "lambda.min")
Y_resid <- Y - g_hat
naive_ml <- lm(as.numeric(Y_resid) ~ D)
cat("Naive ML (lasso for g):", round(coef(naive_ml)["D"], 4), "\n")
cat("True theta:", theta_true, "\n")Expected output — Naive estimators:
| Estimator | Estimate | Bias |
|---|---|---|
| OLS, no controls | ~0.72 | +0.22 |
| OLS, all 100 controls | ~0.51 | +0.01 |
| Naive ML (lasso for g only) | ~0.62 | +0.12 |
| Naive RF (no cross-fitting) | ~0.30 | -0.20 |
| True theta | 0.50 | --- |
OLS without controls suffers from omitted variable bias. The naive ML approach — using lasso to partial out X from Y but not from D — is biased because the regularization error in estimating g is correlated with D through the confounding function m. The naive RF approach with in-sample prediction is attenuated because in-sample overfitting makes residuals too small.
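The naive RF rows in the tables are not produced by the Step 2 code. Here is a self-contained sketch of that in-sample approach; it re-simulates a smaller illustrative DGP rather than reusing the lab's objects, so the sizes and coefficients here are assumptions, not the lab's:

```r
library(ranger)

set.seed(42)
# Small illustrative DGP: D confounded by X through m(X)
n <- 1000; p <- 20; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
D <- 0.5 * X[, 1] + rnorm(n)
Y <- theta0 * D + X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)
Xdf <- as.data.frame(X)

# Fit both nuisances with random forests, then predict IN-SAMPLE:
# the fitted values absorb noise and part of the theta0 * D signal,
# so both residuals are too small and theta is attenuated toward zero
rf_g <- ranger(y = Y, x = Xdf, num.trees = 200)
rf_m <- ranger(y = D, x = Xdf, num.trees = 200)
Y_r <- Y - predict(rf_g, data = Xdf)$predictions
D_r <- D - predict(rf_m, data = Xdf)$predictions
theta_naive_rf <- sum(D_r * Y_r) / sum(D_r^2)
cat("Naive RF (in-sample), true theta =", theta0, ":", round(theta_naive_rf, 3), "\n")
```

Swapping the in-sample `predict()` calls for out-of-fold predictions is exactly the cross-fitting fix introduced in Step 3.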
Why does the 'naive ML' approach (lasso for g only, without orthogonalization) produce a biased estimate of theta, even though lasso is a good estimator of g(X)?
Step 3: DML with Cross-Fitting
The DML estimator uses two key innovations: (1) orthogonalization — residualize both Y and D against X, and (2) cross-fitting — split the sample to avoid overfitting when using the same data for nuisance estimation and inference.
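Writing \ell_0(X) = E[Y | X] and m_0(X) = E[D | X], the orthogonalization step is partialling out: theta_0 solves the Neyman-orthogonal moment condition

```latex
\mathbb{E}\!\left[\big(Y - \ell_0(X) - \theta_0\,(D - m_0(X))\big)\,\big(D - m_0(X)\big)\right] = 0,
\qquad
\theta_0 = \frac{\mathbb{E}\big[(Y - \ell_0(X))\,(D - m_0(X))\big]}{\mathbb{E}\big[(D - m_0(X))^2\big]}.
```

This is why `DoubleMLPLR` below takes two learners: `ml_l` for E[Y | X] and `ml_m` for E[D | X].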
# DML using the DoubleML package
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D",
x_cols = paste0("X", 1:p))
# DML with Lasso
lasso_learner <- lrn("regr.cv_glmnet", s = "lambda.min")
dml_lasso <- DoubleMLPLR$new(dml_data,
ml_l = lasso_learner$clone(),
ml_m = lasso_learner$clone(),
n_folds = 5)
dml_lasso$fit()
cat("=== DML (Lasso) ===\n")
print(dml_lasso$summary())
# DML with Random Forest
rf_learner <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
dml_rf <- DoubleMLPLR$new(dml_data,
ml_l = rf_learner$clone(),
ml_m = rf_learner$clone(),
n_folds = 5)
dml_rf$fit()
cat("\n=== DML (Random Forest) ===\n")
print(dml_rf$summary())
cat("True theta:", theta_true, "\n")Expected output — DML estimates:
| Estimator | theta | SE | 95% CI |
|---|---|---|---|
| DML (Lasso) | ~0.50 | ~0.03 | [0.44, 0.56] |
| DML (Random Forest) | ~0.51 | ~0.03 | [0.45, 0.57] |
| True theta | 0.50 | --- | --- |
Both DML estimators recover the true treatment effect with minimal bias. The standard errors are valid for inference (the 95% CI covers the true value), and the estimates are robust to the choice of ML method.
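To see what `DoubleMLPLR` computes under the hood, here is a minimal hand-rolled version of the cross-fitted partialling-out estimator. It runs on a small self-contained DGP (the sizes and linear confounding are illustrative assumptions, chosen so lasso is a reasonable learner for both nuisances); the fold logic, the final ratio, and the sandwich-style standard error are the substance:

```r
library(glmnet)

set.seed(2018)
# Small illustrative DGP with linear confounding
n <- 1000; p <- 50; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
D <- 0.5 * X[, 1] + rnorm(n)
Y <- theta0 * D + X[, 1] + 0.5 * X[, 2] + rnorm(n)

# Cross-fitting: nuisances are fit on K-1 folds, predicted on the held-out fold
K <- 5
folds <- sample(rep(1:K, length.out = n))
Y_r <- numeric(n); D_r <- numeric(n)
for (k in 1:K) {
  tr <- folds != k; te <- folds == k
  fit_l <- cv.glmnet(X[tr, ], Y[tr])   # learns l_0(X) = E[Y | X]
  fit_m <- cv.glmnet(X[tr, ], D[tr])   # learns m_0(X) = E[D | X]
  Y_r[te] <- Y[te] - predict(fit_l, X[te, ], s = "lambda.min")
  D_r[te] <- D[te] - predict(fit_m, X[te, ], s = "lambda.min")
}

# Partialling-out estimator and its sandwich standard error
theta_hat <- sum(D_r * Y_r) / sum(D_r^2)
se_hat <- sqrt(mean((Y_r - theta_hat * D_r)^2 * D_r^2) / n) / mean(D_r^2)
cat("theta_hat =", round(theta_hat, 3), " SE =", round(se_hat, 3), "\n")
```

With n = 1000, the standard error is roughly 1/sqrt(n), about 0.03, so the estimate lands close to theta0 = 0.5.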
Step 4: The Role of Cross-Fitting
Cross-fitting is essential for DML. Without cross-fitting, overfitting in the nuisance estimation contaminates the treatment effect estimate.
# DML without cross-fitting
g_hat_full <- predict(cv.glmnet(X_mat, Y), X_mat, s = "lambda.min")
m_hat_full <- predict(cv.glmnet(X_mat, D), X_mat, s = "lambda.min")
Y_r <- Y - g_hat_full
D_r <- D - m_hat_full
theta_no_cf <- sum(D_r * Y_r) / sum(D_r * D_r)
cat("=== Cross-Fitting Effect ===\n")
cat("With cross-fitting:", round(dml_lasso$coef, 4), "\n")
cat("Without cross-fitting:", round(theta_no_cf, 4), "\n")
cat("True theta:", theta_true, "\n")
# Vary K
for (K_test in c(2, 3, 5, 10)) {
dml_k <- DoubleMLPLR$new(dml_data,
ml_l = lasso_learner$clone(),
ml_m = lasso_learner$clone(),
n_folds = K_test)
dml_k$fit()
cat("K =", K_test, ": theta =", round(dml_k$coef, 4), "\n")
}
Expected output — Cross-fitting comparison:
| Method | theta | Bias |
|---|---|---|
| DML with cross-fitting (K=5) | ~0.502 | +0.002 |
| DML without cross-fitting | ~0.485 | -0.015 |
| True theta | 0.500 | --- |
Sensitivity to number of folds:
| K | theta | Bias |
|---|---|---|
| 2 | ~0.498 | -0.002 |
| 3 | ~0.501 | +0.001 |
| 5 | ~0.502 | +0.002 |
| 10 | ~0.503 | +0.003 |
What is the purpose of cross-fitting in DML, and why is it necessary even when using Neyman-orthogonal moment conditions?
Step 5: Full Comparison and Oracle Benchmark
Compare all estimators, including the oracle that knows the true g(X) and m(X).
# Oracle
Y_oracle <- Y - g_X
D_oracle <- D - m_X
theta_oracle <- sum(D_oracle * Y_oracle) / sum(D_oracle^2)
cat("=== Final Comparison ===\n")
cat("No controls:", round(coef(ols_nc)["D"], 4), "\n")
cat("OLS (100 linear):", round(coef(ols_all)["D"], 4), "\n")
cat("Naive ML:", round(coef(naive_ml)["D"], 4), "\n")
cat("DML (Lasso):", round(dml_lasso$coef, 4), "\n")
cat("DML (RF):", round(dml_rf$coef, 4), "\n")
cat("Oracle:", round(theta_oracle, 4), "\n")
cat("True:", theta_true, "\n")Expected output — Full comparison:
| Estimator | Estimate | SE | Bias | 95% CI Covers? |
|---|---|---|---|---|
| OLS, no controls | ~0.72 | ~0.04 | +0.22 | No |
| OLS, all 100 controls | ~0.51 | ~0.03 | +0.01 | Yes |
| Naive ML (lasso for g) | ~0.62 | --- | +0.12 | No |
| Naive RF (no cross-fitting) | ~0.30 | --- | -0.20 | No |
| DML, Lasso (K=5) | ~0.50 | ~0.03 | +0.00 | Yes |
| DML, Random Forest (K=5) | ~0.51 | ~0.03 | +0.01 | Yes |
| Oracle (true g, true m) | ~0.50 | ~0.03 | +0.00 | Yes |
| True theta | 0.50 | — | — | — |
Chernozhukov et al. (2018) require that nuisance estimators converge at a rate faster than n^{-1/4}. Why is the n^{-1/4} rate important?
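The overview also promises a Monte Carlo check of coverage. Here is a compact, scaled-down sketch (far fewer observations, covariates, and replications than the lab's DGP, purely for speed; all sizes are illustrative assumptions):

```r
library(glmnet)

set.seed(123)
# Scaled-down Monte Carlo: check that the cross-fitted 95% CI
# covers theta0 close to 95% of the time
R <- 50; n <- 200; p <- 20; theta0 <- 0.5
covered <- logical(R)
for (r in 1:R) {
  X <- matrix(rnorm(n * p), n, p)
  D <- 0.5 * X[, 1] + rnorm(n)
  Y <- theta0 * D + X[, 1] + 0.5 * X[, 2] + rnorm(n)
  # 2-fold cross-fitting with lasso nuisances
  folds <- sample(rep(1:2, length.out = n))
  Y_r <- numeric(n); D_r <- numeric(n)
  for (k in 1:2) {
    tr <- folds != k; te <- folds == k
    Y_r[te] <- Y[te] - predict(cv.glmnet(X[tr, ], Y[tr]), X[te, ], s = "lambda.min")
    D_r[te] <- D[te] - predict(cv.glmnet(X[tr, ], D[tr]), X[te, ], s = "lambda.min")
  }
  th <- sum(D_r * Y_r) / sum(D_r^2)
  se <- sqrt(mean((Y_r - th * D_r)^2 * D_r^2) / n) / mean(D_r^2)
  covered[r] <- abs(th - theta0) <= 1.96 * se
}
cat("Empirical coverage of the 95% CI:", mean(covered), "\n")
```

Extension exercise 4 asks for the full-scale version of this experiment (200 repetitions of the lab's DGP).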
Summary
The replication of Chernozhukov et al. (2018) demonstrates:
- Regularization bias is real. Naive plug-in ML estimators produce biased estimates of treatment effects, even with good prediction performance.
- Neyman orthogonalization solves the bias. By residualizing both Y and D against X (the "double" in DML), the estimator becomes insensitive to first-order errors in nuisance estimation.
- Cross-fitting prevents overfitting contamination. Using separate sample splits for nuisance estimation and inference eliminates the need for restrictive Donsker conditions.
- DML is robust to the ML method. Whether you use lasso, random forests, or other learners, the DML estimate converges to the true theta, a key practical advantage.
- Valid inference. DML provides asymptotically normal estimates with standard errors suitable for confidence intervals and hypothesis tests.
Extension Exercises
- Interactive model. Extend the DGP to the interactive model Y = g(D, X) + U, where the treatment effect varies with X. Use the interactive DML model (IRM) instead of the partially linear model.
- Increase dimensionality. Set p = 500 (with n = 2000). OLS with all controls will fail entirely. Compare lasso DML with random forest DML in the high-dimensional regime.
- Misspecification. Use a linear learner (OLS or ridge) in the DML framework when the true g(X) is nonlinear. How much does functional-form misspecification matter for the DML estimate?
- Multiple repetitions. Run the DML procedure 200 times with different random seeds and plot the distribution of theta_hat. Verify that the distribution is approximately normal and that the 95% CI has correct coverage.
- Partially linear IV. Extend the DGP to include an instrument Z and implement the DML-IV estimator (the PLIV model from the DoubleML package). Compare with standard 2SLS.
- Neural network learner. Replace lasso with a neural network as the ML learner in DML. Does the DML estimate remain unbiased? How do computation time and variance compare?
- Sensitivity to sparsity. Vary the number of active confounders (currently 5) from 1 to 50. At what point does lasso DML start to degrade relative to random forest DML?
- Honest confidence intervals. Implement the DML confidence interval and compare with the bootstrap confidence interval. Verify that both achieve approximately 95% coverage.