MethodAtlas
Tutorial · 120 minutes

Lab: Double/Debiased Machine Learning

Estimate causal effects in the presence of many confounders using double/debiased machine learning. Learn why naive ML prediction fails for causal inference, how cross-fitting solves the overfitting problem, and how to implement the Chernozhukov et al. (2018) DML estimator.

Overview

In this lab you will estimate the causal effect of a binary treatment on an outcome when there are many potential confounders. You will see why simply plugging machine learning predictions into a regression produces biased estimates, and how the DML framework of Chernozhukov et al. (2018) solves this problem using cross-fitting and Neyman orthogonality.

What you will learn:

  • Why naive ML-based adjustment produces biased causal estimates (regularization bias)
  • How cross-fitting prevents overfitting from contaminating causal inference
  • How to implement the partially linear model (PLR) and interactive regression model (IRM) using DML
  • How to compare DML with OLS and naive ML approaches
  • How to conduct sensitivity analysis for unobserved confounding

Prerequisites: Familiarity with OLS regression and basic machine learning concepts (random forests, cross-validation). Understanding of the potential outcomes framework is helpful.


Step 1: Simulate High-Dimensional Data

We create an observational dataset with 20 covariates, where treatment depends on a subset of them and the outcome depends on a partially overlapping subset.

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(ranger)

set.seed(42)
n <- 3000; p <- 20

X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("X", 1:p)

logit_e <- 0.5 * X[,1] - 0.3 * X[,2]^2 + 0.4 * X[,3] * X[,4] + 0.2 * X[,5]
e_true <- plogis(logit_e)
D <- rbinom(n, 1, e_true)

g0 <- 2 * X[,1] + X[,2]^2 - 1.5 * X[,3] + 0.5 * X[,4] * X[,5] +
    0.8 * sin(X[,6]) - 0.6 * X[,7] + 0.3 * X[,8]
Y <- 2.0 * D + g0 + rnorm(n)

df <- as.data.frame(X)
df$D <- D; df$Y <- Y

cat("True ATE: 2.0\n")
cat("Naive diff:", mean(Y[D == 1]) - mean(Y[D == 0]), "\n")

Expected output:

Statistic                    Value
Sample size                  3,000
Number of covariates         20
Treatment rate               ~45–55%
True ATE                     2.000
Naive difference in means    ~2.5–3.5 (biased upward)

Sample data preview (first 5 rows):

   X1      X2      X3      X4      X5    ...   D      Y
 0.50   -0.14    0.65    1.52   -0.23   ...   1    5.82
-0.14    0.77   -0.46   -0.19    0.31   ...   0    1.03
 0.65   -0.47    1.01    0.39   -0.81   ...   1    4.21
 1.52    0.54   -0.32    0.15    0.68   ...   1    8.37
-0.23   -1.15    0.23   -0.78    0.42   ...   0   -2.15

Summary statistics:

Variable    Mean (D=1)    Mean (D=0)    Difference
Y           ~4.5          ~1.5          ~3.0 (biased)
X1          ~0.25         ~-0.20        Imbalanced
X2          ~-0.05        ~0.05         Slight imbalance

The naive difference in means exceeds the true ATE of 2.0 because treated units tend to have higher values of X1 (which directly increases Y through g0).
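This imbalance is easy to verify directly. A quick check, continuing with the df built in Step 1 (the smd computation below is an illustrative one-liner, not part of any package):

```r
# Group means of X1 by treatment status: treated units average higher X1
tapply(df$X1, df$D, mean)

# Standardized mean difference (SMD) for each covariate:
# (mean in treated - mean in control) / overall SD
smd <- sapply(paste0("X", 1:p), function(v)
  (mean(df[[v]][df$D == 1]) - mean(df[[v]][df$D == 0])) / sd(df[[v]]))
round(sort(abs(smd), decreasing = TRUE)[1:5], 3)
```

X1 should show the largest SMD (roughly 0.4–0.5), matching the summary table above; most other covariates are close to balanced in their means.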


Step 2: Why Naive ML Fails

A natural but wrong approach: use ML to predict Y from (D, X), then read off the coefficient on D.

# OLS with linear controls
ols <- lm(Y ~ D + ., data = df)
cat("OLS:", coef(ols)["D"], "\n")

# Naive RF approach
library(ranger)
rf_naive <- ranger(Y ~ ., data = df, num.trees = 200)
df_d1 <- df; df_d1$D <- 1
df_d0 <- df; df_d0$D <- 0
naive_ml <- mean(predict(rf_naive, df_d1)$predictions -
               predict(rf_naive, df_d0)$predictions)
cat("Naive RF:", naive_ml, "\n")
cat("True: 2.0\n")
Requires: ranger

Expected output:

Approach                 Estimate    True ATE    Bias
OLS (linear controls)    ~1.8–2.2    2.0         Small (misses nonlinear g0)
Lasso (linear)           ~1.7–2.1    2.0         Some regularization bias
Naive RF "effect"        ~1.5–1.9    2.0         Attenuated toward zero

OLS with linear controls can perform reasonably well here but misses the nonlinear terms in g0 (X2-squared, X3*X4 interaction, sin(X6)). The naive RF approach suffers from regularization bias — the treatment "coefficient" is shrunk toward zero because the random forest does not distinguish between the causal variable D and the confounders X.
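The shrinkage mechanism can be isolated with a stand-in that has a closed form. The sketch below uses ridge rather than the lasso/forest penalties above, on a small hypothetical DGP, because ridge makes the point algebraically: the penalty pulls the coefficient on D toward zero together with everything else, even though D is the causal variable:

```r
# Toy confounded DGP: true effect of d is 2, x is the confounder
set.seed(1)
n_toy <- 2000
x <- rnorm(n_toy)
d <- rbinom(n_toy, 1, plogis(x))   # treatment probability rises with x
y <- 2 * d + x + rnorm(n_toy)      # DGP has no intercept, so we omit one

Z <- cbind(d = d, x = x)
# OLS: (Z'Z)^{-1} Z'y -- recovers the d coefficient without shrinkage
ols   <- solve(crossprod(Z), crossprod(Z, y))
# Ridge: (Z'Z + lambda I)^{-1} Z'y -- the penalty shrinks both
# coefficients toward zero, including the causal one on d
ridge <- solve(crossprod(Z) + 50 * diag(2), crossprod(Z, y))
c(ols = ols["d", 1], ridge = ridge["d", 1])
```

The ridge coefficient on d lands below the OLS one. DML avoids this bias by never reading the causal effect off a penalized coefficient; instead it residualizes Y and D separately and regresses residual on residual.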

Concept Check

You use a random forest to predict Y from (D, X) and compute the average prediction difference between D = 1 and D = 0. Why does this prediction-based approach give a biased estimate of the ATE?


Step 3: The DML Estimator with Cross-Fitting

DML uses two key ideas: (1) Neyman orthogonality to remove sensitivity to nuisance parameter estimation, and (2) cross-fitting to prevent overfitting.
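Before handing things to DoubleML, the cross-fitting recipe is worth sketching by hand. This is a hypothetical minimal version on a toy DGP, using lm() for the two nuisance fits purely to keep it short (the lab proper uses random forests); the fold logic is exactly the PLR recipe, residualize Y and D out-of-fold, then regress residual on residual:

```r
# Toy data: true effect of d is 2, x_mat[,1] confounds
set.seed(42)
n_toy <- 500
x_mat <- matrix(rnorm(n_toy * 5), nrow = n_toy)
d <- rbinom(n_toy, 1, plogis(0.5 * x_mat[, 1]))
y <- 2 * d + x_mat[, 1] + rnorm(n_toy)

K <- 5
folds <- sample(rep(1:K, length.out = n_toy))   # random fold assignment
res_y <- res_d <- numeric(n_toy)

for (k in 1:K) {
  holdout <- folds == k
  train   <- !holdout
  # Nuisance fits on the training folds only
  fit_l <- lm(y[train] ~ x_mat[train, ])   # l(X) ~ E[Y | X]
  fit_m <- lm(d[train] ~ x_mat[train, ])   # m(X) ~ E[D | X]
  # Out-of-fold residuals: predictions on data the fits never saw
  res_y[holdout] <- y[holdout] - cbind(1, x_mat[holdout, ]) %*% coef(fit_l)
  res_d[holdout] <- d[holdout] - cbind(1, x_mat[holdout, ]) %*% coef(fit_m)
}

# Final stage: no-intercept regression of residual on residual
theta_hat <- sum(res_d * res_y) / sum(res_d^2)
theta_hat
```

Because each residual is computed on data the nuisance fit never saw, overfitting in the nuisance step cannot leak into the final-stage estimate; theta_hat should land near the true value of 2.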

# DML using the DoubleML package
# Prepare data object (DoubleMLData expects a data.table)
dml_data <- DoubleMLData$new(data.table::as.data.table(df),
                             y_col = "Y", d_cols = "D",
                             x_cols = paste0("X", 1:p))

# Choose ML learners
ml_l <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m <- lrn("regr.ranger", num.trees = 200, max.depth = 10)

# Partially Linear Model
dml_plr <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5)
dml_plr$fit()
print(dml_plr)

cat("\nDML estimate:", dml_plr$coef, "\n")
cat("SE:", dml_plr$se, "\n")
cat("True: 2.0\n")
Requires: DoubleML

Expected output:

DML Partially Linear Model    Value
beta_hat (ATE)                ~1.85–2.15
Standard error                ~0.06–0.10
95% CI lower                  ~1.75
95% CI upper                  ~2.25
True ATE                      2.000
Covers true value?            Yes

The DML estimate should be close to the true ATE of 2.0, with a valid confidence interval that covers the true value. Unlike the naive approaches, DML correctly handles the nonlinear confounding through cross-fitted ML residualization.


Step 4: Compare DML with OLS

# Multiple DML repetitions
dml_plr_multi <- DoubleMLPLR$new(dml_data, ml_l, ml_m, n_folds = 5, n_rep = 50)
dml_plr_multi$fit()

cat("DML (median over 50 reps):", median(dml_plr_multi$all_coef), "\n")
cat("OLS:", coef(ols)["D"], "\n")
cat("True: 2.0\n")

# Confidence intervals
dml_plr_multi$confint()

Expected output:

Estimator                    Estimate    SD across splits
DML (median over 50 reps)    ~2.00       ~0.06
OLS (linear controls)        ~1.9–2.1    —
True ATE                     2.000       —

Step 5: Interactive Regression Model (IRM)

The IRM variant is useful when the treatment effect may be heterogeneous. It uses an AIPW-style score function.

# IRM variant in DoubleML
ml_g_irm <- lrn("regr.ranger", num.trees = 200, max.depth = 10)
ml_m_irm <- lrn("classif.ranger", num.trees = 200, max.depth = 10)

dml_irm <- DoubleMLIRM$new(dml_data, ml_g_irm, ml_m_irm, n_folds = 5)
dml_irm$fit()
print(dml_irm)

cat("IRM estimate:", dml_irm$coef, "\n")
cat("PLR estimate:", dml_plr$coef, "\n")
Requires: DoubleML

Expected output:

DML-IRM (AIPW)    Value
ATE estimate      ~1.85–2.15
Standard error    ~0.06–0.10
95% CI lower      ~1.75
95% CI upper      ~2.25
True ATE          2.000

Comparison of Estimators    Estimate
Naive difference            ~2.5–3.5 (biased)
OLS (linear controls)       ~1.8–2.2
Naive RF                    ~1.5–1.9 (attenuated)
DML-PLR                     ~1.9–2.1
DML-IRM (AIPW)              ~1.9–2.1
True ATE                    2.000

The IRM estimate uses an AIPW-style score function, which is doubly robust: it is consistent if either the outcome model or the propensity score model is correctly specified. In this DGP with a constant treatment effect, the PLR and IRM estimates should be similar.
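The AIPW score itself is short enough to compute by hand. For observation i it is

psi_i = g(1, X_i) - g(0, X_i) + D_i (Y_i - g(1, X_i)) / m(X_i) - (1 - D_i)(Y_i - g(0, X_i)) / (1 - m(X_i)),

and the ATE estimate is the sample mean of psi_i. A hypothetical minimal sketch on a toy DGP, with glm()/lm() nuisances and no cross-fitting so the score stays in focus (DoubleMLIRM adds cross-fitting and ML nuisances):

```r
set.seed(7)
n_toy <- 4000
x <- rnorm(n_toy)
d <- rbinom(n_toy, 1, plogis(x))   # confounded treatment
y <- 2 * d + x + rnorm(n_toy)      # true ATE is 2

m_hat  <- fitted(glm(d ~ x, family = binomial))     # propensity m(X)
g_fit  <- lm(y ~ d + x)                             # outcome model g(D, X)
g1_hat <- predict(g_fit, data.frame(d = 1, x = x))  # g(1, X)
g0_hat <- predict(g_fit, data.frame(d = 0, x = x))  # g(0, X)

# AIPW score: regression adjustment plus inverse-propensity-weighted
# residual corrections for each treatment arm
psi <- g1_hat - g0_hat +
  d * (y - g1_hat) / m_hat -
  (1 - d) * (y - g0_hat) / (1 - m_hat)
ate_aipw <- mean(psi)
se_aipw  <- sd(psi) / sqrt(n_toy)
c(ATE = ate_aipw, SE = se_aipw)
```

Double robustness can be probed directly: deliberately misspecifying one nuisance (for example, dropping x from the outcome model while keeping the logistic propensity) should still leave the estimate near 2.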

Concept Check

DML uses cross-fitting (sample splitting). Why not just split the data in half, estimate nuisance functions on one half, and estimate the causal parameter on the other?


Exercises

  1. Try different ML learners. Replace random forests with gradient boosting, LASSO, or neural networks for the nuisance functions. How sensitive is the DML estimate to the choice of learner?

  2. Vary the number of confounders. Increase p to 100 (with only 8 truly relevant) and add irrelevant noise variables. Does DML still work?

  3. Group ATE. Estimate group-specific treatment effects (e.g., for above-median vs. below-median X1) using DML. Compare with a linear interaction model.

  4. Sensitivity analysis. Implement the Chernozhukov et al. (2022) sensitivity analysis framework to assess how robust the DML estimate is to unobserved confounding.


Summary

In this lab you learned:

  • Naive ML adjustment for confounders produces biased causal estimates due to regularization bias and overfitting
  • DML overcomes these problems using Neyman-orthogonal score functions and K-fold cross-fitting
  • The partially linear model (PLR) assumes a constant additive treatment effect; the interactive regression model (IRM) allows for heterogeneous effects
  • Cross-fitting ensures all observations are used efficiently while preventing overfitting bias
  • Multiple random splits (repetitions) reduce sensitivity to any particular fold assignment
  • DML is a general framework that works with any ML method (random forests, LASSO, neural networks, boosting) for nuisance estimation