Double/Debiased Machine Learning (DML)
Uses machine learning for nuisance parameter estimation while preserving valid inference on the causal parameter of interest.
One-Line Implementation
R: DoubleMLPLR$new(dml_data, ml_l = lrn('regr.ranger'), ml_m = lrn('classif.ranger'), n_folds = 5)$fit()
Stata: ssc install ddml pystacked; multi-step: ddml init, then ddml E[Y|X] and ddml crossfit
Python: LinearDML(model_y=RandomForestRegressor(), model_t=RandomForestClassifier(), cv=5).fit(Y, T, X=X)
Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: Estimating Price Elasticity of Demand
You want to estimate the causal effect of price on demand for a product. You have observational data with hundreds of potential confounders: competitor prices, seasonality, weather, local demographics, marketing spend, and more.
Traditional regression forces you to specify a functional form: maybe you add log-transformed variables, interactions, and polynomials. But with hundreds of confounders, getting the functional form right is extremely difficult. Machine learning excels at flexibly fitting complex relationships — random forests, gradient boosting, and neural networks can capture nonlinearities and interactions that you would never think to specify.
But there is a catch. If you simply run a random forest to predict the outcome, extract the predicted values, and use them in a second-stage regression, your standard errors are wrong and your point estimate may be biased. ML models overfit, regularize in ways that bias coefficients, and do not distinguish between causal and predictive relationships.
Chernozhukov et al. (2018) addressed this problem with Double/Debiased Machine Learning (DML). The key ideas are:
- Neyman orthogonality: Construct the causal estimating equation so that small errors in the ML-estimated nuisance functions do not bias the causal parameter.
- Cross-fitting: Split the data to avoid overfitting bias — train ML models on one subset, predict on another.
The result: you can use any well-behaved ML method to control for confounders while still getting valid confidence intervals for the causal effect.
A. Overview
The Problem with Naive ML
Suppose you want to estimate θ in:

Y = θD + g(X) + ε

where g(X) is an unknown, potentially complex function of the high-dimensional confounders X.

Naive approach: Estimate ĝ(X) using ML, then regress Y − ĝ(X) on D.

Problem: Even if ĝ converges to g, ML methods typically converge at a rate slower than n^(-1/2). This "regularization bias" contaminates the estimate of θ, making standard errors invalid and the point estimate biased.
The DML Solution
DML addresses this through two innovations:
1. Neyman orthogonality. Instead of partialing out confounders only from Y, also partial out confounders from D — that is, residualize both Y and D on X. This "double residualization" makes the estimating equation for θ insensitive to first-order errors in the estimated nuisance functions. The idea goes back to Robinson (1988) and Frisch-Waugh-Lovell, but DML generalizes it to the ML setting.
2. Cross-fitting. Split the sample into K folds. For each fold, train the ML models on the other K − 1 folds and predict on the held-out fold. This sample-splitting avoids the overfitting problem that arises when the same data are used for both ML estimation and causal inference.
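The two ideas combine into a short algorithm; here is a minimal pure-Python sketch on simulated data, with a k-nearest-neighbor smoother standing in for the ML learner (the sample size, fold count, DGP, and helper names are all illustrative):

```python
import random

random.seed(0)
n, K, theta = 1000, 5, 1.5  # sample size, folds, true causal effect (illustrative)

# Simulated partially linear model: Y = theta*D + g(X) + eps, with nonlinear g(X) = X^2
X = [random.uniform(-2, 2) for _ in range(n)]
D = [0.5 * x + random.gauss(0, 1) for x in X]
Y = [theta * d + x * x + random.gauss(0, 1) for d, x in zip(D, X)]

def knn_predict(x_train, y_train, x0, k=40):
    """Mean outcome of the k nearest training points — a crude stand-in ML learner."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

# Cross-fitting: residualize each fold with nuisance models fit on the other folds
idx = list(range(n))
random.shuffle(idx)
folds = [idx[j::K] for j in range(K)]

Y_res, D_res = [0.0] * n, [0.0] * n
for fold in range(K):
    train = [i for j in range(K) if j != fold for i in folds[j]]
    Xt = [X[i] for i in train]
    Yt = [Y[i] for i in train]
    Dt = [D[i] for i in train]
    for i in folds[fold]:
        Y_res[i] = Y[i] - knn_predict(Xt, Yt, X[i])  # Y - l_hat(X), l(X) = E[Y|X]
        D_res[i] = D[i] - knn_predict(Xt, Dt, X[i])  # D - m_hat(X), m(X) = E[D|X]

# Final stage: OLS (through the origin) of residualized outcome on residualized treatment
theta_hat = sum(d * y for d, y in zip(D_res, Y_res)) / sum(d * d for d in D_res)
print(round(theta_hat, 2))  # close to the true theta = 1.5 despite the nonlinear g
```

A naive single-stage regression of Y on D here would be badly confounded by X; the double-residualized slope recovers θ even though the confounding function is nonlinear and the learner is crude.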
Neyman Orthogonality (Plain Language)
Imagine you are estimating a treatment effect, and your estimate depends on how well you model the outcome as a function of confounders. If a small error in your outcome model directly translates into a proportional error in your treatment effect estimate, you have a problem — because ML models inevitably have some error.
Neyman orthogonality means the treatment effect estimate is locally insensitive to errors in the nuisance models. Geometrically, the "gradient" of the causal parameter with respect to the nuisance function is zero at the true values. First-order errors in the nuisance function produce only second-order errors in the causal parameter.
This insensitivity property is why you partial out confounders from both the outcome and the treatment. The resulting estimating equation has the Neyman-orthogonal property.
Common Confusions
"Can I use any ML method?" Almost. The ML learners must satisfy certain convergence rate conditions (roughly, faster than ; more precisely, the product of the two nuisance estimation errors must be ). Most standard methods (random forests, gradient boosting, , neural networks) satisfy this condition. But consider avoiding methods that do not converge at all or that have extremely high variance.
"Does DML give me causal effects even with observational data?" DML gives you valid inference conditional on the unconfoundedness assumption, similar to matching methods. It does not solve the omitted variable problem. If there are unmeasured confounders, DML is biased just like OLS would be. DML's advantage is in handling observed confounders more flexibly.
"How is DML different from doubly robust estimation?" The doubly robust property is a building block of DML. DML adds (to handle ML overfitting) and the formal Neyman orthogonality framework (to handle regularization bias). You can think of DML as "doubly robust estimation done right when using ML."
"How many folds for cross-fitting?" 5 folds is a common default. Too few folds means the training set is small (reducing ML performance). Too many folds means each held-out set is small (increasing variance). The theoretical results are robust to the number of folds as long as .
B. Identification
The Partially Linear Model
The simplest DML setup is the partially linear regression model:

Y = θD + g(X) + ε
D = m(X) + v
where:
- θ is the causal parameter of interest
- g(X) captures the confounding relationship between X and Y (nuisance)
- m(X) = E[D | X] is the treatment confounding function (nuisance / propensity score)
- ε and v are residuals, with E[ε | D, X] = 0 and E[v | X] = 0
Identifying Assumptions
- Conditional exogeneity (unconfoundedness): E[ε | D, X] = 0 — conditional on the observed covariates X, the treatment is as-good-as randomly assigned. All confounders are observed and included in X.
- Overlap (positivity): 0 < P(D = 1 | X = x) < 1 for all x in the support of X — every unit has a positive probability of receiving treatment. Without overlap, the propensity score would be degenerate and residualization would fail.
- SUTVA: No interference between units — one unit's treatment does not affect another unit's outcome.
- Sufficient ML convergence: The nuisance estimators ℓ̂ and m̂ converge at a rate faster than n^(-1/4), so that the product of their errors is o(n^(-1/2)).
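The last assumption is commonly stated as a product condition on the two nuisance errors, measured in the L²(P) norm:

```latex
\bigl\lVert \hat{\ell} - \ell_0 \bigr\rVert_{P,2}
\,\times\,
\bigl\lVert \hat{m} - m_0 \bigr\rVert_{P,2}
\;=\; o_P\!\bigl(n^{-1/2}\bigr)
```

which holds, for example, when each nuisance error is o_P(n^{-1/4}); a faster rate for one learner can compensate for a slower rate for the other.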
The DML Estimator
Step 1: Double residualization.
- Estimate ℓ(X) = E[Y | X] and m(X) = E[D | X] using ML. Here ℓ(X) is the reduced-form conditional expectation of Y — we estimate this directly rather than g(X) because g(X) depends on the unknown θ.
- Compute residuals: Ỹ = Y − ℓ̂(X) and D̃ = D − m̂(X)

Step 2: Estimate θ:

θ̂ = (Σᵢ D̃ᵢ Ỹᵢ) / (Σᵢ D̃ᵢ²)

This estimator is just OLS of Ỹ on D̃ — the Frisch-Waugh-Lovell theorem applied to ML residuals.

Step 3: Cross-fitting. Do the above with sample splitting:
- Split data into K folds
- For each fold k: train ℓ̂ and m̂ on all data except fold k; predict for observations in fold k
- Pool all residuals and run the final regression
Formal Statement
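Under the identifying assumptions and the nuisance rate condition above, a standard statement of the Chernozhukov et al. (2018) result for the partially linear model is:

```latex
\sqrt{n}\,\bigl(\hat{\theta} - \theta_0\bigr)
\;\xrightarrow{\;d\;}\;
\mathcal{N}\!\bigl(0,\ \sigma^2\bigr),
\qquad
\sigma^2 \;=\; \frac{\mathbb{E}\!\left[v^2 \varepsilon^2\right]}
                    {\bigl(\mathbb{E}\!\left[v^2\right]\bigr)^{2}}
```

where ε and v are the outcome and treatment residuals of the model. The estimator is √n-consistent and asymptotically normal, so the usual Wald-type confidence intervals are valid.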
C. Visual Intuition
The DML procedure can be visualized in three steps:
1. Partial out X from Y: Remove the part of the outcome explained by confounders (the ML-predicted component). What remains (Ỹ = Y − ℓ̂(X)) is the variation in Y not explained by X.
2. Partial out X from D: Remove the part of the treatment explained by confounders. What remains (D̃ = D − m̂(X)) is the variation in treatment not predicted by observables — the "residual" or "exogenous" variation.
3. Regress Ỹ on D̃: The slope of this regression is the causal effect θ.
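In the linear special case the three steps reduce exactly to the Frisch-Waugh-Lovell theorem; here is a minimal pure-Python sketch on simulated data (the DGP and helper functions are illustrative):

```python
import random

random.seed(0)
n, theta = 5000, 2.0  # sample size and true causal effect (illustrative)

# Linear special case: Y = theta*D + 3*X + eps, D = 0.5*X + v
X = [random.gauss(0, 1) for _ in range(n)]
D = [0.5 * x + random.gauss(0, 1) for x in X]
Y = [theta * d + 3.0 * x + random.gauss(0, 1) for d, x in zip(D, X)]

def ols_slope(y, x):
    """Slope from simple OLS of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def ols_residuals(y, x):
    """Residuals from simple OLS of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = ols_slope(y, x)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

Y_res = ols_residuals(Y, X)          # step 1: partial X out of Y
D_res = ols_residuals(D, X)          # step 2: partial X out of D
theta_hat = ols_slope(Y_res, D_res)  # step 3: residual-on-residual slope
print(round(theta_hat, 2))           # close to the true theta = 2.0
```

With linear nuisance models the residual-on-residual slope equals the multivariate OLS coefficient on D exactly; DML replaces the two linear partialling-out regressions with cross-fitted ML predictions.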
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: The DML estimating equation is constructed so that the score function has a zero derivative with respect to the nuisance parameters at their true values. This orthogonality means small ML estimation errors produce only second-order bias.
Define the moment condition for the partially linear model:

ψ(W; θ, η) = (Y − ℓ(X) − θ(D − m(X))) · (D − m(X))

where η = (ℓ, m) are the nuisance functions and W = (Y, D, X).

Standard moment condition:

E[ψ(W; θ₀, η₀)] = 0

Neyman orthogonality condition: The Gateaux derivative of E[ψ(W; θ₀, η)] with respect to η at η = η₀ is zero:

∂ᵣ E[ψ(W; θ₀, η₀ + r(η − η₀))] |ᵣ₌₀ = 0 for all admissible η

For our moment condition, the derivative in the directions δℓ = ℓ − ℓ₀ and δm = m − m₀ is:

−E[v · δℓ(X)] + E[(θ₀ v − ε) · δm(X)]

where ε = Y − ℓ₀(X) − θ₀ v is the outcome residual and v = D − m₀(X) is the treatment residual.

The first term is zero because E[v | X] = 0 (by construction of m₀(X) = E[D | X]). The second term is zero because E[v | X] = 0 and E[ε | D, X] = 0.

This orthogonality means that θ̂ is robust to first-order perturbations of ℓ̂ and m̂ around their true values. Combined with cross-fitting (which prevents overfitting bias), this yields:

√n (θ̂ − θ₀) →d N(0, σ²)

giving √n-consistency and asymptotic normality.
E. Implementation
# Requires: DoubleML, mlr3, mlr3learners
# DoubleML: R implementation of the Chernozhukov et al. (2018) DML framework
library(DoubleML)
library(mlr3)         # mlr3: machine learning framework for R
library(mlr3learners) # mlr3learners: additional ML algorithms for mlr3

# --- Step 1: Prepare data for DML ---
# DoubleMLData specifies the causal structure: outcome, treatment, and confounders
dml_data <- DoubleMLData$new(
  df,
  y_col = "outcome",          # outcome variable
  d_cols = "treatment",       # treatment variable
  x_cols = paste0("x", 1:50)  # high-dimensional confounders
)

# --- Step 2: Choose ML learners for nuisance estimation ---
# ml_l: predicts E[Y|X] (outcome model) — removes confounding from outcome
# ml_m: predicts E[D|X] (treatment model) — removes confounding from treatment
ml_l <- lrn("regr.ranger", num.trees = 500)                            # random forest for outcome
ml_m <- lrn("classif.ranger", num.trees = 500, predict_type = "prob")  # random forest for propensity

# --- Step 3: Fit DML with cross-fitting ---
# DoubleMLPLR: partially linear model Y = theta*D + g(X) + epsilon
# n_folds = 5: 5-fold cross-fitting prevents overfitting bias
# Cross-fitting is essential — without it, ML overfitting biases theta
dml_plr <- DoubleMLPLR$new(
  dml_data,
  ml_l = ml_l,
  ml_m = ml_m,
  n_folds = 5
)
dml_plr$fit()
dml_plr$summary()
# theta: debiased causal effect of treatment on outcome

# --- Step 4: Confidence interval ---
# Valid inference from Neyman orthogonality + cross-fitting
dml_plr$confint(level = 0.95)

F. Diagnostics
1. Check the quality of the first-stage ML models. Report cross-validated R² or MSE for both the outcome model (ℓ̂) and the treatment model (m̂). If the ML models do not fit well, the residualization may not adequately remove confounding.
2. Compare ML learners. Run DML with different ML methods (random forest, gradient boosting, lasso) and check whether the causal estimate changes. Robustness to the ML learner choice increases credibility.
3. Check residual balance. After double residualization, the residualized treatment D̃ should be uncorrelated with X. Check this by regressing D̃ on X — the R² should be near zero.
4. Sensitivity to the number of folds. Re-run with different numbers of folds (e.g., K = 2, 5, 10). Results should be stable.
5. Compare with simple OLS. If DML and OLS give similar results, the confounding is well-captured by linear terms and DML's flexibility was not needed (but also did not hurt).
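The residual-balance diagnostic takes only a few lines; here is a minimal pure-Python sketch with a simulated confounder, using the true treatment model as a stand-in for the cross-fitted m̂ (all names and the DGP are illustrative):

```python
import random

random.seed(1)
n = 4000

# Simulated confounder and treatment: D depends strongly on X
X = [random.gauss(0, 1) for _ in range(n)]
D = [0.8 * x + random.gauss(0, 1) for x in X]

def r_squared(y, x):
    """R^2 from simple OLS of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    ss_res = sum((yi - my - b * (xi - mx)) ** 2 for yi, xi in zip(y, x))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Residualize the treatment; here the oracle model E[D|X] = 0.8*X plays the role of m_hat
D_res = [d - 0.8 * x for d, x in zip(D, X)]

print(r_squared(D, X))      # raw treatment: strongly predicted by X
print(r_squared(D_res, X))  # residualized treatment: R^2 near zero
```

In practice you would run this regression with all columns of X (or a flexible learner) on the pooled cross-fitted residuals; a non-trivial R² signals that the treatment model left confounding variation in D̃.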
Interpreting Your Results
DML and OLS agree: The confounding relationship is approximately linear. Both are valid. DML's standard errors may be somewhat wider, reflecting additional estimation variance from the cross-fitting procedure.
DML and OLS disagree: Nonlinear confounding matters. Report DML as your main result and discuss why the linear approximation fails.
DML results vary across ML learners: The nuisance functions may not be well-estimated. Consider using ensemble methods or SuperLearner to aggregate multiple ML methods.
G. What Can Go Wrong
Omitting Cross-Fitting: Overfitting Bias Corrupts Inference
Use 5-fold cross-fitting: train the ML models on 4 folds, predict residuals on the held-out fold. Repeat for all folds, then regress residualized outcome on residualized treatment.
DML with cross-fitting: theta = 0.15 (SE = 0.04, 95% CI [0.07, 0.23]). Coverage in simulations: 94.6%. Valid inference.
Single Residualization: Missing the 'Double' in DML
Partial out confounders from BOTH the outcome Y and the treatment D using ML. Regress the residualized outcome on the residualized treatment (double residualization / Frisch-Waugh-Lovell with ML).
DML estimate: 0.15 (SE = 0.04). The double residualization removes confounding from both sides, yielding a Neyman-orthogonal estimating equation that is insensitive to first-order ML errors.
Weak First-Stage ML Models: Residualization Fails to Remove Confounding
Use well-tuned ML models (random forest with 500 trees, gradient boosting with cross-validated hyperparameters) for both nuisance functions. Verify cross-validated R-squared is reasonable.
Cross-validated R-squared: 0.65 for outcome model, 0.40 for treatment model. DML estimate: 0.15 (SE = 0.04). Residualized treatment is uncorrelated with covariates (R-squared < 0.01).
H. Practice
A researcher runs DML to estimate the effect of advertising on sales, using 200 covariates. The DML estimate is 0.15 (SE = 0.04). The OLS estimate with all 200 covariates is 0.22 (SE = 0.03). What is the most likely explanation for the difference?
What is the purpose of the 'double' in Double/Debiased Machine Learning?
Double Machine Learning: Estimating the Effect of R&D Spending on Firm Productivity
An economist wants to estimate the causal effect of R&D investment (D) on total factor productivity (Y) across 5,000 firms. The challenge is that 180 potential confounders (industry conditions, firm age, market concentration, prior performance) affect both R&D decisions and productivity. She uses DML with a random forest for both the outcome model and the treatment model, with 5-fold cross-fitting.
Read the analysis below carefully and identify the errors.
A marketing researcher estimates the causal effect of digital advertising spending on sales using DML with 150 covariates (competitor prices, seasonality dummies, weather, demographics). She uses a random forest for the outcome model and logistic regression for the treatment model (advertising is binarized as high/low). She reports: "DML estimate: 12% sales increase per ad campaign (SE = 2.1%, p < 0.001). Cross-validated R-squared for the outcome model is 0.78." She writes: "Our DML approach controls for high-dimensional confounders using machine learning, providing a credible causal estimate." She does not report the treatment model's performance.
Select all errors you can find:
Read the analysis below carefully and identify the errors.
An applied microeconomist uses DML to estimate the return to college education on earnings, using 80 covariates from census data. He uses gradient boosting for both nuisance models with 5-fold cross-fitting. He reports: "DML estimate: $8,200 annual earnings premium (SE = $450). OLS estimate: $12,500 (SE = $380). The DML estimate is 34% smaller, demonstrating that nonlinear confounding substantially inflates the OLS estimate." He writes: "Since DML handles high-dimensional nonlinear confounding, our estimate represents the true causal return to education." He does not discuss potential unobserved confounders.
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
The authors estimate the effect of corporate tax rate changes on firm investment using DML. They use firm-level panel data for 15,000 firms across 30 countries, with 300 covariates including financial ratios, industry indicators, and macroeconomic variables. They report a DML estimate of -0.45 (a 1 percentage point tax increase reduces investment by 0.45%). They compare with OLS (-0.32) and argue the difference shows OLS understates the negative effect due to nonlinear confounding.
Key Table
| Method | Estimate | SE | 95% CI |
|---|---|---|---|
| OLS | -0.32 | 0.05 | [-0.42, -0.22] |
| DML (RF) | -0.45 | 0.08 | [-0.61, -0.29] |
| DML (GBM) | -0.51 | 0.09 | [-0.69, -0.33] |
| DML (Lasso) | -0.38 | 0.07 | [-0.52, -0.24] |
Outcome model CV R²: 0.72 (RF), 0.75 (GBM), 0.58 (Lasso)
Treatment model CV R²: 0.35 (RF), 0.38 (GBM), 0.22 (Lasso)
Authors' Identification Claim
DML controls for 300 confounders flexibly using ML, providing a credible estimate of the causal effect of taxes on investment. The cross-fitting procedure ensures valid inference.
I. Swap-In: When to Use Something Else
- OLS with controls: When the number of controls is small, functional form is known, and there is no need for machine-learning flexibility.
- IV / 2SLS: When endogeneity cannot be addressed by conditioning on observables and a valid instrument is available — DML assumes conditional exogeneity.
- Matching: When a transparent matched-pair design is preferred over a regression-based approach, and the covariate space is moderate.
- Doubly robust estimation: When double robustness is desired but the parametric setting suffices — DR estimators share the same doubly-robust logic as DML but are typically applied without cross-fitting.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (7)
Bach, P., Chernozhukov, V., Kurz, M. S., & Spindler, M. (2022). DoubleML – An Object-Oriented Implementation of Double Machine Learning in Python.
Bach and colleagues develop the DoubleML Python package, providing a user-friendly object-oriented implementation of the DML framework. The package supports partially linear, interactive, and instrumental variable models with a variety of machine learning methods for nuisance estimation. A companion R package is described separately.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls.
Belloni, Chernozhukov, and Hansen introduce the post-double-selection LASSO method for inference on treatment effects with many potential controls. This paper is a key precursor to DML, demonstrating how regularized selection in both the treatment and outcome equations can yield valid inference.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters.
Chernozhukov et al. introduce double/debiased machine learning (DML), showing how to combine Neyman orthogonality with cross-fitting to obtain root-n consistent and asymptotically normal estimates of low-dimensional causal parameters while using high-dimensional machine learning for nuisance functions. This paper provides the theoretical foundation for valid inference when first-stage estimation uses flexible ML methods that would otherwise invalidate standard asymptotic arguments. The cross-fitting procedure it introduces is now standard practice for any application combining ML prediction with causal parameter estimation.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., & Robins, J. M. (2022). Locally Robust Semiparametric Estimation.
Chernozhukov, Escanciano, Ichimura, Newey, and Robins develop locally robust semiparametric estimators that extend the DML framework, demonstrating how automatic debiasing with machine learning first-stage estimates can be applied broadly. Their approach yields root-n consistent estimates of causal and structural parameters even when nuisance functions are estimated with regularized machine learning methods.
Fan, Q., Hsu, Y.-C., Lieli, R. P., & Zhang, Y. (2022). Estimation of Conditional Average Treatment Effects with High-Dimensional Data.
Fan and colleagues propose nonparametric estimators for conditional average treatment effects in high-dimensional settings. Their approach uses machine learning to estimate nuisance functions in a first stage, then applies local linear regression for the CATE function of interest, with functional limit theory and multiplier-bootstrap uniform inference.
Robinson, P. M. (1988). Root-N-Consistent Semiparametric Regression.
Robinson develops the partially linear regression estimator that achieves root-n consistency for the parametric component by partialling out nonparametric nuisance functions. This paper provides the semiparametric foundation that DML generalizes to the machine learning setting.
Semenova, V., & Chernozhukov, V. (2021). Debiased Machine Learning of Conditional Average Treatment Effects and Other Causal Functions.
Semenova and Chernozhukov extend DML to estimate conditional average treatment effects (CATEs) and other causal functions, allowing researchers to characterize treatment effect heterogeneity. They provide inference methods for projections of the CATE onto interpretable subgroups.
Application (1)
Knaus, M. C., Lechner, M., & Strittmatter, A. (2021). Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence.
Knaus, Lechner, and Strittmatter conduct an empirical Monte Carlo study benchmarking eleven causal machine learning estimators for heterogeneous treatment effects across 24 data-generating processes based on real labor market data. They find that no single estimator dominates across all settings, and that ensemble methods combining multiple learners perform well overall. The study provides practical guidance on when different CATE estimators (causal forests, DML-based methods, meta-learners) are most reliable.
Survey (2)
Athey, S., & Imbens, G. W. (2019). Machine Learning Methods That Economists Should Know About.
Athey and Imbens provide a broad survey of machine learning methods relevant to economists, covering supervised learning, unsupervised learning, matrix completion, and methods at the intersection of ML and causal inference including DML and causal forests. The paper explains when and why machine learning methods can improve both prediction and causal inference in economics. It serves as an accessible entry point for applied researchers seeking to understand the full landscape of ML tools available for economic applications.
Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach.
Mullainathan and Spiess provide an accessible introduction to supervised machine learning for economists, emphasizing how ML differs from classical parameter estimation and where prediction-oriented tools can be useful in empirical economics. The paper is a broad ML-for-economists survey, not a foundational paper on double/debiased machine learning specifically.