DML vs. OLS: When Machine Learning Improves Causal Estimates
A practical comparison of OLS regression and Double/Debiased Machine Learning for causal inference. When does OLS suffice, and when do you need the flexibility of DML?
Two Ways to Estimate a Treatment Effect with Controls
The most common approach to estimating a causal effect in applied research is OLS regression with controls: regress the outcome on the treatment indicator and a set of covariates, then read off the treatment coefficient. OLS is simple, transparent, and well understood. But OLS rests on a strong functional form assumption — the conditional expectation of the outcome given covariates is linear — and when the true relationship is more complex, the treatment effect estimate inherits the misspecification bias.
Double/Debiased Machine Learning (DML), introduced by Chernozhukov et al. (2018), replaces the linear control function with flexible machine learning while preserving valid inference on the treatment effect. The intellectual foundation goes back to Robinson (1988), who showed that the parametric component of a partially linear model can be estimated at the $\sqrt{n}$ rate even when the nonparametric component converges slowly. DML extends Robinson's insight to the modern ML setting through two innovations: Neyman orthogonality and cross-fitting.
This guide compares the two approaches and helps you decide when the additional machinery of DML is warranted.
The Core Distinction: Linear vs. Flexible Nuisance Functions
Both OLS and DML estimate the same causal parameter $\tau$ in the partially linear model:

$$Y = \tau D + g_0(X) + \varepsilon, \qquad E[\varepsilon \mid D, X] = 0,$$

where $Y$ is the outcome, $D$ is treatment, $X$ is a vector of controls, and $g_0(X)$ is the nuisance function capturing confounding.
OLS: Assume $g_0$ Is Linear
OLS assumes $g_0(X) = X'\beta$, so you estimate:

$$Y = \tau D + X'\beta + \varepsilon.$$

The coefficient $\hat{\tau}$ is consistent if the linear specification is correct. When $g_0$ is actually nonlinear, OLS misspecifies the control function, and the misspecification leaks into $\hat{\tau}$ through the correlation between $D$ and the omitted nonlinear terms.
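The size of this leak can be written down explicitly. By a standard Frisch-Waugh-Lovell calculation, with $\tilde{D}$ the residual from the population linear projection of $D$ on $X$:

$$\hat{\tau}_{\text{OLS}} \xrightarrow{\,p\,} \tau + \frac{\operatorname{Cov}\big(\tilde{D},\, g_0(X)\big)}{\operatorname{Var}(\tilde{D})},$$

so the bias vanishes only when the nonlinear part of $g_0$ is uncorrelated with the component of $D$ that the linear projection on $X$ cannot explain.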
DML: Learn $g_0$ Flexibly
DML does not assume a functional form for $g_0$. Instead, DML also models the treatment assignment function $m_0(X) = E[D \mid X]$ and uses the double residualization procedure:
- Estimate $\hat{\ell}(X) \approx E[Y \mid X]$ and $\hat{m}(X) \approx E[D \mid X]$ using any ML method (random forest, gradient boosting, LASSO).
- Form residuals: $\tilde{Y} = Y - \hat{\ell}(X)$ and $\tilde{D} = D - \hat{m}(X)$.
- Regress $\tilde{Y}$ on $\tilde{D}$ to obtain $\hat{\tau}$.
The "double" in DML refers to partialing out confounders from both the outcome and the treatment. Single residualization — partialing out $X$ from the outcome only — is not Neyman-orthogonal and produces biased estimates when the ML learner is imperfect.
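The double-residualization procedure corresponds to the Neyman-orthogonal moment condition

$$E\big[(\tilde{Y} - \tau \tilde{D})\,\tilde{D}\big] = 0, \qquad \tilde{Y} = Y - E[Y \mid X], \quad \tilde{D} = D - E[D \mid X].$$

Orthogonality means the moment's derivative with respect to small perturbations of the two nuisance functions is zero, so first-order errors in the estimated nuisance functions affect $\hat{\tau}$ only through products of two estimation errors. This is what allows slow-converging ML learners in the first stage.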
When OLS Is Fine
OLS is an excellent choice — and often the better choice — in the following settings:
Low-dimensional, well-understood controls. You have 5-20 covariates, each chosen based on theory or institutional knowledge, and you are confident the linear specification captures the relevant variation. Adding DML machinery provides no bias reduction and introduces additional estimation noise.
Approximately linear confounding. If the conditional expectation function $g_0(X)$ is well-approximated by a linear function of $X$, OLS and DML will produce nearly identical estimates. DML does not help when the problem DML solves (nonlinear confounding) does not exist.
Small samples. DML's guarantees are asymptotic. In small samples (a few hundred observations), ML learners may estimate the nuisance functions poorly, and cross-fitting trains each nuisance model on only a fraction $(K-1)/K$ of the sample. OLS with a parsimonious specification may outperform DML in finite samples.
Transparency is paramount. OLS coefficients are directly interpretable, regression tables are standard, and readers can inspect every coefficient. For audiences and journals where interpretability dominates, OLS is a feature.
Clean identification from a research design. If treatment variation comes from an RCT, regression discontinuity, or a strong instrument, the primary concern is not the control function but the exogeneity of the design. Adding flexible ML-based controls on top of a clean design is unlikely to change the estimate meaningfully.
When DML Helps
DML provides meaningful advantages over OLS in the following settings:
High-dimensional controls. When the number of potential confounders is large (dozens to hundreds) — demographic variables, industry codes, geographic indicators, lagged outcomes, interaction terms — OLS either cannot include all of them (overfitting, multicollinearity) or the researcher must manually select a subset (introducing degrees of freedom). DML uses regularized ML to handle the full control set without biasing $\hat{\tau}$.
Nonlinear confounding. When the relationship between covariates and the outcome involves interactions, threshold effects, or nonlinearities that you cannot specify in advance, OLS misspecifies $g_0$ and the misspecification contaminates $\hat{\tau}$. DML allows flexible learners to discover the structure.
Many potential controls with uncertain relevance. When theory does not clearly dictate which of 200 available covariates belong in the model, DML sidesteps the variable selection problem. The ML learner selects or weights controls based on their predictive power, and the Neyman orthogonality + cross-fitting combination ensures valid inference despite the data-driven selection.
Post-selection inference is needed. If you use LASSO to select controls and then run OLS on the selected set, the standard errors do not account for the selection step and confidence intervals are too narrow. DML provides valid post-selection inference by construction.
You have 200 potential control variables, 5,000 observations, and a binary treatment. You run LASSO to select the 20 most important controls, then run OLS on the selected controls with standard confidence intervals. Is the inference valid?
Head-to-Head Comparison
| Feature | OLS with Controls | DML |
|---|---|---|
| Functional form of $g_0$ | Linear (researcher specifies nonlinearities manually) | Flexible (learned by ML) |
| Number of controls | Low to moderate ($p \ll n$) | Low to high ($p$ can approach $n$) |
| Control selection | Researcher-specified | Data-driven (ML + cross-fitting) |
| Key assumption | CIA + correct linear specification | CIA + mild ML convergence rates ($n^{-1/4}$ rates suffice) |
| Inference validity | Standard (if model correctly specified) | Valid even after ML-based control selection |
| Transparency | High (all coefficients interpretable) | Lower (nuisance functions are black boxes) |
| Computational cost | Trivial | Moderate (K-fold ML training) |
| Implementation | Every statistics package | Specialized packages (DoubleML, EconML) |
| Small sample performance | Good with parsimonious models | Can be poor if ML models overfit |
| Sensitivity to specification | High (which controls? which interactions?) | Lower (ML adapts to data) |
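As the implementation row notes, DML is available in specialized packages. For comparison with the manual cross-fitting loop in the simulation below, here is a sketch of fitting the same partially linear model with the DoubleML R package. This assumes its current R6 interface (argument names such as `ml_l`, formerly `ml_g`, have changed across package versions) and uses a data frame `df` containing `Y`, `D`, and the controls:

```r
# Sketch: partially linear DML via the DoubleML package.
# Assumes DoubleML, mlr3, and mlr3learners are installed; argument
# names may differ in older package versions.
library(DoubleML)
library(mlr3)
library(mlr3learners)

# df contains the outcome Y, treatment D, and the control columns
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D")

ml_l <- lrn("regr.ranger", num.trees = 300)  # outcome model E[Y | X]
ml_m <- lrn("regr.ranger", num.trees = 300)  # treatment model E[D | X]

dml_plr <- DoubleMLPLR$new(dml_data, ml_l = ml_l, ml_m = ml_m, n_folds = 5)
dml_plr$fit()
dml_plr$summary()  # coefficient, SE, and confidence interval for tau
```

The package handles cross-fitting, the orthogonal score, and the variance estimator internally, which reduces the room for implementation error relative to a hand-rolled loop.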
Simulation: OLS vs. DML on the Same Data
The simulation below generates data with nonlinear confounding and compares OLS to DML. The true treatment effect is $\tau = 1$. Confounders affect both treatment and outcome through nonlinear functions ($g_0$ and $m_0$ in the code) that OLS cannot capture with a linear specification.
```r
library(ranger)

set.seed(42)
n <- 5000
p <- 20

# Generate confounders
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", 1:p)

# Nonlinear confounding functions: g0 (outcome), m0 (treatment)
g0 <- sin(2 * X[, 1]) + X[, 2]^2 - X[, 3] * X[, 4] +
  exp(-X[, 5]^2) + 0.5 * abs(X[, 6])
m0 <- 0.5 * (tanh(X[, 1] + X[, 2]) + 0.3 * X[, 3]^2 -
  0.2 * X[, 4] * X[, 5])

# Generate treatment and outcome
tau_true <- 1.0
D <- m0 + rnorm(n)
Y <- tau_true * D + g0 + rnorm(n)

# --- OLS ---
df <- data.frame(Y = Y, D = D, X)
ols_fit <- lm(Y ~ D + ., data = df)
tau_ols <- coef(ols_fit)["D"]
se_ols <- sqrt(sandwich::vcovHC(ols_fit, type = "HC1")["D", "D"])
cat(sprintf("OLS: tau = %.3f (SE = %.3f), 95%% CI = [%.3f, %.3f]\n",
            tau_ols, se_ols, tau_ols - 1.96 * se_ols, tau_ols + 1.96 * se_ols))

# --- DML with cross-fitting ---
K <- 5
folds <- sample(rep(1:K, length.out = n))
Y_res <- numeric(n)
D_res <- numeric(n)
for (k in 1:K) {
  test_idx <- which(folds == k)
  train_idx <- which(folds != k)
  train_df <- data.frame(X[train_idx, ])
  test_df <- data.frame(X[test_idx, ])
  # Outcome model: E[Y | X]
  rf_y <- ranger(y ~ ., data = data.frame(y = Y[train_idx], train_df),
                 num.trees = 300, min.node.size = 5)
  Y_res[test_idx] <- Y[test_idx] - predict(rf_y, test_df)$predictions
  # Treatment model: E[D | X]
  rf_d <- ranger(y ~ ., data = data.frame(y = D[train_idx], train_df),
                 num.trees = 300, min.node.size = 5)
  D_res[test_idx] <- D[test_idx] - predict(rf_d, test_df)$predictions
}

# Final stage: regress Y residuals on D residuals
tau_dml <- sum(D_res * Y_res) / sum(D_res^2)
psi <- D_res * (Y_res - tau_dml * D_res)
V <- mean(psi^2) / (mean(D_res^2))^2
se_dml <- sqrt(V / n)
cat(sprintf("DML: tau = %.3f (SE = %.3f), 95%% CI = [%.3f, %.3f]\n",
            tau_dml, se_dml, tau_dml - 1.96 * se_dml, tau_dml + 1.96 * se_dml))

cat(sprintf("\nTrue tau = %.1f\n", tau_true))
cat(sprintf("OLS bias = %.3f\n", tau_ols - tau_true))
cat(sprintf("DML bias = %.3f\n", tau_dml - tau_true))
```

In the simulation above, OLS is biased because the linear specification cannot capture the nonlinear confounding functions $g_0$ and $m_0$. The misspecified residuals from the linear projection of $Y$ on $X$ retain systematic components that are correlated with $D$, pulling $\hat{\tau}_{\text{OLS}}$ away from the true value. DML, using random forests for the nuisance functions, flexibly removes the nonlinear confounding from both sides and recovers an estimate close to $\tau = 1$.
Common Misconceptions
"DML always beats OLS"
Not true. When the true data-generating process is linear and the control set is small, OLS is unbiased, efficient, and transparent. DML introduces additional variance from the ML estimation and cross-fitting steps without any corresponding bias reduction. In low-dimensional linear settings, OLS is hard to beat.
"DML handles omitted variable bias"
Not true. DML addresses functional form misspecification and high-dimensional observed controls, but the conditional independence assumption is the same for DML and OLS. If important confounders are unobserved, DML is biased for exactly the same reason OLS is biased. DML solves the included variables problem (how to control for many observed variables flexibly), not the omitted variables problem.
"DML is a black box"
Partially true. The nuisance function estimates (a random forest or gradient boosted ensemble) are opaque, but the treatment effect estimate itself is a simple ratio — the OLS coefficient from regressing outcome residuals on treatment residuals. The black box is in the control function, not in the parameter of interest. You can and should inspect the ML models' predictive performance as a diagnostic.
"You need big data for DML"
Context-dependent. DML requires enough data for the ML learners to estimate the nuisance functions reasonably well and for cross-fitting to work without excessive variance. With a few hundred observations and complex nuisance functions, DML can be noisy. But with 2,000-5,000+ observations and moderate-dimensional controls, DML typically performs well. Most empirical economics datasets are large enough.
A Decision Framework
Use the following questions to guide your choice between OLS and DML:
1. How many potential controls do you have?
- Few (5-15), each with a clear theoretical justification: OLS is natural.
- Many (50+), with uncertain relevance: DML handles the dimensionality and selection.
2. Is the confounding relationship likely nonlinear?
- Yes, or uncertain: DML can discover the structure. OLS requires you to specify interactions and nonlinearities manually.
- No, approximately linear: OLS is simpler and equally valid.
3. How large is your sample?
- Small (a few hundred observations): OLS with a parsimonious specification. DML's ML learners may not have enough data.
- Moderate to large ($n$ in the thousands or more): DML is feasible and its asymptotic guarantees become relevant.
4. Do you need valid post-selection inference?
- Yes (you are selecting controls from a large set): DML provides valid confidence intervals. Post-LASSO OLS does not.
- No (the control set is fixed by theory): OLS inference is valid under correct specification.
5. What is your identification strategy?
- Selection on observables with many covariates: DML is well-suited.
- Clean research design (RCT, RDD, IV, DiD): the design provides identification, and adding ML-based controls is unlikely to change the estimate. Stick with OLS for transparency.
Practical Recommendations
Start with OLS. Even if you plan to use DML, run OLS first as a baseline. Coefficient stability between OLS and DML is informative: if the estimates are similar, the functional form concerns may be small. If the estimates diverge, DML is capturing nonlinear confounding that OLS misses.
Report both. Present the OLS estimate alongside the DML estimate. Agreement strengthens credibility; disagreement invites investigation into what DML is capturing that OLS cannot.
Choose ML learners thoughtfully. DML does not prescribe which ML method to use. Random forests, gradient boosting, LASSO, and neural networks all work. In practice, ensemble methods or stacking (averaging predictions from multiple learners) often perform best. Report which ML methods you used and how you tuned the learners.
Use five cross-fitting folds. Five folds is the standard default. Too few folds leave small training sets that may compromise ML performance; too many folds increase computational cost without meaningful gain.
Check first-stage prediction quality. Report the cross-validated $R^2$ for both the outcome model and the treatment model. If neither model predicts well, DML cannot effectively remove confounding, and the estimates will be noisy and may degenerate toward OLS.
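One way to compute that diagnostic from cross-fitted predictions is a small helper like the following; `oof_r2` is a name introduced here for illustration, not part of any package:

```r
# Out-of-fold R^2: 1 - SSE/SST, computed on held-out predictions.
# y_obs: observed outcome (or treatment) values
# y_hat: cross-fitted predictions, each made by a model that did
#        not see that observation during training
oof_r2 <- function(y_obs, y_hat) {
  sse <- sum((y_obs - y_hat)^2)
  sst <- sum((y_obs - mean(y_obs))^2)
  1 - sse / sst
}

# Sanity checks on synthetic values: a perfect predictor gives
# R^2 = 1; predicting the mean gives R^2 = 0.
y <- c(1, 2, 3, 4, 5)
oof_r2(y, y)                # 1
oof_r2(y, rep(mean(y), 5))  # 0
```

In the simulation above, where `Y_res` and `D_res` are cross-fitted residuals, `oof_r2(Y, Y - Y_res)` and `oof_r2(D, D - D_res)` give the out-of-fold fit of the outcome and treatment models, respectively.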
You run OLS with 10 hand-selected controls and get a treatment effect of 0.12 (SE = 0.04). You then run DML with 200 covariates and get a treatment effect of 0.05 (SE = 0.03). What is the most likely explanation for the discrepancy?