DML vs. OLS: When Machine Learning Improves Causal Estimates
A practical comparison of OLS regression and Double/Debiased Machine Learning for causal inference. When does OLS suffice, and when do you need the flexibility of DML?
Two Ways to Estimate a Treatment Effect with Controls
The most common approach to estimating a causal effect in applied research is OLS regression with controls: regress the outcome on the treatment indicator and a set of covariates, then read off the treatment coefficient. OLS is simple, transparent, and well understood. But OLS rests on a strong functional form assumption — the conditional expectation of the outcome given covariates is linear — and when the true relationship is more complex, the treatment effect estimate inherits the misspecification bias.
Double/Debiased Machine Learning (DML), introduced by Chernozhukov et al. (2018), replaces the linear control function with flexible machine learning while preserving valid inference on the treatment effect. The intellectual foundation goes back to Robinson (1988), who showed that the parametric component of a partially linear model can be estimated at the $\sqrt{n}$ rate even when the nonparametric component converges slowly. DML extends Robinson's insight to the modern ML setting through two innovations: Neyman orthogonality and cross-fitting.
This guide compares the two approaches and helps you decide when the additional machinery of DML is warranted.
The Core Distinction: Linear vs. Flexible Nuisance Functions
Both OLS and DML estimate the same causal parameter $\tau$ in the partially linear model:

$$Y = \tau D + g_0(X) + \varepsilon, \qquad E[\varepsilon \mid D, X] = 0,$$

where $Y$ is the outcome, $D$ is treatment, $X$ is a vector of controls, and $g_0(X)$ is the nuisance function capturing confounding.
OLS: Assume $g_0$ Is Linear
OLS assumes $g_0(X) = X'\beta$, so you estimate:

$$Y = \tau D + X'\beta + \varepsilon.$$

The coefficient $\hat{\tau}$ is consistent if the linear specification is correct. When $g_0$ is actually nonlinear, OLS misspecifies the control function, and the misspecification leaks into $\hat{\tau}$ through the correlation between $D$ and the omitted nonlinear terms.
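The size of this leak can be written down explicitly. By a standard Frisch-Waugh-Lovell calculation, with $\tilde{D}$ the residual from the population linear projection of $D$ on $X$:

$$\hat{\tau}_{\text{OLS}} \xrightarrow{\,p\,} \tau + \frac{\operatorname{Cov}\big(\tilde{D},\, g_0(X)\big)}{\operatorname{Var}(\tilde{D})},$$

so the bias vanishes only when the nonlinear part of $g_0$ is uncorrelated with the component of $D$ that the linear projection on $X$ cannot explain.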
DML: Learn $g_0$ Flexibly
DML does not assume a functional form for $g_0$. Instead, DML also models the treatment assignment function $m_0(X) = E[D \mid X]$ and uses the double residualization procedure:
- Estimate $\hat{\ell}(X) \approx E[Y \mid X]$ and $\hat{m}(X) \approx E[D \mid X]$ using any ML method (random forest, gradient boosting, LASSO).
- Form residuals: $\tilde{Y} = Y - \hat{\ell}(X)$ and $\tilde{D} = D - \hat{m}(X)$.
- Regress $\tilde{Y}$ on $\tilde{D}$ to obtain $\hat{\tau}$.
The "double" in DML refers to partialing out confounders from both the outcome and the treatment. Single residualization — partialing out $X$ from the outcome only — is not Neyman-orthogonal and produces biased estimates when the ML learner is imperfect.
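The double-residualization procedure corresponds to the Neyman-orthogonal moment condition

$$E\big[(\tilde{Y} - \tau \tilde{D})\,\tilde{D}\big] = 0, \qquad \tilde{Y} = Y - E[Y \mid X], \quad \tilde{D} = D - E[D \mid X].$$

Orthogonality means the moment's derivative with respect to small perturbations of the two nuisance functions is zero, so first-order errors in the estimated nuisance functions affect $\hat{\tau}$ only through products of two estimation errors. This is what allows slow-converging ML learners in the first stage.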
When OLS Is Fine
OLS is an excellent choice — and often the better choice — in the following settings:
Low-dimensional, well-understood controls. You have 5-20 covariates, each chosen based on theory or institutional knowledge, and you are confident the linear specification captures the relevant variation. Adding DML machinery provides no bias reduction and introduces additional estimation noise.
Approximately linear confounding. If the conditional expectation function $g_0(X)$ is well-approximated by a linear function of $X$, OLS and DML will produce nearly identical estimates. DML does not help when the problem DML solves (nonlinear confounding) does not exist.
Small samples. DML's guarantees are asymptotic. In small samples (a few hundred observations), ML learners may estimate the nuisance functions poorly, and cross-fitting trains each nuisance model on only a fraction $(K-1)/K$ of the sample. OLS with a parsimonious specification may outperform DML in finite samples.
Transparency is paramount. OLS coefficients are directly interpretable, regression tables are standard, and readers can inspect every coefficient. For audiences and journals where interpretability dominates, OLS is a feature.
Clean identification from a research design. If treatment variation comes from an RCT, regression discontinuity, or a strong instrument, the primary concern is not the control function but the exogeneity of the design. Adding flexible ML-based controls on top of a clean design is unlikely to change the estimate meaningfully.
When DML Helps
DML provides meaningful advantages over OLS in the following settings:
High-dimensional controls. When the number of potential confounders is large (dozens to hundreds) — demographic variables, industry codes, geographic indicators, lagged outcomes, interaction terms — OLS either cannot include all of them (overfitting, multicollinearity) or the researcher must manually select a subset (introducing degrees of freedom). DML uses regularized ML to handle the full control set without biasing $\hat{\tau}$.
Nonlinear confounding. When the relationship between covariates and the outcome involves interactions, threshold effects, or nonlinearities that you cannot specify in advance, OLS misspecifies $g_0$ and the misspecification contaminates $\hat{\tau}$. DML allows flexible learners to discover the structure.
Many potential controls with uncertain relevance. When theory does not clearly dictate which of 200 available covariates belong in the model, DML sidesteps the variable selection problem. The ML learner selects or weights controls based on their predictive power, and the Neyman orthogonality + cross-fitting combination ensures valid inference despite the data-driven selection.
Post-selection inference is needed. If you use LASSO to select controls and then run OLS on the selected set, the standard errors do not account for the selection step and confidence intervals are too narrow. DML provides valid post-selection inference by construction.
You have 200 potential control variables, 5,000 observations, and a binary treatment. You run LASSO to select the 20 most important controls, then run OLS on the selected controls with standard confidence intervals. Is the inference valid?
Head-to-Head Comparison
| Feature | OLS with Controls | DML |
|---|---|---|
| Functional form of $g_0$ | Linear (researcher specifies nonlinearities manually) | Flexible (learned by ML) |
| Number of controls | Low to moderate ($p \ll n$) | Low to high ($p$ can approach $n$) |
| Control selection | Researcher-specified | Data-driven (ML + cross-fitting) |
| Key assumption | CIA + correct linear specification | CIA + mild ML convergence rates ($n^{-1/4}$ rates suffice) |
| Inference validity | Standard (if model correctly specified) | Valid even after ML-based control selection |
| Transparency | High (all coefficients interpretable) | Lower (nuisance functions are black boxes) |
| Computational cost | Trivial | Moderate (K-fold ML training) |
| Implementation | Every statistics package | Specialized packages (DoubleML, EconML) |
| Small sample performance | Good with parsimonious models | Can be poor if ML models overfit |
| Sensitivity to specification | High (which controls? which interactions?) | Lower (ML adapts to data) |
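As the implementation row notes, DML is available in specialized packages. For comparison with the manual cross-fitting loop in the simulation below, here is a sketch of fitting the same partially linear model with the DoubleML R package. This assumes its current R6 interface (argument names such as `ml_l`, formerly `ml_g`, have changed across package versions) and uses a data frame `df` containing `Y`, `D`, and the controls:

```r
# Sketch: partially linear DML via the DoubleML package.
# Assumes DoubleML, mlr3, and mlr3learners are installed; argument
# names may differ in older package versions.
library(DoubleML)
library(mlr3)
library(mlr3learners)

# df contains the outcome Y, treatment D, and the control columns
dml_data <- DoubleMLData$new(df, y_col = "Y", d_cols = "D")

ml_l <- lrn("regr.ranger", num.trees = 300)  # outcome model E[Y | X]
ml_m <- lrn("regr.ranger", num.trees = 300)  # treatment model E[D | X]

dml_plr <- DoubleMLPLR$new(dml_data, ml_l = ml_l, ml_m = ml_m, n_folds = 5)
dml_plr$fit()
dml_plr$summary()  # coefficient, SE, and confidence interval for tau
```

The package handles cross-fitting, the orthogonal score, and the variance estimator internally, which reduces the room for implementation error relative to a hand-rolled loop.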
Simulation: OLS vs. DML on the Same Data
The simulation below generates data with nonlinear confounding and compares OLS to DML. The true treatment effect is $\tau = 1$. Confounders affect both treatment and outcome through nonlinear functions ($g_0$ and $m_0$ in the code) that OLS cannot capture with a linear specification.
```r
library(ranger)

set.seed(42)
n <- 5000
p <- 20

# Generate confounders
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", 1:p)

# Nonlinear confounding functions: g0 (outcome), m0 (treatment)
g0 <- sin(2 * X[, 1]) + X[, 2]^2 - X[, 3] * X[, 4] +
  exp(-X[, 5]^2) + 0.5 * abs(X[, 6])
m0 <- 0.5 * (tanh(X[, 1] + X[, 2]) + 0.3 * X[, 3]^2 -
  0.2 * X[, 4] * X[, 5])

# Generate treatment and outcome
tau_true <- 1.0
D <- m0 + rnorm(n)
Y <- tau_true * D + g0 + rnorm(n)

# --- OLS ---
df <- data.frame(Y = Y, D = D, X)
ols_fit <- lm(Y ~ D + ., data = df)
tau_ols <- coef(ols_fit)["D"]
se_ols <- sqrt(sandwich::vcovHC(ols_fit, type = "HC1")["D", "D"])
cat(sprintf("OLS: tau = %.3f (SE = %.3f), 95%% CI = [%.3f, %.3f]\n",
            tau_ols, se_ols, tau_ols - 1.96 * se_ols, tau_ols + 1.96 * se_ols))

# --- DML with cross-fitting ---
K <- 5
folds <- sample(rep(1:K, length.out = n))
Y_res <- numeric(n)
D_res <- numeric(n)
for (k in 1:K) {
  test_idx <- which(folds == k)
  train_idx <- which(folds != k)
  train_df <- data.frame(X[train_idx, ])
  test_df <- data.frame(X[test_idx, ])
  # Outcome model: E[Y | X]
  rf_y <- ranger(y ~ ., data = data.frame(y = Y[train_idx], train_df),
                 num.trees = 300, min.node.size = 5)
  Y_res[test_idx] <- Y[test_idx] - predict(rf_y, test_df)$predictions
  # Treatment model: E[D | X]
  rf_d <- ranger(y ~ ., data = data.frame(y = D[train_idx], train_df),
                 num.trees = 300, min.node.size = 5)
  D_res[test_idx] <- D[test_idx] - predict(rf_d, test_df)$predictions
}

# Final stage: regress Y residuals on D residuals
tau_dml <- sum(D_res * Y_res) / sum(D_res^2)
psi <- D_res * (Y_res - tau_dml * D_res)
V <- mean(psi^2) / (mean(D_res^2))^2
se_dml <- sqrt(V / n)
cat(sprintf("DML: tau = %.3f (SE = %.3f), 95%% CI = [%.3f, %.3f]\n",
            tau_dml, se_dml, tau_dml - 1.96 * se_dml, tau_dml + 1.96 * se_dml))

cat(sprintf("\nTrue tau = %.1f\n", tau_true))
cat(sprintf("OLS bias = %.3f\n", tau_ols - tau_true))
cat(sprintf("DML bias = %.3f\n", tau_dml - tau_true))
```

In the simulation above, OLS is biased because the linear specification cannot capture the nonlinear confounding functions $g_0$ and $m_0$. The misspecified residuals from the linear projection of $Y$ on $X$ retain systematic components that are correlated with $D$, pulling $\hat{\tau}_{\text{OLS}}$ away from the true value. DML, using random forests for the nuisance functions, flexibly removes the nonlinear confounding from both sides and recovers an estimate close to $\tau = 1$.
Common Misconceptions
"DML always beats OLS"
Not true. When the true data-generating process is linear and the control set is small, OLS is unbiased, efficient, and transparent. DML introduces additional variance from the ML estimation and cross-fitting steps without any corresponding bias reduction. In low-dimensional linear settings, OLS is hard to beat.
"DML handles omitted variable bias"
Not true. DML addresses functional form misspecification and high-dimensional observed controls, but the conditional independence assumption is the same for DML and OLS. If important confounders are unobserved, DML is biased for exactly the same reason OLS is biased. DML solves the included variables problem (how to control for many observed variables flexibly), not the omitted variables problem.
"DML is a black box"
Partially true. The nuisance function estimates (a random forest or gradient boosted ensemble) are opaque, but the treatment effect estimate itself is a simple ratio — the OLS coefficient from regressing outcome residuals on treatment residuals. The black box is in the control function, not in the parameter of interest. You can and should inspect the ML models' predictive performance as a diagnostic.
"You need big data for DML"
Context-dependent. DML requires enough data for the ML learners to estimate the nuisance functions reasonably well and for cross-fitting to work without excessive variance. With a few hundred observations and complex nuisance functions, DML can be noisy. But with 2,000-5,000+ observations and moderate-dimensional controls, DML typically performs well. Most empirical economics datasets are large enough.
A Decision Framework
Use the following questions to guide your choice between OLS and DML:
1. How many potential controls do you have?
- Few (5-15), each with a clear theoretical justification: OLS is natural.
- Many (50+), with uncertain relevance: DML handles the dimensionality and selection.
2. Is the confounding relationship likely nonlinear?
- Yes, or uncertain: DML can discover the structure. OLS requires you to specify interactions and nonlinearities manually.
- No, approximately linear: OLS is simpler and equally valid.
3. How large is your sample?
- Small (a few hundred observations): OLS with a parsimonious specification. DML's ML learners may not have enough data.
- Moderate to large ($n$ in the thousands or more): DML is feasible and its asymptotic guarantees become relevant.
4. Do you need valid post-selection inference?
- Yes (you are selecting controls from a large set): DML provides valid confidence intervals. Post-LASSO OLS does not.
- No (the control set is fixed by theory): OLS inference is valid under correct specification.
5. What is your identification strategy?
- Selection on observables with many covariates: DML is well-suited.
- Clean research design (RCT, RDD, IV, DiD): the design provides identification, and adding ML-based controls is unlikely to change the estimate. Stick with OLS for transparency.
Practical Recommendations
Start with OLS. Even if you plan to use DML, run OLS first as a baseline. Coefficient stability between OLS and DML is informative: if the estimates are similar, the functional form concerns may be small. If the estimates diverge, DML is capturing nonlinear confounding that OLS misses.
Report both. Present the OLS estimate alongside the DML estimate. Agreement strengthens credibility; disagreement invites investigation into what DML is capturing that OLS cannot.
Choose ML learners thoughtfully. DML does not prescribe which ML method to use. Random forests, gradient boosting, LASSO, and neural networks all work. In practice, ensemble methods or stacking (averaging predictions from multiple learners) often perform best. Report which ML methods you used and how you tuned the learners.
Use five cross-fitting folds. Five folds is the standard default. Too few folds leave small training sets that may compromise ML performance; too many folds increase computational cost without meaningful gain.
Check first-stage prediction quality. Report the cross-validated $R^2$ for both the outcome model and the treatment model. If neither model predicts well, DML cannot effectively remove confounding, and the estimates will be noisy and may degenerate toward OLS.
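One way to compute that diagnostic from cross-fitted predictions is a small helper like the following; `oof_r2` is a name introduced here for illustration, not part of any package:

```r
# Out-of-fold R^2: 1 - SSE/SST, computed on held-out predictions.
# y_obs: observed outcome (or treatment) values
# y_hat: cross-fitted predictions, each made by a model that did
#        not see that observation during training
oof_r2 <- function(y_obs, y_hat) {
  sse <- sum((y_obs - y_hat)^2)
  sst <- sum((y_obs - mean(y_obs))^2)
  1 - sse / sst
}

# Sanity checks on synthetic values: a perfect predictor gives
# R^2 = 1; predicting the mean gives R^2 = 0.
y <- c(1, 2, 3, 4, 5)
oof_r2(y, y)                # 1
oof_r2(y, rep(mean(y), 5))  # 0
```

In the simulation above, where `Y_res` and `D_res` are cross-fitted residuals, `oof_r2(Y, Y - Y_res)` and `oof_r2(D, D - D_res)` give the out-of-fold fit of the outcome and treatment models, respectively.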
You run OLS with 10 hand-selected controls and get a treatment effect of 0.12 (SE = 0.04). You then run DML with 200 covariates and get a treatment effect of 0.05 (SE = 0.03). What is the most likely explanation for the discrepancy?