Replication Lab: Demystifying Double Robustness
Replicate the simulation from Kang and Schafer (2007) showing how the doubly robust (AIPW) estimator performs when one or both models are misspecified. Compare IPW, outcome regression, and AIPW across four scenarios of correct and incorrect model specification.
Overview
In this replication lab, you will reproduce the central simulation from a widely cited paper that critically examined doubly robust estimation:
Kang, Joseph D.Y., and Joseph L. Schafer. 2007. "Demystifying Double Robustness: A Simple Tutorial." Statistical Science 22(4): 523–539.
Kang and Schafer designed an influential simulation to systematically evaluate doubly robust (DR) estimators — specifically the augmented inverse probability weighting (AIPW) estimator — under four scenarios: (1) both propensity score and outcome models correct, (2) only the outcome model correct, (3) only the propensity score correct, and (4) both models incorrect. The paper showed that while AIPW delivers on its theoretical promise (consistency when at least one model is correct), practical performance can degrade when both models are wrong or when propensity scores are extreme.
Why the Kang and Schafer paper matters: It provided the first rigorous and accessible demonstration of what double robustness means in practice, including its limitations. The paper motivated subsequent developments in targeted learning (TMLE), entropy balancing, and cross-fitted AIPW.
What you will do:
- Simulate data following the Kang-Schafer DGP with true and transformed covariates
- Estimate the population mean under four model specification scenarios
- Compare IPW, outcome regression, and AIPW estimators
- Demonstrate the doubly robust property and its practical limitations
- Examine extreme weight diagnostics
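All three estimators of the population mean E[Y] recur throughout the lab, so it helps to see them as one-liners before working through the scenarios. The helper names (`aipw_mean`, `ipw_mean`, `oreg_mean`) are ours, not from the paper; the steps below write the same expressions inline:

```r
# AIPW estimate of E[Y] from treated-only outcomes:
# average of  mu_hat(X) + Tr * (Y - mu_hat(X)) / ps_hat(X)  over all units.
# Consistent if EITHER mu_hat or ps_hat is correctly specified.
aipw_mean <- function(Y, Tr, ps, mu) mean(mu + Tr * (Y - mu) / ps)

# Hajek (normalized) IPW and outcome-regression estimators for comparison
ipw_mean  <- function(Y, Tr, ps) sum(Tr * Y / ps) / sum(Tr / ps)
oreg_mean <- function(mu) mean(mu)
```
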
Step 1: Simulate the Kang-Schafer Data
The DGP generates four latent covariates (Z1–Z4) that determine both treatment assignment and the outcome. The researcher observes only nonlinear transformations (X1–X4) of the latent variables, creating a controlled misspecification scenario.
set.seed(2007)
n <- 1000
Z1 <- rnorm(n); Z2 <- rnorm(n); Z3 <- rnorm(n); Z4 <- rnorm(n)
logit_ps <- -Z1 + 0.5 * Z2 - 0.25 * Z3 - 0.1 * Z4
true_ps <- plogis(logit_ps)
Tr <- rbinom(n, 1, true_ps)
Y <- 210 + 27.4 * Z1 + 13.7 * Z2 + 13.7 * Z3 + 13.7 * Z4 + rnorm(n)
# Transformed covariates (misspecified)
X1 <- exp(Z1 / 2)
X2 <- Z2 / (1 + exp(Z1)) + 10
X3 <- (Z1 * Z3 / 25 + 0.6)^3
X4 <- (Z2 + Z4 + 20)^2
df <- data.frame(Y, Tr, Z1, Z2, Z3, Z4, X1, X2, X3, X4, true_ps)
cat("Sample size:", n, "\n")
cat("Treatment rate:", mean(Tr), "\n")
cat("True population mean: 210\n")
cat("Naive sample mean (treated only):", round(mean(Y[Tr == 1]), 1), "\n")
cat("Full sample mean:", round(mean(Y), 1), "\n")
Expected output:
Sample size: 1000
Treatment rate: ~50%
True population mean: 210
Naive sample mean (treated only): ~200 (biased downward: treatment probability decreases in Z1, which raises Y)
Full sample mean: ~210
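To see why the X-based models are "plausibly wrong", it helps to check how strongly the observed covariates track the latent ones: the correlations are high, so the misspecified models pass casual diagnostics while still distorting the functional form. A quick check, assuming the Step 1 objects are in the workspace:

```r
# Each X is a smooth transform of one or two Zs, so linear models in X look
# reasonable in routine diagnostics yet are misspecified for both the
# treatment and the outcome model.
round(cor(cbind(Z1, Z2, Z3, Z4), cbind(X1, X2, X3, X4)), 2)
```
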
Step 2: Four Specification Scenarios
Estimate the population mean under four scenarios: (A) both models correct, (B) only outcome model correct, (C) only propensity score correct, (D) both models wrong.
# Scenario A: both models correct (fit on latent Z)
ps_ok <- pmin(pmax(predict(glm(Tr ~ Z1 + Z2 + Z3 + Z4, family = binomial,
                               data = df), type = "response"), 0.01), 0.99)
mu_ok <- predict(lm(Y ~ Z1 + Z2 + Z3 + Z4, data = df, subset = Tr == 1),
                 newdata = df)
ipw_A <- sum(Tr * Y / ps_ok) / sum(Tr / ps_ok)   # IPW, correct PS (Scenarios A, C)
aipw_A <- mean(mu_ok + Tr * (Y - mu_ok) / ps_ok)
# Scenario B: PS wrong (fit on observed X), outcome model correct (Z)
ps_bad <- pmin(pmax(predict(glm(Tr ~ X1 + X2 + X3 + X4, family = binomial,
                                data = df), type = "response"), 0.01), 0.99)
ipw_B <- sum(Tr * Y / ps_bad) / sum(Tr / ps_bad) # IPW, wrong PS (Scenarios B, D)
aipw_B <- mean(mu_ok + Tr * (Y - mu_ok) / ps_bad)
# Scenario C: PS correct (Z), outcome model wrong (X)
mu_bad <- predict(lm(Y ~ X1 + X2 + X3 + X4, data = df, subset = Tr == 1),
                  newdata = df)
aipw_C <- mean(mu_bad + Tr * (Y - mu_bad) / ps_ok)
# Scenario D: both models wrong (X)
aipw_D <- mean(mu_bad + Tr * (Y - mu_bad) / ps_bad)
cat("IPW (correct PS):", round(ipw_A, 1), "| IPW (wrong PS):", round(ipw_B, 1), "\n")
cat("OREG (correct mu):", round(mean(mu_ok), 1), "| OREG (wrong mu):", round(mean(mu_bad), 1), "\n")
cat("A (both correct) AIPW:", round(aipw_A, 1), "\n")
cat("B (PS wrong) AIPW:", round(aipw_B, 1), "\n")
cat("C (OR wrong) AIPW:", round(aipw_C, 1), "\n")
cat("D (both wrong) AIPW:", round(aipw_D, 1), "\n")
cat("True mean: 210\n")
Expected output:
| Scenario | PS Model | Outcome Model | IPW | OREG | AIPW | True |
|---|---|---|---|---|---|---|
| A | Correct (Z) | Correct (Z) | ~210 | ~210 | ~210 | 210 |
| B | Wrong (X) | Correct (Z) | ~195 | ~210 | ~210 | 210 |
| C | Correct (Z) | Wrong (X) | ~210 | ~190 | ~210 | 210 |
| D | Wrong (X) | Wrong (X) | ~195 | ~190 | ~185 | 210 |
The table demonstrates the doubly robust property: AIPW recovers the true mean in Scenarios A, B, and C (wherever at least one model is correct). In Scenario D, when both models are misspecified, AIPW offers no guarantee and can be substantially biased.
In Scenario B (wrong propensity score, correct outcome model), the IPW estimator is biased but AIPW is not. What feature of the AIPW score function makes the estimator robust to propensity score misspecification?
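One way to see the answer empirically: the AIPW correction term Tr * (Y - mu_hat) / ps_hat has mean near zero whenever the residuals Y - mu_hat are mean-zero given covariates among the treated, no matter how wrong ps_hat is (as long as it stays bounded away from zero), so AIPW collapses to the consistent outcome-regression estimate. A quick check, assuming the Step 2 objects are in the workspace:

```r
# With a correct outcome model, residuals among the treated are pure noise,
# so reweighting them by ANY bounded 1/ps_hat averages out to roughly zero.
cat("Correction term (correct mu, wrong ps):",
    round(mean(Tr * (Y - mu_ok) / ps_bad), 2), "\n")
cat("Correction term (wrong mu, wrong ps):  ",
    round(mean(Tr * (Y - mu_bad) / ps_bad), 2), "\n")
```
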
Step 3: Extreme Weights and Practical Challenges
Kang and Schafer emphasized that misspecified propensity scores can generate extreme inverse probability weights, degrading performance in finite samples.
# Weight diagnostics (among the treated)
w_ok <- Tr / ps_ok
w_bad <- Tr / ps_bad
cat("=== Weight Diagnostics ===\n")
cat("Correct PS mean:", mean(w_ok[Tr == 1]), " max:", max(w_ok[Tr == 1]), "\n")
cat("Wrong PS mean:", mean(w_bad[Tr == 1]), " max:", max(w_bad[Tr == 1]), "\n")
cat("Correct PS CV:", sd(w_ok[Tr == 1]) / mean(w_ok[Tr == 1]), "\n")
cat("Wrong PS CV:", sd(w_bad[Tr == 1]) / mean(w_bad[Tr == 1]), "\n")
# Trimming sensitivity (Scenario D nuisance models)
for (trim in c(0.01, 0.05, 0.10, 0.15)) {
  ps_t <- pmin(pmax(ps_bad, trim), 1 - trim)
  ipw_t <- sum(Tr * Y / ps_t) / sum(Tr / ps_t)
  aipw_t <- mean(mu_bad + Tr * (Y - mu_bad) / ps_t)
  cat("trim =", trim, ": IPW =", round(ipw_t, 1),
      ", AIPW =", round(aipw_t, 1), "\n")
}
Expected output — Weight diagnostics:
| Metric | Correct PS | Wrong PS |
|---|---|---|
| Mean weight (treated) | ~2.0 | ~2.5 |
| Max weight | ~8.5 | ~95.0 |
| CV (std/mean) | ~0.75 | ~4.80 |
Trimming effect (Scenario D):
| Trim Level | IPW | AIPW |
|---|---|---|
| 0.01 (minimal) | ~195 | ~185 |
| 0.05 | ~200 | ~192 |
| 0.10 | ~203 | ~198 |
| 0.15 | ~205 | ~201 |
Trimming helps stabilize both IPW and AIPW by capping extreme weights, but it introduces a bias-variance tradeoff.
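A complementary diagnostic, not computed in the lab code above but standard practice, is the Kish effective sample size, ESS = (Σw)² / Σw², which translates weight variability into an equivalent number of unweighted observations. A sketch, assuming the weight vectors from the diagnostics above are in the workspace:

```r
# Kish effective sample size: equal weights give ESS = n; a few extreme
# weights shrink ESS far below the nominal sample size.
ess <- function(w) sum(w)^2 / sum(w^2)
cat("ESS with correct PS:", round(ess(w_ok[Tr == 1]), 1), "\n")
cat("ESS with wrong PS:  ", round(ess(w_bad[Tr == 1]), 1), "\n")
```
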
Step 4: Monte Carlo Study
Run a Monte Carlo study across the four scenarios to assess bias, variance, and coverage systematically.
# Monte Carlo
n_mc <- 500
mc_aipw <- data.frame(A = numeric(n_mc), B = numeric(n_mc),
                      C = numeric(n_mc), D = numeric(n_mc))
for (sim in 1:n_mc) {
  set.seed(sim + 2007)
  z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n); z4 <- rnorm(n)
  lp <- -z1 + 0.5*z2 - 0.25*z3 - 0.1*z4
  t_s <- rbinom(n, 1, plogis(lp))
  y_s <- 210 + 27.4*z1 + 13.7*z2 + 13.7*z3 + 13.7*z4 + rnorm(n)
  x1 <- exp(z1/2); x2 <- z2/(1+exp(z1)) + 10
  x3 <- (z1*z3/25 + 0.6)^3; x4 <- (z2+z4+20)^2
  # Use a data frame so predict() evaluates newdata correctly; matrix
  # predictors in a formula are silently resolved from the global environment.
  d <- data.frame(y_s, t_s, z1, z2, z3, z4, x1, x2, x3, x4)
  ps_c <- pmin(pmax(predict(glm(t_s ~ z1+z2+z3+z4, family = binomial, data = d),
                            type = "response"), 0.01), 0.99)
  ps_w <- pmin(pmax(predict(glm(t_s ~ x1+x2+x3+x4, family = binomial, data = d),
                            type = "response"), 0.01), 0.99)
  mu_c <- predict(lm(y_s ~ z1+z2+z3+z4, data = d, subset = t_s == 1), newdata = d)
  mu_w <- predict(lm(y_s ~ x1+x2+x3+x4, data = d, subset = t_s == 1), newdata = d)
  mc_aipw$A[sim] <- mean(mu_c + t_s*(y_s - mu_c)/ps_c)
  mc_aipw$B[sim] <- mean(mu_c + t_s*(y_s - mu_c)/ps_w)
  mc_aipw$C[sim] <- mean(mu_w + t_s*(y_s - mu_w)/ps_c)
  mc_aipw$D[sim] <- mean(mu_w + t_s*(y_s - mu_w)/ps_w)
}
for (sc in names(mc_aipw)) {
  est <- mc_aipw[[sc]]
  cat("AIPW", sc, "- mean:", round(mean(est), 1),
      "bias:", round(mean(est) - 210, 1),
      "sd:", round(sd(est), 1),
      "rmse:", round(sqrt(mean((est - 210)^2)), 1), "\n")
}
Expected output — Monte Carlo summary:
| Scenario | Estimator | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|
| A (both correct) | AIPW | ~210 | ~0 | ~1.5 | ~1.5 |
| B (PS wrong) | AIPW | ~210 | ~0 | ~3.0 | ~3.0 |
| C (outcome wrong) | AIPW | ~210 | ~0 | ~3.5 | ~3.5 |
| D (both wrong) | AIPW | ~185 | ~-25 | ~15 | ~29 |
In the Kang-Schafer simulation, AIPW performs worse than either IPW or outcome regression alone when both models are misspecified (Scenario D). Why can combining two wrong models be worse than using just one?
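One way to build intuition: AIPW is algebraically the outcome-regression estimate plus a weighted-residual correction, and in Scenario D the wrong weights concentrate on exactly the units where the wrong outcome model errs most, so the "correction" can push the estimate further from the truth. A sketch using the Scenario D objects from Step 2:

```r
# AIPW = OREG + correction. With both models wrong, the correction term is
# driven by large weights multiplying large residuals, amplifying the bias.
oreg_D <- mean(mu_bad)
corr_D <- mean(Tr * (Y - mu_bad) / ps_bad)
cat("OREG (wrong mu):", round(oreg_D, 1),
    "+ correction:", round(corr_D, 1),
    "= AIPW:", round(oreg_D + corr_D, 1), "\n")
```
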
Step 5: Compare with Published Results and Modern Extensions
cat("=== Summary ===\n")
cat("Scenario A AIPW:", round(aipw_A, 1), "\n")
cat("Scenario B AIPW:", round(aipw_B, 1), "\n")
cat("Scenario C AIPW:", round(aipw_C, 1), "\n")
cat("Scenario D AIPW:", round(aipw_D, 1), "\n")
cat("True: 210\n")
cat("\nConclusion: DR property confirmed for A/B/C; D shows limitations.\n")
Expected output — Final comparison:
| Scenario | AIPW Estimate | True Mean | Status |
|---|---|---|---|
| A: Both correct | ~210 | 210 | Unbiased |
| B: PS wrong | ~210 | 210 | DR saves the day |
| C: Outcome wrong | ~210 | 210 | DR saves the day |
| D: Both wrong | ~185 | 210 | Biased (no guarantee) |
Summary
The replication of Kang and Schafer (2007) demonstrates:
- Double robustness is real. AIPW is consistent when at least one model is correctly specified. The Monte Carlo confirms near-zero bias in Scenarios A, B, and C.
- Double robustness has limits. When both models are misspecified (Scenario D), AIPW offers no guarantee and can perform worse than simpler methods.
- Extreme weights are a practical concern. Misspecified propensity scores produce very large weights that inflate variance and amplify bias.
- Invest in both models. Use flexible ML methods to reduce the chance that either model is badly wrong. Cross-fitted AIPW (as in DML) further reduces finite-sample bias.
- AIPW is efficient. When both models are correct, AIPW achieves the semiparametric efficiency bound.
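The cross-fitting idea mentioned in the summary can be sketched in a few lines: split the sample into folds, fit both nuisance models on one fold, and evaluate the AIPW score on the other. This is a minimal 2-fold illustration with correctly specified (Z-based) models, assuming `df` from Step 1 is in the workspace; it is not the full DML procedure.

```r
# 2-fold cross-fitted AIPW: nuisances are never evaluated on their own
# training data, which removes overfitting bias from the plug-in estimator.
set.seed(1)
fold <- sample(rep(1:2, length.out = nrow(df)))
phi <- numeric(nrow(df))
for (k in 1:2) {
  train <- df[fold != k, ]
  eval_ <- df[fold == k, ]
  ps_fit <- glm(Tr ~ Z1 + Z2 + Z3 + Z4, family = binomial, data = train)
  mu_fit <- lm(Y ~ Z1 + Z2 + Z3 + Z4, data = subset(train, Tr == 1))
  ps_k <- pmin(pmax(predict(ps_fit, newdata = eval_, type = "response"), 0.01), 0.99)
  mu_k <- predict(mu_fit, newdata = eval_)
  phi[fold == k] <- mu_k + eval_$Tr * (eval_$Y - mu_k) / ps_k
}
cat("Cross-fitted AIPW (both models correct):", round(mean(phi), 1), "\n")
```
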
Extension Exercises
- TMLE comparison. Implement targeted maximum likelihood estimation (TMLE) and compare with AIPW across all four scenarios. TMLE adds a targeting step that can improve finite-sample performance.
- ML nuisance estimation. Replace logistic regression and OLS with random forests for both models. Does flexible ML estimation help in Scenario D?
- Entropy balancing. Replace IPW with entropy balancing weights (Hainmueller, 2012) and examine whether improved covariate balance reduces sensitivity to PS misspecification.
- Overlap violations. Increase the magnitude of the PS coefficients to create more extreme propensity scores. At what point does AIPW break down even in Scenario A?
- Sample size effects. Run the Monte Carlo with n = 200, 500, 1000, and 5000. How quickly does the finite-sample bias shrink?
- Cross-fitted AIPW. Implement AIPW with cross-fitting (as in DML) and compare with the plug-in AIPW used in the original simulation. Does cross-fitting improve Scenario D?
- ATT estimation. Modify the estimand from the population mean to the average treatment effect on the treated. How do the four-scenario results change?
- Calibrated weights. Implement calibrated IPW weights that satisfy moment conditions exactly. Does calibration improve IPW performance in Scenario B?