MethodAtlas
Replication · 120 minutes

Replication Lab: Demystifying Double Robustness

Replicate the simulation from Kang and Schafer (2007) showing how the doubly robust (AIPW) estimator performs when one or both models are misspecified. Compare IPW, outcome regression, and AIPW across four scenarios of correct and incorrect model specification.

Overview

In this replication lab, you will reproduce the central simulation from a widely cited paper that critically examined doubly robust estimation:

Kang, Joseph D. Y., and Joseph L. Schafer. 2007. "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data." Statistical Science 22(4): 523–539.

Kang and Schafer designed an influential simulation to systematically evaluate doubly robust (DR) estimators — specifically the augmented inverse probability weighting (AIPW) estimator — under four scenarios: (1) both propensity score and outcome models correct, (2) only the outcome model correct, (3) only the propensity score correct, and (4) both models incorrect. The paper showed that while AIPW delivers on its theoretical promise (consistency when at least one model is correct), practical performance can degrade when both models are wrong or when propensity scores are extreme.

Why the Kang and Schafer paper matters: It provided a rigorous and accessible demonstration of what double robustness means in practice, including its limitations, and it helped motivate subsequent developments in targeted learning (TMLE), entropy balancing, and cross-fitted AIPW.
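Throughout the lab, the AIPW estimator of the population mean E[Y] takes a single form: the outcome-model prediction for every unit, plus an inverse-probability-weighted residual correction on the observed units. A minimal sketch (the function name aipw_mean is ours, not from the paper):

```r
# AIPW estimate of E[Y]: outcome-model predictions mu for all n units,
# corrected by IPW-weighted residuals on the observed (Tr == 1) units
aipw_mean <- function(Y, Tr, ps, mu) {
  mean(mu + Tr * (Y - mu) / ps)
}
```

If mu is exactly E[Y | Z], the residual term is mean-zero for any ps; if ps is the true propensity score, the weighted residuals cancel the bias of a wrong mu. That is the double robustness the four scenarios below probe.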

What you will do:

  • Simulate data following the Kang-Schafer DGP with true and transformed covariates
  • Estimate the population mean under four model specification scenarios
  • Compare IPW, outcome regression, and AIPW estimators
  • Demonstrate the doubly robust property and its practical limitations
  • Examine extreme weight diagnostics

Step 1: Simulate the Kang-Schafer Data

The DGP generates four latent covariates (Z1–Z4) that determine both treatment assignment and the outcome. The researcher observes only nonlinear transformations (X1–X4) of the latent variables, creating a controlled misspecification scenario.

set.seed(2007)
n <- 1000

Z1 <- rnorm(n); Z2 <- rnorm(n); Z3 <- rnorm(n); Z4 <- rnorm(n)

logit_ps <- -Z1 + 0.5 * Z2 - 0.25 * Z3 - 0.1 * Z4
true_ps <- plogis(logit_ps)
Tr <- rbinom(n, 1, true_ps)

Y <- 210 + 27.4 * Z1 + 13.7 * Z2 + 13.7 * Z3 + 13.7 * Z4 + rnorm(n)

# Transformed covariates (misspecified)
X1 <- exp(Z1 / 2)
X2 <- Z2 / (1 + exp(Z1)) + 10
X3 <- (Z1 * Z3 / 25 + 0.6)^3
X4 <- (Z2 + Z4 + 20)^2

df <- data.frame(Y, Tr, Z1, Z2, Z3, Z4, X1, X2, X3, X4, true_ps)

cat("n =", n, ", Treatment rate:", mean(Tr), "\n")
cat("True mean: 210\n")
cat("Naive mean (treated):", mean(Y[Tr == 1]), "\n")

Expected output:

Sample size: 1000
Treatment rate: ~50%
True population mean: 210
Naive sample mean (treated only): ~200 (biased downward — treatment is more likely when Z1 is low, and Y rises with Z1)
Full sample mean: ~210

Step 2: Four Specification Scenarios

Estimate the population mean under four scenarios: (A) both models correct, (B) only outcome model correct, (C) only propensity score correct, (D) both models wrong.

# Scenario A: both models correct (fit on Z)
ps_ok <- pmin(pmax(predict(glm(Tr ~ Z1 + Z2 + Z3 + Z4, family = binomial,
                               data = df), type = "response"), 0.01), 0.99)
mu_ok <- predict(lm(Y ~ Z1 + Z2 + Z3 + Z4, data = df, subset = Tr == 1),
                 newdata = df)

ipw_A  <- sum(Tr * Y / ps_ok) / sum(Tr / ps_ok)
oreg_A <- mean(mu_ok)
aipw_A <- mean(mu_ok + Tr * (Y - mu_ok) / ps_ok)

# Scenario B: PS wrong (fit on X), outcome correct (Z)
ps_bad <- pmin(pmax(predict(glm(Tr ~ X1 + X2 + X3 + X4, family = binomial,
                               data = df), type = "response"), 0.01), 0.99)
ipw_B  <- sum(Tr * Y / ps_bad) / sum(Tr / ps_bad)   # biased: wrong PS
aipw_B <- mean(mu_ok + Tr * (Y - mu_ok) / ps_bad)

# Scenario C: PS correct (Z), outcome wrong (X)
mu_bad <- predict(lm(Y ~ X1 + X2 + X3 + X4, data = df, subset = Tr == 1),
                  newdata = df)
oreg_C <- mean(mu_bad)                              # biased: wrong outcome model
aipw_C <- mean(mu_bad + Tr * (Y - mu_bad) / ps_ok)

# Scenario D: both models wrong
aipw_D <- mean(mu_bad + Tr * (Y - mu_bad) / ps_bad)

cat("A (both correct) IPW:", round(ipw_A, 1), " OREG:", round(oreg_A, 1),
    " AIPW:", round(aipw_A, 1), "\n")
cat("B (PS wrong)     IPW:", round(ipw_B, 1), " AIPW:", round(aipw_B, 1), "\n")
cat("C (OR wrong)     OREG:", round(oreg_C, 1), " AIPW:", round(aipw_C, 1), "\n")
cat("D (both wrong)   AIPW:", round(aipw_D, 1), "\n")
cat("True mean: 210\n")

Expected output:

Scenario   PS Model      Outcome Model   IPW    OREG   AIPW   True
A          Correct (Z)   Correct (Z)     ~210   ~210   ~210   210
B          Wrong (X)     Correct (Z)     ~195   ~210   ~210   210
C          Correct (Z)   Wrong (X)       ~210   ~190   ~210   210
D          Wrong (X)     Wrong (X)       ~195   ~190   ~185   210

The table demonstrates the doubly robust property: AIPW recovers the true mean in Scenarios A, B, and C (wherever at least one model is correct). In Scenario D, when both models are misspecified, AIPW offers no guarantee and can be substantially biased.
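The table's pattern can be checked mechanically. When the propensity score is correct, the weights Tr/e(Z) average to one in expectation, so the augmentation term Tr * (Y - mu) / e(Z) is mean-zero no matter what mu predicts; symmetrically, a correct mu makes the residuals mean-zero no matter the weights. A standalone sketch of the first fact, using a toy one-covariate DGP of our own (not the Kang-Schafer one):

```r
set.seed(1)
n  <- 1e5
z  <- rnorm(n)
e  <- plogis(-z)            # true propensity score
Tr <- rbinom(n, 1, e)

# Horvitz-Thompson identity: E[Tr / e(Z)] = 1, so with the
# true PS the inverse-probability weights average to ~1
mean(Tr / e)                # ~1
```

Because the weights average to one, multiplying them by any fixed prediction error leaves the AIPW estimate centered on the truth — which is exactly what Scenario C exploits.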

Concept Check

In Scenario B (wrong propensity score, correct outcome model), the IPW estimator is biased but AIPW is not. What feature of the AIPW score function makes the estimator robust to propensity score misspecification?


Step 3: Extreme Weights and Practical Challenges

Kang and Schafer emphasized that misspecified propensity scores can generate extreme inverse probability weights, degrading performance in finite samples.

# Weight diagnostics
w_ok  <- Tr / ps_ok
w_bad <- Tr / ps_bad

cat("=== Weight Diagnostics ===\n")
cat("Correct PS mean:", mean(w_ok[Tr == 1]), " max:", max(w_ok[Tr == 1]), "\n")
cat("Wrong PS   mean:", mean(w_bad[Tr == 1]), " max:", max(w_bad[Tr == 1]), "\n")
cat("CV (sd/mean): correct =", sd(w_ok[Tr == 1]) / mean(w_ok[Tr == 1]),
    " wrong =", sd(w_bad[Tr == 1]) / mean(w_bad[Tr == 1]), "\n")

# Trimming sensitivity (Scenario D nuisances: ps_bad, mu_bad)
for (trim in c(0.01, 0.05, 0.10, 0.15)) {
  ps_t   <- pmin(pmax(ps_bad, trim), 1 - trim)
  ipw_t  <- sum(Tr * Y / ps_t) / sum(Tr / ps_t)
  aipw_t <- mean(mu_bad + Tr * (Y - mu_bad) / ps_t)
  cat("trim =", trim, ": IPW =", round(ipw_t, 1),
      " AIPW =", round(aipw_t, 1), "\n")
}

Expected output — Weight diagnostics:

Metric                  Correct PS   Wrong PS
Mean weight (treated)   ~2.0         ~2.5
Max weight              ~8.5         ~95.0
CV (sd/mean)            ~0.75        ~4.80
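A compact way to summarize this instability (standard survey-weighting practice, not part of the original code) is the Kish effective sample size, (Σw)² / Σw²: equal weights return n, while a handful of huge weights collapse it.

```r
# Kish effective sample size of a weight vector
ess <- function(w) sum(w)^2 / sum(w^2)

ess(rep(2, 100))            # equal weights: 100
ess(c(rep(1, 99), 100))     # one dominant weight: ~3.9
```

In the lab, compare ess(w_ok[Tr == 1]) with ess(w_bad[Tr == 1]); the wrong-PS weights should imply a much smaller effective sample behind the IPW estimate.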

Trimming effect (Scenario D):

Trim Level       IPW    AIPW
0.01 (minimal)   ~195   ~185
0.05             ~200   ~192
0.10             ~203   ~198
0.15             ~205   ~201

Trimming helps stabilize both IPW and AIPW by capping extreme weights, but it introduces a bias-variance tradeoff.


Step 4: Monte Carlo Study

Run a Monte Carlo study across the four scenarios to assess bias, variance, and coverage systematically.

# Monte Carlo
n_mc <- 500
n    <- 1000
mc_aipw <- data.frame(A = numeric(n_mc), B = numeric(n_mc),
                      C = numeric(n_mc), D = numeric(n_mc))

for (sim in 1:n_mc) {
  set.seed(sim + 2007)
  z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n); z4 <- rnorm(n)
  t_s <- rbinom(n, 1, plogis(-z1 + 0.5 * z2 - 0.25 * z3 - 0.1 * z4))
  y_s <- 210 + 27.4 * z1 + 13.7 * z2 + 13.7 * z3 + 13.7 * z4 + rnorm(n)

  d <- data.frame(y = y_s, t = t_s, z1, z2, z3, z4,
                  x1 = exp(z1 / 2), x2 = z2 / (1 + exp(z1)) + 10,
                  x3 = (z1 * z3 / 25 + 0.6)^3, x4 = (z2 + z4 + 20)^2)

  # Fit on a data frame (not bare matrices) so predict() resolves
  # newdata columns correctly for the subset-fitted outcome models
  ps_c <- pmin(pmax(predict(glm(t ~ z1 + z2 + z3 + z4, binomial, data = d),
                            type = "response"), 0.01), 0.99)
  ps_w <- pmin(pmax(predict(glm(t ~ x1 + x2 + x3 + x4, binomial, data = d),
                            type = "response"), 0.01), 0.99)
  mu_c <- predict(lm(y ~ z1 + z2 + z3 + z4, data = d, subset = t == 1),
                  newdata = d)
  mu_w <- predict(lm(y ~ x1 + x2 + x3 + x4, data = d, subset = t == 1),
                  newdata = d)

  mc_aipw$A[sim] <- mean(mu_c + t_s * (y_s - mu_c) / ps_c)
  mc_aipw$B[sim] <- mean(mu_c + t_s * (y_s - mu_c) / ps_w)
  mc_aipw$C[sim] <- mean(mu_w + t_s * (y_s - mu_w) / ps_c)
  mc_aipw$D[sim] <- mean(mu_w + t_s * (y_s - mu_w) / ps_w)
}

cat("AIPW A (both correct):", mean(mc_aipw$A), "sd:", sd(mc_aipw$A), "\n")
cat("AIPW D (both wrong):  ", mean(mc_aipw$D), "sd:", sd(mc_aipw$D), "\n")

Expected output — Monte Carlo summary:

Scenario            Estimator   Mean   Bias   SD     RMSE
A (both correct)    AIPW        ~210   ~0     ~1.5   ~1.5
B (PS wrong)        AIPW        ~210   ~0     ~3.0   ~3.0
C (outcome wrong)   AIPW        ~210   ~0     ~3.5   ~3.5
D (both wrong)      AIPW        ~185   ~-25   ~15    ~29

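The Bias/SD/RMSE columns can be produced with a small helper (mc_summary is our name, not from the paper). It is demonstrated here on a synthetic vector so the snippet runs on its own; in the lab, apply it to each column of mc_aipw with truth = 210:

```r
# Summarize Monte Carlo draws of an estimator against the true value
mc_summary <- function(est, truth) {
  c(mean = mean(est), bias = mean(est) - truth,
    sd = sd(est), rmse = sqrt(mean((est - truth)^2)))
}

# Synthetic check; for the lab: t(sapply(mc_aipw, mc_summary, truth = 210))
set.seed(1)
round(mc_summary(rnorm(500, mean = 210, sd = 1.5), truth = 210), 2)
```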
Concept Check

In the Kang-Schafer simulation, AIPW performs WORSE than either IPW or outcome regression alone when both models are misspecified (Scenario D). Why can combining two wrong models be worse than using just one?


Step 5: Compare with Published Results and Modern Extensions

cat("=== Summary ===\n")
cat("Scenario A AIPW:", round(aipw_A, 1), "\n")
cat("Scenario B AIPW:", round(aipw_B, 1), "\n")
cat("Scenario C AIPW:", round(aipw_C, 1), "\n")
cat("Scenario D AIPW:", round(aipw_D, 1), "\n")
cat("True: 210\n")
cat("\nConclusion: DR property confirmed for A/B/C; D shows limitations.\n")

Expected output — Final comparison:

Scenario           AIPW Estimate   True Mean   Status
A: Both correct    ~210            210         Unbiased
B: PS wrong        ~210            210         DR saves the day
C: Outcome wrong   ~210            210         DR saves the day
D: Both wrong      ~185            210         Biased (no guarantee)

Summary

The replication of Kang and Schafer (2007) demonstrates:

  1. Double robustness is real. AIPW is consistent when at least one model is correctly specified. The Monte Carlo confirms near-zero bias in Scenarios A, B, and C.

  2. Double robustness has limits. When both models are misspecified (Scenario D), AIPW offers no guarantee and can perform worse than simpler methods.

  3. Extreme weights are a practical concern. Misspecified propensity scores produce very large weights that inflate variance and amplify bias.

  4. Invest in both models. Use flexible ML methods to reduce the chance that either model is badly wrong. Cross-fitted AIPW (as in DML) further reduces finite-sample bias.

  5. AIPW is efficient. When both models are correct, AIPW achieves the semiparametric efficiency bound.
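The efficiency claim in point 5 can be stated precisely: AIPW is the sample average of the efficient influence function for the population mean. In the notation of this lab (e = propensity score model, m = outcome model), a sketch in LaTeX:

```latex
\hat{\mu}_{\mathrm{AIPW}}
  = \frac{1}{n}\sum_{i=1}^{n}\Big[\, m(Z_i) + \frac{T_i\,(Y_i - m(Z_i))}{e(Z_i)} \Big],
\qquad
\psi(O) = m(Z) + \frac{T\,(Y - m(Z))}{e(Z)} - \mu .
```

When both e and m are correct, the variance of ψ(O) attains the semiparametric efficiency bound, so no regular estimator using the same information can do asymptotically better.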


Extension Exercises

  1. TMLE comparison. Implement targeted maximum likelihood estimation (TMLE) and compare with AIPW across all four scenarios. TMLE adds a targeting step that can improve finite-sample performance.

  2. ML nuisance estimation. Replace logistic regression and OLS with random forests for both models. Does flexible ML estimation help in Scenario D?

  3. Entropy balancing. Replace IPW with entropy balancing weights (Hainmueller, 2012) and examine whether improved covariate balance reduces sensitivity to PS misspecification.

  4. Overlap violations. Increase the magnitude of the PS coefficients to create more extreme propensity scores. At what point does AIPW break down even in Scenario A?

  5. Sample size effects. Run the Monte Carlo with n = 200, 500, 1000, and 5000. How quickly does the finite-sample bias shrink?

  6. Cross-fitted AIPW. Implement AIPW with cross-fitting (as in DML) and compare with the plug-in AIPW used in the original simulation. Does cross-fitting improve Scenario D?

  7. ATT estimation. Modify the estimand from the population mean to the average treatment effect on the treated. How do the four-scenario results change?

  8. Calibrated weights. Implement calibrated IPW weights that satisfy moment conditions exactly. Does calibration improve IPW performance in Scenario B?