Replication Lab: Demystifying Double Robustness
Replicate the simulation from Kang and Schafer (2007) showing how the doubly robust (AIPW) estimator performs when one or both models are misspecified. Compare IPW, outcome regression, and AIPW across four scenarios of correct and incorrect model specification.
Overview
In this replication lab, you will reproduce the central simulation from a widely cited paper that critically examined doubly robust estimation:
Kang, Joseph D.Y., and Joseph L. Schafer. 2007. "Demystifying Double Robustness: A Simple Tutorial." Statistical Science 22(4): 523–539.
Kang and Schafer designed an influential simulation to systematically evaluate doubly robust (DR) estimators — specifically the augmented inverse probability weighting (AIPW) estimator — under four scenarios: (1) both propensity score and outcome models correct, (2) only the outcome model correct, (3) only the propensity score correct, and (4) both models incorrect. The paper showed that while AIPW delivers on its theoretical promise (consistency when at least one model is correct), practical performance can degrade when both models are wrong or when propensity scores are extreme.
Why the Kang and Schafer paper matters: It provided the first rigorous and accessible demonstration of what double robustness means in practice, including its limitations. The paper motivated subsequent developments in targeted learning (TMLE), entropy balancing, and cross-fitted AIPW.
What you will do:
- Simulate data following the Kang-Schafer DGP with true and transformed covariates
- Estimate the population mean under four model specification scenarios
- Compare IPW, outcome regression, and AIPW estimators
- Demonstrate the doubly robust property and its practical limitations
- Examine extreme weight diagnostics
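All three estimators of the population mean E[Y] recur throughout the lab, so it helps to see them as one-liners before working through the scenarios. The helper names (`aipw_mean`, `ipw_mean`, `oreg_mean`) are ours, not from the paper; the steps below write the same expressions inline:

```r
# AIPW estimate of E[Y] from treated-only outcomes:
# average of  mu_hat(X) + Tr * (Y - mu_hat(X)) / ps_hat(X)  over all units.
# Consistent if EITHER mu_hat or ps_hat is correctly specified.
aipw_mean <- function(Y, Tr, ps, mu) mean(mu + Tr * (Y - mu) / ps)

# Hajek (normalized) IPW and outcome-regression estimators for comparison
ipw_mean  <- function(Y, Tr, ps) sum(Tr * Y / ps) / sum(Tr / ps)
oreg_mean <- function(mu) mean(mu)
```
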
Step 1: Simulate the Kang-Schafer Data
The DGP generates four latent covariates (Z1–Z4) that determine both treatment assignment and the outcome. The researcher observes only nonlinear transformations (X1–X4) of the latent variables, creating a controlled misspecification scenario.
set.seed(2007)
n <- 1000
Z1 <- rnorm(n); Z2 <- rnorm(n); Z3 <- rnorm(n); Z4 <- rnorm(n)
logit_ps <- -Z1 + 0.5 * Z2 - 0.25 * Z3 - 0.1 * Z4
true_ps <- plogis(logit_ps)
Tr <- rbinom(n, 1, true_ps)
Y <- 210 + 27.4 * Z1 + 13.7 * Z2 + 13.7 * Z3 + 13.7 * Z4 + rnorm(n)
# Transformed covariates (misspecified)
X1 <- exp(Z1 / 2)
X2 <- Z2 / (1 + exp(Z1)) + 10
X3 <- (Z1 * Z3 / 25 + 0.6)^3
X4 <- (Z2 + Z4 + 20)^2
df <- data.frame(Y, Tr, Z1, Z2, Z3, Z4, X1, X2, X3, X4, true_ps)
cat("Sample size:", n, "\n")
cat("Treatment rate:", mean(Tr), "\n")
cat("True population mean: 210\n")
cat("Naive sample mean (treated only):", round(mean(Y[Tr == 1]), 1), "\n")
cat("Full sample mean:", round(mean(Y), 1), "\n")
Expected output:
Sample size: 1000
Treatment rate: ~50%
True population mean: 210
Naive sample mean (treated only): ~200 (biased downward: treatment probability decreases in Z1, which raises Y)
Full sample mean: ~210
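To see why the X-based models are "plausibly wrong", it helps to check how strongly the observed covariates track the latent ones: the correlations are high, so the misspecified models pass casual diagnostics while still distorting the functional form. A quick check, assuming the Step 1 objects are in the workspace:

```r
# Each X is a smooth transform of one or two Zs, so linear models in X look
# reasonable in routine diagnostics yet are misspecified for both the
# treatment and the outcome model.
round(cor(cbind(Z1, Z2, Z3, Z4), cbind(X1, X2, X3, X4)), 2)
```
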
Step 2: Four Specification Scenarios
Estimate the population mean under four scenarios: (A) both models correct, (B) only outcome model correct, (C) only propensity score correct, (D) both models wrong.
# Scenario A: both models correct (fit on latent Z)
ps_ok <- pmin(pmax(predict(glm(Tr ~ Z1 + Z2 + Z3 + Z4, family = binomial,
                               data = df), type = "response"), 0.01), 0.99)
mu_ok <- predict(lm(Y ~ Z1 + Z2 + Z3 + Z4, data = df, subset = Tr == 1),
                 newdata = df)
ipw_A <- sum(Tr * Y / ps_ok) / sum(Tr / ps_ok)   # IPW, correct PS (Scenarios A, C)
aipw_A <- mean(mu_ok + Tr * (Y - mu_ok) / ps_ok)
# Scenario B: PS wrong (fit on observed X), outcome model correct (Z)
ps_bad <- pmin(pmax(predict(glm(Tr ~ X1 + X2 + X3 + X4, family = binomial,
                                data = df), type = "response"), 0.01), 0.99)
ipw_B <- sum(Tr * Y / ps_bad) / sum(Tr / ps_bad) # IPW, wrong PS (Scenarios B, D)
aipw_B <- mean(mu_ok + Tr * (Y - mu_ok) / ps_bad)
# Scenario C: PS correct (Z), outcome model wrong (X)
mu_bad <- predict(lm(Y ~ X1 + X2 + X3 + X4, data = df, subset = Tr == 1),
                  newdata = df)
aipw_C <- mean(mu_bad + Tr * (Y - mu_bad) / ps_ok)
# Scenario D: both models wrong (X)
aipw_D <- mean(mu_bad + Tr * (Y - mu_bad) / ps_bad)
cat("IPW (correct PS):", round(ipw_A, 1), "| IPW (wrong PS):", round(ipw_B, 1), "\n")
cat("OREG (correct mu):", round(mean(mu_ok), 1), "| OREG (wrong mu):", round(mean(mu_bad), 1), "\n")
cat("A (both correct) AIPW:", round(aipw_A, 1), "\n")
cat("B (PS wrong) AIPW:", round(aipw_B, 1), "\n")
cat("C (OR wrong) AIPW:", round(aipw_C, 1), "\n")
cat("D (both wrong) AIPW:", round(aipw_D, 1), "\n")
cat("True mean: 210\n")
Expected output:
| Scenario | PS Model | Outcome Model | IPW | OREG | AIPW | True |
|---|---|---|---|---|---|---|
| A | Correct (Z) | Correct (Z) | ~210 | ~210 | ~210 | 210 |
| B | Wrong (X) | Correct (Z) | ~195 | ~210 | ~210 | 210 |
| C | Correct (Z) | Wrong (X) | ~210 | ~190 | ~210 | 210 |
| D | Wrong (X) | Wrong (X) | ~195 | ~190 | ~185 | 210 |
The table demonstrates the doubly robust property: AIPW recovers the true mean in Scenarios A, B, and C (wherever at least one model is correct). In Scenario D, when both models are misspecified, AIPW offers no guarantee and can be substantially biased.
In Scenario B (wrong propensity score, correct outcome model), the IPW estimator is biased but AIPW is not. What feature of the AIPW score function makes the estimator robust to propensity score misspecification?
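One way to see the answer empirically: the AIPW correction term Tr * (Y - mu_hat) / ps_hat has mean near zero whenever the residuals Y - mu_hat are mean-zero given covariates among the treated, no matter how wrong ps_hat is (as long as it stays bounded away from zero), so AIPW collapses to the consistent outcome-regression estimate. A quick check, assuming the Step 2 objects are in the workspace:

```r
# With a correct outcome model, residuals among the treated are pure noise,
# so reweighting them by ANY bounded 1/ps_hat averages out to roughly zero.
cat("Correction term (correct mu, wrong ps):",
    round(mean(Tr * (Y - mu_ok) / ps_bad), 2), "\n")
cat("Correction term (wrong mu, wrong ps):  ",
    round(mean(Tr * (Y - mu_bad) / ps_bad), 2), "\n")
```
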
Step 3: Extreme Weights and Practical Challenges
Kang and Schafer emphasized that misspecified propensity scores can generate extreme inverse probability weights, degrading performance in finite samples.
# Weight diagnostics (among the treated)
w_ok <- Tr / ps_ok
w_bad <- Tr / ps_bad
cat("=== Weight Diagnostics ===\n")
cat("Correct PS mean:", mean(w_ok[Tr == 1]), " max:", max(w_ok[Tr == 1]), "\n")
cat("Wrong PS mean:", mean(w_bad[Tr == 1]), " max:", max(w_bad[Tr == 1]), "\n")
cat("Correct PS CV:", sd(w_ok[Tr == 1]) / mean(w_ok[Tr == 1]), "\n")
cat("Wrong PS CV:", sd(w_bad[Tr == 1]) / mean(w_bad[Tr == 1]), "\n")
# Trimming sensitivity (Scenario D nuisance models)
for (trim in c(0.01, 0.05, 0.10, 0.15)) {
  ps_t <- pmin(pmax(ps_bad, trim), 1 - trim)
  ipw_t <- sum(Tr * Y / ps_t) / sum(Tr / ps_t)
  aipw_t <- mean(mu_bad + Tr * (Y - mu_bad) / ps_t)
  cat("trim =", trim, ": IPW =", round(ipw_t, 1),
      ", AIPW =", round(aipw_t, 1), "\n")
}
Expected output — Weight diagnostics:
| Metric | Correct PS | Wrong PS |
|---|---|---|
| Mean weight (treated) | ~2.0 | ~2.5 |
| Max weight | ~8.5 | ~95.0 |
| CV (std/mean) | ~0.75 | ~4.80 |
Trimming effect (Scenario D):
| Trim Level | IPW | AIPW |
|---|---|---|
| 0.01 (minimal) | ~195 | ~185 |
| 0.05 | ~200 | ~192 |
| 0.10 | ~203 | ~198 |
| 0.15 | ~205 | ~201 |
Trimming helps stabilize both IPW and AIPW by capping extreme weights, but it introduces a bias-variance tradeoff.
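A complementary diagnostic, not computed in the lab code above but standard practice, is the Kish effective sample size, ESS = (Σw)² / Σw², which translates weight variability into an equivalent number of unweighted observations. A sketch, assuming the weight vectors from the diagnostics above are in the workspace:

```r
# Kish effective sample size: equal weights give ESS = n; a few extreme
# weights shrink ESS far below the nominal sample size.
ess <- function(w) sum(w)^2 / sum(w^2)
cat("ESS with correct PS:", round(ess(w_ok[Tr == 1]), 1), "\n")
cat("ESS with wrong PS:  ", round(ess(w_bad[Tr == 1]), 1), "\n")
```
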
Step 4: Monte Carlo Study
Run a Monte Carlo study across the four scenarios to assess bias, variance, and coverage systematically.
# Monte Carlo
n_mc <- 500
mc_aipw <- data.frame(A = numeric(n_mc), B = numeric(n_mc),
                      C = numeric(n_mc), D = numeric(n_mc))
for (sim in 1:n_mc) {
  set.seed(sim + 2007)
  z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n); z4 <- rnorm(n)
  lp <- -z1 + 0.5*z2 - 0.25*z3 - 0.1*z4
  t_s <- rbinom(n, 1, plogis(lp))
  y_s <- 210 + 27.4*z1 + 13.7*z2 + 13.7*z3 + 13.7*z4 + rnorm(n)
  x1 <- exp(z1/2); x2 <- z2/(1+exp(z1)) + 10
  x3 <- (z1*z3/25 + 0.6)^3; x4 <- (z2+z4+20)^2
  # Use a data frame so predict() evaluates newdata correctly; matrix
  # predictors in a formula are silently resolved from the global environment.
  d <- data.frame(y_s, t_s, z1, z2, z3, z4, x1, x2, x3, x4)
  ps_c <- pmin(pmax(predict(glm(t_s ~ z1+z2+z3+z4, family = binomial, data = d),
                            type = "response"), 0.01), 0.99)
  ps_w <- pmin(pmax(predict(glm(t_s ~ x1+x2+x3+x4, family = binomial, data = d),
                            type = "response"), 0.01), 0.99)
  mu_c <- predict(lm(y_s ~ z1+z2+z3+z4, data = d, subset = t_s == 1), newdata = d)
  mu_w <- predict(lm(y_s ~ x1+x2+x3+x4, data = d, subset = t_s == 1), newdata = d)
  mc_aipw$A[sim] <- mean(mu_c + t_s*(y_s - mu_c)/ps_c)
  mc_aipw$B[sim] <- mean(mu_c + t_s*(y_s - mu_c)/ps_w)
  mc_aipw$C[sim] <- mean(mu_w + t_s*(y_s - mu_w)/ps_c)
  mc_aipw$D[sim] <- mean(mu_w + t_s*(y_s - mu_w)/ps_w)
}
for (sc in names(mc_aipw)) {
  est <- mc_aipw[[sc]]
  cat("AIPW", sc, "- mean:", round(mean(est), 1),
      "bias:", round(mean(est) - 210, 1),
      "sd:", round(sd(est), 1),
      "rmse:", round(sqrt(mean((est - 210)^2)), 1), "\n")
}
Expected output — Monte Carlo summary:
| Scenario | Estimator | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|
| A (both correct) | AIPW | ~210 | ~0 | ~1.5 | ~1.5 |
| B (PS wrong) | AIPW | ~210 | ~0 | ~3.0 | ~3.0 |
| C (outcome wrong) | AIPW | ~210 | ~0 | ~3.5 | ~3.5 |
| D (both wrong) | AIPW | ~185 | ~-25 | ~15 | ~29 |
In the Kang-Schafer simulation, AIPW performs worse than either IPW or outcome regression alone when both models are misspecified (Scenario D). Why can combining two wrong models be worse than using just one?
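One way to build intuition: AIPW is algebraically the outcome-regression estimate plus a weighted-residual correction, and in Scenario D the wrong weights concentrate on exactly the units where the wrong outcome model errs most, so the "correction" can push the estimate further from the truth. A sketch using the Scenario D objects from Step 2:

```r
# AIPW = OREG + correction. With both models wrong, the correction term is
# driven by large weights multiplying large residuals, amplifying the bias.
oreg_D <- mean(mu_bad)
corr_D <- mean(Tr * (Y - mu_bad) / ps_bad)
cat("OREG (wrong mu):", round(oreg_D, 1),
    "+ correction:", round(corr_D, 1),
    "= AIPW:", round(oreg_D + corr_D, 1), "\n")
```
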
Step 5: Compare with Published Results and Modern Extensions
cat("=== Summary ===\n")
cat("Scenario A AIPW:", round(aipw_A, 1), "\n")
cat("Scenario B AIPW:", round(aipw_B, 1), "\n")
cat("Scenario C AIPW:", round(aipw_C, 1), "\n")
cat("Scenario D AIPW:", round(aipw_D, 1), "\n")
cat("True: 210\n")
cat("\nConclusion: DR property confirmed for A/B/C; D shows limitations.\n")
Expected output — Final comparison:
| Scenario | AIPW Estimate | True Mean | Status |
|---|---|---|---|
| A: Both correct | ~210 | 210 | Unbiased |
| B: PS wrong | ~210 | 210 | DR saves the day |
| C: Outcome wrong | ~210 | 210 | DR saves the day |
| D: Both wrong | ~185 | 210 | Biased (no guarantee) |
Summary
The replication of Kang and Schafer (2007) demonstrates:
- Double robustness is real. AIPW is consistent when at least one model is correctly specified. The Monte Carlo confirms near-zero bias in Scenarios A, B, and C.
- Double robustness has limits. When both models are misspecified (Scenario D), AIPW offers no guarantee and can perform worse than simpler methods.
- Extreme weights are a practical concern. Misspecified propensity scores produce very large weights that inflate variance and amplify bias.
- Invest in both models. Use flexible ML methods to reduce the chance that either model is badly wrong. Cross-fitted AIPW (as in DML) further reduces finite-sample bias.
- AIPW is efficient. When both models are correct, AIPW achieves the semiparametric efficiency bound.
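The cross-fitting idea mentioned in the summary can be sketched in a few lines: split the sample into folds, fit both nuisance models on one fold, and evaluate the AIPW score on the other. This is a minimal 2-fold illustration with correctly specified (Z-based) models, assuming `df` from Step 1 is in the workspace; it is not the full DML procedure.

```r
# 2-fold cross-fitted AIPW: nuisances are never evaluated on their own
# training data, which removes overfitting bias from the plug-in estimator.
set.seed(1)
fold <- sample(rep(1:2, length.out = nrow(df)))
phi <- numeric(nrow(df))
for (k in 1:2) {
  train <- df[fold != k, ]
  eval_ <- df[fold == k, ]
  ps_fit <- glm(Tr ~ Z1 + Z2 + Z3 + Z4, family = binomial, data = train)
  mu_fit <- lm(Y ~ Z1 + Z2 + Z3 + Z4, data = subset(train, Tr == 1))
  ps_k <- pmin(pmax(predict(ps_fit, newdata = eval_, type = "response"), 0.01), 0.99)
  mu_k <- predict(mu_fit, newdata = eval_)
  phi[fold == k] <- mu_k + eval_$Tr * (eval_$Y - mu_k) / ps_k
}
cat("Cross-fitted AIPW (both models correct):", round(mean(phi), 1), "\n")
```
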
Extension Exercises
- TMLE comparison. Implement targeted maximum likelihood estimation (TMLE) and compare with AIPW across all four scenarios. TMLE adds a targeting step that can improve finite-sample performance.
- ML nuisance estimation. Replace logistic regression and OLS with random forests for both models. Does flexible ML estimation help in Scenario D?
- Entropy balancing. Replace IPW with entropy balancing weights (Hainmueller, 2012) and examine whether improved covariate balance reduces sensitivity to PS misspecification.
- Overlap violations. Increase the magnitude of the PS coefficients to create more extreme propensity scores. At what point does AIPW break down even in Scenario A?
- Sample size effects. Run the Monte Carlo with n = 200, 500, 1000, and 5000. How quickly does the finite-sample bias shrink?
- Cross-fitted AIPW. Implement AIPW with cross-fitting (as in DML) and compare with the plug-in AIPW used in the original simulation. Does cross-fitting improve Scenario D?
- ATT estimation. Modify the estimand from the population mean to the average treatment effect on the treated. How do the four-scenario results change?
- Calibrated weights. Implement calibrated IPW weights that satisfy moment conditions exactly. Does calibration improve IPW performance in Scenario B?