Lab: Experimental Design and RCTs
Design and analyze a randomized controlled trial step by step. Learn to randomize treatment, verify balance, estimate average treatment effects, explore heterogeneity, and conduct power analysis.
Overview
In this lab you will work through the full lifecycle of a randomized controlled trial (RCT) evaluating a hypothetical job training program. You will simulate random assignment, verify that randomization produced balanced groups, estimate the average treatment effect (ATE), test for heterogeneous effects, and conduct a power analysis.
What you will learn:
- How to randomly assign treatment and verify covariate balance
- How to estimate the ATE with and without regression adjustment
- How to test for heterogeneous treatment effects by subgroup
- How to conduct a power analysis for planning future experiments
- Common pitfalls in experimental analysis
Prerequisites: Familiarity with hypothesis testing (t-tests, p-values) and basic regression (OLS).
Step 1: Simulate the Experimental Data
We simulate 1,000 participants in a job training RCT. Each participant has baseline characteristics (age, education, prior earnings) and is randomly assigned to treatment or control.
```r
library(estimatr)      # lm_robust() for robust standard errors
library(modelsummary)

set.seed(42)
n <- 1000

# Baseline characteristics
age <- round(rnorm(n, 35, 8))
educ <- sample(c(10, 12, 14, 16, 18), n, replace = TRUE,
               prob = c(0.1, 0.3, 0.25, 0.25, 0.1))
prior_earnings <- 20000 + 1500 * educ + 200 * age + rnorm(n, sd = 5000)
female <- rbinom(n, 1, 0.45)

# Random assignment and heterogeneous treatment effect
treat <- rbinom(n, 1, 0.5)
tau_i <- 3000 + 500 * (educ < 14)  # extra $500 effect for low-education participants
earnings_post <- 25000 + 1200 * educ + 150 * age - 2000 * female +
  tau_i * treat + rnorm(n, sd = 6000)

df <- data.frame(age, educ, female, prior_earnings, treat, earnings_post)
cat("Treatment:", sum(treat), " Control:", sum(1 - treat), "\n")
summary(df)
```
Expected output:
Sample data (first 5 rows):
| age | educ | female | prior_earnings | treat | earnings_post |
|---|---|---|---|---|---|
| 39 | 14 | 0 | 50,312 | 1 | 56,891 |
| 31 | 12 | 1 | 40,105 | 0 | 39,422 |
| 42 | 16 | 0 | 55,780 | 1 | 61,203 |
| 28 | 10 | 1 | 35,620 | 0 | 30,115 |
| 36 | 18 | 0 | 58,410 | 0 | 52,748 |
Summary statistics:
| Variable | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| age | 35.0 | 8.0 | ~14 | ~60 |
| educ | 13.8 | 2.4 | 10 | 18 |
| female | 0.45 | 0.50 | 0 | 1 |
| prior_earnings | 47,700 | 7,500 | ~20,000 | ~75,000 |
| treat | 0.50 | 0.50 | 0 | 1 |
| earnings_post | 47,600 | 8,000 | ~15,000 | ~80,000 |
Approximately 500 participants are assigned to treatment and 500 to control.
Step 2: Check Covariate Balance
Randomization should produce groups that look similar on pre-treatment characteristics. A balance table is standard in any experimental paper.
```r
# Balance table: compare baseline means across arms with t-tests
balance_vars <- c("age", "educ", "female", "prior_earnings")
balance <- data.frame(
  Variable = balance_vars,
  Treat = sapply(balance_vars, function(v) mean(df[df$treat == 1, v])),
  Control = sapply(balance_vars, function(v) mean(df[df$treat == 0, v])),
  p_value = sapply(balance_vars, function(v)
    t.test(df[df$treat == 1, v], df[df$treat == 0, v])$p.value)
)
balance$Diff <- balance$Treat - balance$Control
# Round only the numeric columns: round() on the whole data frame
# errors because the Variable column is character
balance[, -1] <- round(balance[, -1], 3)
print(balance)
```
Expected output:
| Variable | Treat Mean | Control Mean | Difference | p-value |
|---|---|---|---|---|
| age | 35.1 | 34.9 | 0.2 | 0.74 |
| educ | 13.9 | 13.7 | 0.2 | 0.31 |
| female | 0.44 | 0.46 | -0.02 | 0.58 |
| prior_earnings | 47,850 | 47,550 | 300 | 0.55 |
All p-values are well above 0.05, confirming that randomization produced balanced treatment and control groups. No covariate systematically predicts treatment assignment.
You run balance checks on 20 baseline covariates and find that one has a p-value of 0.03. Should you be concerned about failed randomization?
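To build intuition for this question, here is a quick sketch. The placebo sample below (`bal_demo`, with illustrative variable names) is freshly simulated so the block is self-contained; in the lab you would run the omnibus regression on `df` from Step 1.

```r
# Under true randomization, each balance test's p-value is roughly
# uniform, so with 20 independent tests the chance of at least one
# p < 0.05 purely by chance is:
p_any <- 1 - 0.95^20
cat("P(at least one false flag):", round(p_any, 2), "\n")  # about 0.64

# An omnibus alternative that sidesteps multiple testing: regress
# treatment on all baseline covariates and use the joint F-test.
set.seed(7)
n_demo <- 500
bal_demo <- data.frame(
  age   = rnorm(n_demo, 35, 8),
  educ  = sample(c(10, 12, 14, 16, 18), n_demo, replace = TRUE),
  treat = rbinom(n_demo, 1, 0.5)
)
m_bal <- lm(treat ~ age + educ, data = bal_demo)
f <- summary(m_bal)$fstatistic
p_joint <- pf(f[1], f[2], f[3], lower.tail = FALSE)
cat("Joint F-test p-value:", round(p_joint, 2), "\n")
```

So one p-value of 0.03 among 20 tests is entirely consistent with successful randomization; worry only if the joint test rejects or the imbalances are substantively large.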
Step 3: Estimate the Average Treatment Effect
The simplest ATE estimator in an RCT is the difference in means. Regression adjustment can improve precision.
```r
# Method 1: Difference in means
ate_simple <- mean(df$earnings_post[df$treat == 1]) -
  mean(df$earnings_post[df$treat == 0])
cat("Simple difference in means:", round(ate_simple), "\n")

# Method 2: OLS with robust (HC2) standard errors
m1 <- lm_robust(earnings_post ~ treat, data = df, se_type = "HC2")
cat("OLS (no controls):", round(coef(m1)["treat"]),
    "(SE:", round(m1$std.error["treat"]), ")\n")

# Method 3: OLS with covariates fully interacted with treatment (Lin 2013).
# All interacted covariates must be mean-centered so that the coefficient
# on treat equals the ATE rather than the effect at covariate value zero.
df$age_c <- df$age - mean(df$age)
df$educ_c <- df$educ - mean(df$educ)
df$female_c <- df$female - mean(df$female)
m2 <- lm_robust(earnings_post ~ treat * (age_c + educ_c + female_c),
                data = df, se_type = "HC2")
cat("OLS (Lin estimator):", round(coef(m2)["treat"]),
    "(SE:", round(m2$std.error["treat"]), ")\n")
```
Expected output:
| Method | ATE Estimate | Robust SE | 95% CI |
|---|---|---|---|
| Simple difference in means | ~$3,200 | ~$510 | [$2,200, $4,200] |
| OLS (no controls) | ~$3,200 | ~$510 | [$2,200, $4,200] |
| OLS with controls (Lin estimator) | ~$3,200 | ~$460 | [$2,300, $4,100] |
The point estimates are nearly identical across all three methods — this is expected in a well-randomized experiment. Adding controls (the Lin estimator) does not change the coefficient but reduces the standard error by absorbing residual variation, yielding a tighter confidence interval.
Step 4: Test for Heterogeneous Treatment Effects
Does the training program work differently for different subgroups?
```r
# Interaction with education level
df$low_educ <- as.integer(df$educ < 14)
m_het <- lm_robust(earnings_post ~ treat * low_educ, data = df, se_type = "HC2")
summary(m_het)
cat("ATE for high-educ:", coef(m_het)["treat"], "\n")
cat("Additional for low-educ:", coef(m_het)["treat:low_educ"], "\n")

# Interaction with gender
m_gender <- lm_robust(earnings_post ~ treat * female, data = df, se_type = "HC2")
cat("Gender interaction p-value:",
    summary(m_gender)$coefficients["treat:female", "Pr(>|t|)"], "\n")
```
Expected output:
Heterogeneous effects by education:
| Subgroup | ATE Estimate | SE | p-value |
|---|---|---|---|
| High education (educ >= 14) | ~$3,000 | ~$600 | < 0.001 |
| Low education (educ < 14) | ~$3,500 | ~$700 | < 0.001 |
| Interaction (treat x low_educ) | ~$500 | ~$900 | ~0.55 |
Heterogeneous effects by gender:
| Subgroup | ATE Estimate | p-value on interaction |
|---|---|---|
| Male (female = 0) | ~$3,200 | — |
| Female (female = 1) | ~$3,200 | ~0.90 (not significant) |
The DGP builds in a $500 bonus effect for low-education participants (tau_i = 3000 + 500 * (educ < 14)), but with this sample size the interaction term is not statistically significant. The gender interaction is near zero because the DGP includes no differential treatment effect by gender.
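The non-significant interaction is a power problem, and we can quantify it with a back-of-the-envelope normal approximation. The ~$900 standard error is taken from the table above:

```r
# Approximate power to detect a $500 interaction whose standard
# error is about $900, at alpha = 0.05 (normal approximation)
se_int <- 900
delta_int <- 500
z_crit <- qnorm(0.975)
power_int <- pnorm(delta_int / se_int - z_crit) +
  pnorm(-delta_int / se_int - z_crit)
cat("Power for the interaction:", round(power_int, 2), "\n")  # about 0.09
```

With under 10% power, failing to reject tells us almost nothing; detecting interactions of this size reliably would require a far larger sample than detecting the main effect.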
Step 5: Power Analysis
Before running an experiment, you need to determine the required sample size to detect a meaningful effect. Let us compute power for this training program.
```r
# Power analysis for a $3,000 effect
effect_size <- 3000
sd_outcome <- sd(df$earnings_post)
cohen_d <- effect_size / sd_outcome

# Required per-group sample size for 80% power at alpha = 0.05
result <- power.t.test(delta = effect_size, sd = sd_outcome,
                       power = 0.80, sig.level = 0.05,
                       type = "two.sample")
cat("Cohen's d:", round(cohen_d, 3), "\n")
cat("Required N per group:", ceiling(result$n), "\n")
cat("Total N required:", ceiling(result$n) * 2, "\n")

# Power curve: power as a function of total sample size
n_seq <- seq(50, 1000, by = 25)  # per-group sizes
power_vals <- sapply(n_seq, function(n)
  power.t.test(n = n, delta = effect_size, sd = sd_outcome,
               sig.level = 0.05)$power)
plot(n_seq * 2, power_vals, type = "l", lwd = 2, col = "blue",
     xlab = "Total Sample Size", ylab = "Power",
     main = "Power Curve for Job Training RCT")
abline(h = 0.8, col = "red", lty = 2)  # 80% power threshold
```
Expected output:
| Parameter | Value |
|---|---|
| Effect size (dollars) | $3,000 |
| SD of outcome | ~$8,000 |
| Cohen's d | ~0.375 |
| Required N per group (80% power) | ~112 |
| Total N required | ~224 |
| Actual N in our experiment | 1,000 |
Power at various sample sizes:
| Total N | Power |
|---|---|
| 100 | ~0.30 |
| 200 | ~0.70 |
| 224 | ~0.80 |
| 500 | ~0.97 |
| 1,000 | ~1.00 |
You plan an RCT and your power analysis says you need 500 participants total. Your budget allows 600. A colleague suggests assigning 400 to treatment and 200 to control to 'learn more about the treatment group.' Is this a good idea?
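A rough way to evaluate the colleague's suggestion: for a fixed total N, the standard error of the difference in means scales with sqrt(1/n1 + 1/n0), which is minimized by an equal split. The sketch below uses the lab's approximate outcome SD ($8,000) and target effect ($3,000); the power formula is a normal approximation:

```r
# Compare a 300/300 split with a 400/200 split (total N = 600)
sd_y <- 8000
tau <- 3000
se_equal   <- sd_y * sqrt(1/300 + 1/300)
se_unequal <- sd_y * sqrt(1/400 + 1/200)

# Normal-approximation power at alpha = 0.05 (far tail ignored)
power_fn <- function(se) pnorm(tau / se - qnorm(0.975))
cat("SE (300/300):", round(se_equal),
    " power:", round(power_fn(se_equal), 3), "\n")
cat("SE (400/200):", round(se_unequal),
    " power:", round(power_fn(se_unequal), 3), "\n")
```

The 400/200 design has the same standard error as an equal split of only about 533 participants (the harmonic-mean effective sample size), so it throws away part of the budget. Unequal allocation is justified only when treatment and control differ in cost or outcome variance, not to "learn more" about one arm.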
Step 6: Attrition and Compliance
Real experiments face attrition (participants dropping out) and non-compliance (participants not taking the treatment). Let us simulate and handle these issues.
```r
# Simulate differential attrition: control drops out more often
set.seed(123)
attrition_prob <- 0.08 + 0.04 * (1 - df$treat)  # 8% treatment, 12% control
df$observed <- rbinom(n, 1, 1 - attrition_prob)
cat("Attrition (treatment):", 1 - mean(df$observed[df$treat == 1]), "\n")
cat("Attrition (control):", 1 - mean(df$observed[df$treat == 0]), "\n")

# Naive estimate on the observed (non-attrited) sample
df_obs <- df[df$observed == 1, ]
m_naive <- lm_robust(earnings_post ~ treat, data = df_obs, se_type = "HC2")
cat("Naive ATE:", round(coef(m_naive)["treat"]), "\n")

# For Lee bounds, see the leebounds package
# install.packages("leebounds")
# library(leebounds)
# leebounds(earnings_post ~ treat, data = df_obs)
```
Expected output:
Attrition rates:
| Group | Attrition Rate | N Observed |
|---|---|---|
| Treatment | ~8% | ~460 |
| Control | ~12% | ~440 |
| Difference | ~4 pp | — |
Lee (2009) bounds for the ATE:
| Estimate | Value |
|---|---|
| Naive ATE (observed sample) | ~$3,300 |
| Lee lower bound | ~$2,600 |
| Lee upper bound | ~$3,800 |
The control group has a higher attrition rate (about 12% vs. 8%), meaning the remaining control group may be positively selected. The Lee bounds trim the treatment group (the lower-attrition group) to equalize attrition rates, providing a worst-case interval for the true ATE. Because both bounds remain positive, we can be confident the treatment effect is positive even under differential attrition.
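The trimming logic behind Lee bounds is simple enough to compute by hand. A minimal sketch follows; it simulates its own stand-in data mirroring the attrition setup above (the DGP constants here are illustrative) so it runs on its own, but in the lab you would apply the same steps to `df`:

```r
# Stand-in data mirroring the attrition simulation (illustrative DGP)
set.seed(123)
n <- 1000
treat <- rbinom(n, 1, 0.5)
earnings_post <- 45000 + 3000 * treat + rnorm(n, sd = 8000)
observed <- rbinom(n, 1, 1 - (0.08 + 0.04 * (1 - treat)))

# Response (non-attrition) rates by arm
p1 <- mean(observed[treat == 1])
p0 <- mean(observed[treat == 0])
trim_share <- (p1 - p0) / p1  # share of observed treated outcomes to trim

y1 <- sort(earnings_post[treat == 1 & observed == 1])
y0 <- earnings_post[treat == 0 & observed == 1]
k <- floor(trim_share * length(y1))

# Trim the top k treated outcomes for the lower bound,
# the bottom k for the upper bound
lee_lower <- mean(y1[1:(length(y1) - k)]) - mean(y0)
lee_upper <- mean(y1[(k + 1):length(y1)]) - mean(y0)
cat("Lee bounds: [", round(lee_lower), ",", round(lee_upper), "]\n")
```

The bounds assume monotonic attrition (treatment only moves people into, never out of, the observed sample); when that assumption fails, the interval is no longer guaranteed to cover the true ATE.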
Step 7: Exercises
Try these on your own:
- Stratified randomization. Modify the randomization to stratify by gender and education level. Show that stratified randomization produces better balance than simple randomization.
- Cluster randomization. Suppose treatment is assigned at the firm level (50 firms, 20 workers each). Simulate this design and compare the standard errors with individual-level randomization.
- Pre-analysis plan. Write a short pre-analysis plan specifying: (a) the primary outcome, (b) the estimand, (c) the estimation method, (d) the subgroups you will examine, and (e) how you will handle attrition.
- Multiple outcomes. Add two more outcome variables (employment status and hours worked) and apply a Bonferroni correction for multiple hypothesis testing.
Summary
In this lab you learned:
- Random assignment makes treatment independent of potential outcomes, but you should still verify covariate balance with a balance table and address attrition and non-compliance
- The simple difference in means is an unbiased estimator of the ATE in an RCT
- Regression adjustment (especially the Lin estimator) can improve precision without introducing bias
- Heterogeneous treatment effect analysis is best pre-specified to avoid multiple testing problems
- Power analysis is conducted before the experiment to determine the required sample size
- Differential attrition can reintroduce selection bias even in a well-designed RCT