MethodAtlas
Tutorial · 90 minutes

Lab: Experimental Design and RCTs

Design and analyze a randomized controlled trial step by step. Learn to randomize treatment, verify balance, estimate average treatment effects, explore heterogeneity, and conduct power analysis.

Overview

In this lab you will work through the full lifecycle of a randomized controlled trial (RCT) evaluating a hypothetical job training program. You will simulate random assignment, verify that randomization produced balanced groups, estimate the average treatment effect (ATE), test for heterogeneous effects, and conduct a power analysis.

What you will learn:

  • How to randomly assign treatment and verify covariate balance
  • How to estimate the ATE with and without regression adjustment
  • How to test for heterogeneous treatment effects by subgroup
  • How to conduct a power analysis for planning future experiments
  • Common pitfalls in experimental analysis

Prerequisites: Familiarity with hypothesis testing (t-tests, p-values) and basic regression (OLS).


Step 1: Simulate the Experimental Data

We simulate 1,000 participants in a job training RCT. Each participant has baseline characteristics (age, education, prior earnings) and is randomly assigned to treatment or control.

library(estimatr)
library(modelsummary)

set.seed(42)
n <- 1000

age <- round(rnorm(n, 35, 8))
educ <- sample(c(10, 12, 14, 16, 18), n, replace = TRUE,
             prob = c(0.1, 0.3, 0.25, 0.25, 0.1))
prior_earnings <- 20000 + 1500 * educ + 200 * age + rnorm(n, sd = 5000)
female <- rbinom(n, 1, 0.45)

treat <- rbinom(n, 1, 0.5)

tau_i <- 3000 + 500 * (educ < 14)
earnings_post <- 25000 + 1200 * educ + 150 * age - 2000 * female +
               tau_i * treat + rnorm(n, sd = 6000)

df <- data.frame(age, educ, female, prior_earnings, treat, earnings_post)

cat("Treatment:", sum(treat), " Control:", sum(1 - treat), "\n")
summary(df)

Expected output:

Sample data (first 5 rows):

age  educ  female  prior_earnings  treat  earnings_post
39   14    0       50,312          1      56,891
31   12    1       40,105          0      39,422
42   16    0       55,780          1      61,203
28   10    1       35,620          0      30,115
36   18    0       58,410          0      52,748

Summary statistics:

Variable        Mean    Std Dev  Min      Max
age             35.0    8.0      ~14      ~60
educ            13.8    2.4      10       18
female          0.45    0.50     0        1
prior_earnings  47,700  7,500    ~20,000  ~75,000
treat           0.50    0.50     0        1
earnings_post   47,600  8,000    ~15,000  ~80,000

Approximately 500 participants are assigned to treatment and 500 to control.


Step 2: Check Covariate Balance

Randomization should produce groups that look similar on pre-treatment characteristics. A balance table is standard in any experimental paper.

# Balance table
balance_vars <- c("age", "educ", "female", "prior_earnings")

balance <- data.frame(
  Variable = balance_vars,
  Treat = sapply(balance_vars, function(v) mean(df[df$treat == 1, v])),
  Control = sapply(balance_vars, function(v) mean(df[df$treat == 0, v])),
  p_value = sapply(balance_vars, function(v)
    t.test(df[df$treat == 1, v], df[df$treat == 0, v])$p.value)
)
balance$Diff <- balance$Treat - balance$Control
balance[-1] <- round(balance[-1], 3)  # round only the numeric columns
print(balance)

Expected output:

Variable        Treat Mean  Control Mean  Difference  p-value
age             35.1        34.9          0.2         0.74
educ            13.9        13.7          0.2         0.31
female          0.44        0.46          -0.02       0.58
prior_earnings  47,850      47,550        300         0.55

All p-values are well above 0.05, consistent with successful randomization: no covariate systematically differs between the treatment and control groups. Note that a balance table can support, but never prove, correct randomization.

Concept Check

You run balance checks on 20 baseline covariates and find that one has a p-value of 0.03. Should you be concerned about failed randomization?
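A complementary diagnostic worth knowing (a sketch, assuming the simulated df from Step 1): instead of running many separate t-tests, regress treatment assignment on all baseline covariates at once and read the joint F-test, which summarizes balance in a single statistic.

```r
# Joint balance test: regress treatment on all baseline covariates.
# Under successful randomization no covariate should predict assignment,
# so the overall F-test should be insignificant.
m_bal <- lm(treat ~ age + educ + female + prior_earnings, data = df)
f <- summary(m_bal)$fstatistic
p_joint <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
cat("Joint F-test p-value:", round(p_joint, 3), "\n")
```

With 20 covariates tested individually at the 5% level, about one "significant" imbalance is expected by chance alone; the joint test avoids that multiple-comparisons trap.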


Step 3: Estimate the Average Treatment Effect

The simplest ATE estimator in an RCT is the difference in means. Regression adjustment can improve precision.

# Method 1: Difference in means
ate_simple <- mean(df$earnings_post[df$treat == 1]) -
            mean(df$earnings_post[df$treat == 0])
cat("Simple difference in means:", round(ate_simple), "\n")

# Method 2: OLS with robust SEs
m1 <- lm_robust(earnings_post ~ treat, data = df, se_type = "HC2")
cat("OLS (no controls):", round(coef(m1)["treat"]),
  "(SE:", round(m1$std.error["treat"]), ")\n")

# Method 3: OLS with controls (Lin 2013)
df$age_c <- df$age - mean(df$age)
df$educ_c <- df$educ - mean(df$educ)
m2 <- lm_robust(earnings_post ~ treat * (age_c + educ_c + female),
              data = df, se_type = "HC2")
cat("OLS (Lin estimator):", round(coef(m2)["treat"]),
  "(SE:", round(m2$std.error["treat"]), ")\n")

Expected output:

Method                             ATE Estimate  Robust SE  95% CI
Simple difference in means         ~$3,200       ~$510      [$2,200, $4,200]
OLS (no controls)                  ~$3,200       ~$510      [$2,200, $4,200]
OLS with controls (Lin estimator)  ~$3,200       ~$460      [$2,300, $4,100]

The point estimates are nearly identical across all three methods — this is expected in a well-randomized experiment. Adding controls (the Lin estimator) does not change the coefficient but reduces the standard error by absorbing residual variation, yielding a tighter confidence interval.


Step 4: Test for Heterogeneous Treatment Effects

Does the training program work differently for different subgroups?

# Interaction with education level
df$low_educ <- as.integer(df$educ < 14)
m_het <- lm_robust(earnings_post ~ treat * low_educ, data = df, se_type = "HC2")
summary(m_het)

cat("ATE for high-educ:", coef(m_het)["treat"], "\n")
cat("Additional for low-educ:", coef(m_het)["treat:low_educ"], "\n")

# Interaction with gender
m_gender <- lm_robust(earnings_post ~ treat * female, data = df, se_type = "HC2")
cat("Gender interaction p-value:", summary(m_gender)$coefficients["treat:female", "Pr(>|t|)"], "\n")

Expected output:

Heterogeneous effects by education:

Subgroup                        ATE Estimate  SE     p-value
High education (educ >= 14)     ~$3,000       ~$600  < 0.001
Low education (educ < 14)       ~$3,500       ~$700  < 0.001
Interaction (treat x low_educ)  ~$500         ~$900  ~0.55

Heterogeneous effects by gender:

Subgroup             ATE Estimate  p-value on interaction
Male (female = 0)    ~$3,200
Female (female = 1)  ~$3,200       ~0.90 (not significant)

The DGP builds in a $500 bonus effect for low-education participants (tau_i = 3000 + 500 * (educ < 14)), but with this sample size the interaction term is not statistically significant. The gender interaction is near zero because the DGP includes no differential treatment effect by gender.
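To see why the built-in $500 interaction goes undetected, we can approximate the power of a two-sided test of the interaction term from its standard error (a back-of-the-envelope sketch using the ~$900 SE reported above and a normal approximation):

```r
# Approximate power for a two-sided 5% test of the interaction:
# P(|estimate / SE| > 1.96) when the true effect is $500 and SE is ~$900.
delta <- 500
se    <- 900
power_int <- pnorm(-1.96 + delta / se) + pnorm(-1.96 - delta / se)
cat("Approx. power to detect the interaction:", round(power_int, 2), "\n")
```

The resulting power is under 10 percent, so a null result on this interaction is essentially uninformative: detecting subgroup differences requires far larger samples than detecting the main effect.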


Step 5: Power Analysis

Before running an experiment, you need to determine the required sample size to detect a meaningful effect. Let us compute power for this training program.

# Power analysis
effect_size <- 3000
sd_outcome <- sd(df$earnings_post)
cohen_d <- effect_size / sd_outcome

# Required sample size
result <- power.t.test(delta = effect_size, sd = sd_outcome,
                      power = 0.80, sig.level = 0.05,
                      type = "two.sample")
cat("Cohen's d:", round(cohen_d, 3), "\n")
cat("Required N per group:", ceiling(result$n), "\n")
cat("Total N required:", ceiling(result$n) * 2, "\n")

# Power curve
n_seq <- seq(50, 1000, by = 25)
power_vals <- sapply(n_seq, function(n)
power.t.test(n = n, delta = effect_size, sd = sd_outcome,
             sig.level = 0.05)$power)

plot(n_seq * 2, power_vals, type = "l", lwd = 2, col = "blue",
   xlab = "Total Sample Size", ylab = "Power",
   main = "Power Curve for Job Training RCT")
abline(h = 0.8, col = "red", lty = 2)

Expected output:

Parameter                         Value
Effect size (dollars)             $3,000
SD of outcome                     ~$8,000
Cohen's d                         ~0.375
Required N per group (80% power)  ~112
Total N required                  ~224
Actual N in our experiment        1,000

Power at various sample sizes:

Total N  Power
100      ~0.46
200      ~0.76
224      ~0.80
500      ~0.99
1,000    ~1.00

Concept Check

You plan an RCT and your power analysis says you need 500 participants total. Your budget allows 600. A colleague suggests assigning 400 to treatment and 200 to control to 'learn more about the treatment group.' Is this a good idea?
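One way to reason about this: the variance of the difference-in-means scales with 1/n1 + 1/n0, so for a fixed total sample a 50/50 split minimizes the standard error. A quick sketch comparing the two allocations (using the ~$8,000 outcome SD from the power analysis above):

```r
# SE of the difference in means is proportional to sqrt(1/n1 + 1/n0).
sd_y <- 8000  # approximate outcome SD from the power analysis

se_unbalanced <- sd_y * sqrt(1/400 + 1/200)  # colleague's 400/200 split
se_balanced   <- sd_y * sqrt(1/300 + 1/300)  # even 300/300 split

cat("SE with 400/200 split:", round(se_unbalanced), "\n")
cat("SE with 300/300 split:", round(se_balanced), "\n")
```

The 400/200 split inflates the standard error by about 6 percent relative to the balanced design, so the colleague's proposal buys nothing statistically and costs precision.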


Step 6: Attrition and Compliance

Real experiments face attrition (participants dropping out) and non-compliance (participants not taking the treatment). Let us simulate and handle these issues.

# Simulate differential attrition
set.seed(123)
attrition_prob <- 0.08 + 0.04 * (1 - df$treat)
df$observed <- rbinom(n, 1, 1 - attrition_prob)

cat("Attrition (treatment):", 1 - mean(df$observed[df$treat == 1]), "\n")
cat("Attrition (control):", 1 - mean(df$observed[df$treat == 0]), "\n")

# Naive estimate on observed sample
df_obs <- df[df$observed == 1, ]
m_naive <- lm_robust(earnings_post ~ treat, data = df_obs, se_type = "HC2")
cat("Naive ATE:", round(coef(m_naive)["treat"]), "\n")

# For Lee bounds, see the leebounds package
# install.packages("leebounds")
# library(leebounds)
# leebounds(earnings_post ~ treat, data = df_obs)

Expected output:

Attrition rates:

Group       Attrition Rate  N Observed
Treatment   ~8%             ~460
Control     ~12%            ~440
Difference  ~4 pp

Lee (2009) bounds for the ATE:

Estimate                     Value
Naive ATE (observed sample)  ~$3,300
Lee lower bound              ~$2,600
Lee upper bound              ~$3,800

The control group has a higher attrition rate (about 12% vs. 8%), meaning the remaining control group may be positively selected. The Lee bounds trim the treatment group (the lower-attrition group) to equalize attrition rates, providing a worst-case interval for the true ATE. Because both bounds remain positive, we can be confident the treatment effect is positive even under differential attrition.
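The trimming logic just described can also be sketched by hand (a minimal version, assuming df and df_obs from the attrition code above; a packaged implementation would handle ties and inference more carefully):

```r
# Manual Lee (2009) bounds: trim the less-attrited arm (treatment) so both
# arms have equal observation rates, then bound the ATE from below and above.
q1 <- mean(df$observed[df$treat == 1])  # observation rate, treatment (~0.92)
q0 <- mean(df$observed[df$treat == 0])  # observation rate, control (~0.88)
p  <- (q1 - q0) / q1                    # share of treated outcomes to trim

y1 <- sort(df_obs$earnings_post[df_obs$treat == 1])  # ascending order
y0 <- df_obs$earnings_post[df_obs$treat == 0]
k  <- floor(p * length(y1))             # number of observations to drop

lee_lower <- mean(head(y1, length(y1) - k)) - mean(y0)  # drop top p share
lee_upper <- mean(tail(y1, length(y1) - k)) - mean(y0)  # drop bottom p share
cat("Lee bounds: [", round(lee_lower), ",", round(lee_upper), "]\n")
```

Dropping the highest treated outcomes yields the worst case (lower bound); dropping the lowest yields the best case (upper bound).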


Step 7: Exercises

Try these on your own:

  1. Stratified randomization. Modify the randomization to stratify by gender and education level. Show that stratified randomization produces better balance than simple randomization.

  2. Cluster randomization. Suppose treatment is assigned at the firm level (50 firms, 20 workers each). Simulate this design and compare the standard errors with individual-level randomization.

  3. Pre-analysis plan. Write a short pre-analysis plan specifying: (a) the primary outcome, (b) the estimand, (c) the estimation method, (d) the subgroups you will examine, and (e) how you will handle attrition.

  4. Multiple outcomes. Add two more outcome variables (employment status and hours worked) and apply a Bonferroni correction for multiple hypothesis testing.


Summary

In this lab you learned:

  • Random assignment provides a basis for unconfoundedness, but balance verification with a balance table is still important, as is addressing attrition or non-compliance
  • The simple difference in means is an unbiased estimator of the ATE in an RCT
  • Regression adjustment (especially the Lin estimator) can improve precision without introducing bias
  • Heterogeneous treatment effect analysis is best pre-specified to avoid multiple testing problems
  • Power analysis is conducted before the experiment to determine the required sample size
  • Differential attrition can reintroduce selection bias even in a well-designed RCT