MethodAtlas
Tutorial · 90 minutes

Lab: Experimental Design and RCTs

Design and analyze a randomized controlled trial step by step. Learn to randomize treatment, verify balance, estimate average treatment effects, explore heterogeneity, and conduct power analysis.

Overview

In this lab you will work through the full lifecycle of a randomized controlled trial (RCT) evaluating a hypothetical job training program. You will simulate random assignment, verify that randomization produced balanced groups, estimate the average treatment effect (ATE), test for heterogeneous effects, and conduct a power analysis.

What you will learn:

  • How to randomly assign treatment and verify covariate balance
  • How to estimate the ATE with and without regression adjustment
  • How to test for heterogeneous treatment effects by subgroup
  • How to conduct a power analysis for planning future experiments
  • Common pitfalls in experimental analysis

Prerequisites: Familiarity with hypothesis testing (t-tests, p-values) and basic regression (OLS).


Step 1: Simulate the Experimental Data

We simulate 1,000 participants in a job training RCT. Each participant has baseline characteristics (age, education, prior earnings) and is randomly assigned to treatment or control.

library(estimatr)
library(modelsummary)

set.seed(42)
n <- 1000

age <- round(rnorm(n, 35, 8))
educ <- sample(c(10, 12, 14, 16, 18), n, replace = TRUE,
             prob = c(0.1, 0.3, 0.25, 0.25, 0.1))
prior_earnings <- 20000 + 1500 * educ + 200 * age + rnorm(n, sd = 5000)
female <- rbinom(n, 1, 0.45)

treat <- rbinom(n, 1, 0.5)

tau_i <- 3000 + 500 * (educ < 14)
earnings_post <- 25000 + 1200 * educ + 150 * age - 2000 * female +
               tau_i * treat + rnorm(n, sd = 6000)

df <- data.frame(age, educ, female, prior_earnings, treat, earnings_post)

cat("Treatment:", sum(treat), " Control:", sum(1 - treat), "\n")
summary(df)

Expected output:

Sample data (first 5 rows):

age  educ  female  prior_earnings  treat  earnings_post
39   14    0       50,312          1      56,891
31   12    1       40,105          0      39,422
42   16    0       55,780          1      61,203
28   10    1       35,620          0      30,115
36   18    0       58,410          0      52,748

Summary statistics:

Variable        Mean    Std Dev  Min      Max
age             35.0    8.0      ~14      ~60
educ            13.8    2.4      10       18
female          0.45    0.50     0        1
prior_earnings  47,700  7,500    ~20,000  ~75,000
treat           0.50    0.50     0        1
earnings_post   47,600  8,000    ~15,000  ~80,000

Approximately 500 participants are assigned to treatment and 500 to control.


Step 2: Check Covariate Balance

Randomization should produce groups that look similar on pre-treatment characteristics. A balance table is standard in any experimental paper.

# Balance table
balance_vars <- c("age", "educ", "female", "prior_earnings")

balance <- data.frame(
  Variable = balance_vars,
  Treat = sapply(balance_vars, function(v) mean(df[df$treat == 1, v])),
  Control = sapply(balance_vars, function(v) mean(df[df$treat == 0, v])),
  p_value = sapply(balance_vars, function(v)
    t.test(df[df$treat == 1, v], df[df$treat == 0, v])$p.value)
)
balance$Diff <- balance$Treat - balance$Control
balance[-1] <- round(balance[-1], 3)  # round only the numeric columns
print(balance)

Expected output:

Variable        Treat Mean  Control Mean  Difference  p-value
age             35.1        34.9          0.2         0.74
educ            13.9        13.7          0.2         0.31
female          0.44        0.46          -0.02       0.58
prior_earnings  47,850      47,550        300         0.55

All p-values are well above 0.05, consistent with successful randomization: no covariate systematically differs between the treatment and control groups. Note that a balance table can support, but never prove, correct randomization.

Concept Check

You run balance checks on 20 baseline covariates and find that one has a p-value of 0.03. Should you be concerned about failed randomization?
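A complementary diagnostic worth knowing (a sketch, assuming the simulated df from Step 1): instead of running many separate t-tests, regress treatment assignment on all baseline covariates at once and read the joint F-test, which summarizes balance in a single statistic.

```r
# Joint balance test: regress treatment on all baseline covariates.
# Under successful randomization no covariate should predict assignment,
# so the overall F-test should be insignificant.
m_bal <- lm(treat ~ age + educ + female + prior_earnings, data = df)
f <- summary(m_bal)$fstatistic
p_joint <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
cat("Joint F-test p-value:", round(p_joint, 3), "\n")
```

With 20 covariates tested individually at the 5% level, about one "significant" imbalance is expected by chance alone; the joint test avoids that multiple-comparisons trap.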


Step 3: Estimate the Average Treatment Effect

The simplest ATE estimator in an RCT is the difference in means. Regression adjustment can improve precision.

# Method 1: Difference in means
ate_simple <- mean(df$earnings_post[df$treat == 1]) -
            mean(df$earnings_post[df$treat == 0])
cat("Simple difference in means:", round(ate_simple), "\n")

# Method 2: OLS with robust SEs
m1 <- lm_robust(earnings_post ~ treat, data = df, se_type = "HC2")
cat("OLS (no controls):", round(coef(m1)["treat"]),
  "(SE:", round(m1$std.error["treat"]), ")\n")

# Method 3: OLS with controls (Lin 2013)
df$age_c <- df$age - mean(df$age)
df$educ_c <- df$educ - mean(df$educ)
m2 <- lm_robust(earnings_post ~ treat * (age_c + educ_c + female),
              data = df, se_type = "HC2")
cat("OLS (Lin estimator):", round(coef(m2)["treat"]),
  "(SE:", round(m2$std.error["treat"]), ")\n")

Expected output:

Method                             ATE Estimate  Robust SE  95% CI
Simple difference in means         ~$3,200       ~$510      [$2,200, $4,200]
OLS (no controls)                  ~$3,200       ~$510      [$2,200, $4,200]
OLS with controls (Lin estimator)  ~$3,200       ~$460      [$2,300, $4,100]

The point estimates are nearly identical across all three methods — this is expected in a well-randomized experiment. Adding controls (the Lin estimator) does not change the coefficient but reduces the standard error by absorbing residual variation, yielding a tighter confidence interval.


Step 4: Test for Heterogeneous Treatment Effects

Does the training program work differently for different subgroups?

# Interaction with education level
df$low_educ <- as.integer(df$educ < 14)
m_het <- lm_robust(earnings_post ~ treat * low_educ, data = df, se_type = "HC2")
summary(m_het)

cat("ATE for high-educ:", coef(m_het)["treat"], "\n")
cat("Additional for low-educ:", coef(m_het)["treat:low_educ"], "\n")

# Interaction with gender
m_gender <- lm_robust(earnings_post ~ treat * female, data = df, se_type = "HC2")
cat("Gender interaction p-value:", summary(m_gender)$coefficients["treat:female", "Pr(>|t|)"], "\n")

Expected output:

Heterogeneous effects by education:

Subgroup                        ATE Estimate  SE     p-value
High education (educ >= 14)     ~$3,000       ~$600  < 0.001
Low education (educ < 14)       ~$3,500       ~$700  < 0.001
Interaction (treat x low_educ)  ~$500         ~$900  ~0.55

Heterogeneous effects by gender:

Subgroup             ATE Estimate  p-value on interaction
Male (female = 0)    ~$3,200
Female (female = 1)  ~$3,200       ~0.90 (not significant)

The DGP builds in a $500 bonus effect for low-education participants (tau_i = 3000 + 500 * (educ < 14)), but with this sample size the interaction term is not statistically significant. The gender interaction is near zero because the DGP includes no differential treatment effect by gender.
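To see why the built-in $500 interaction goes undetected, we can approximate the power of a two-sided test of the interaction term from its standard error (a back-of-the-envelope sketch using the ~$900 SE reported above and a normal approximation):

```r
# Approximate power for a two-sided 5% test of the interaction:
# P(|estimate / SE| > 1.96) when the true effect is $500 and SE is ~$900.
delta <- 500
se    <- 900
power_int <- pnorm(-1.96 + delta / se) + pnorm(-1.96 - delta / se)
cat("Approx. power to detect the interaction:", round(power_int, 2), "\n")
```

The resulting power is under 10 percent, so a null result on this interaction is essentially uninformative: detecting subgroup differences requires far larger samples than detecting the main effect.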


Step 5: Power Analysis

Before running an experiment, you need to determine the required sample size to detect a meaningful effect. Let us compute power for this training program.

# Power analysis
effect_size <- 3000
sd_outcome <- sd(df$earnings_post)
cohen_d <- effect_size / sd_outcome

# Required sample size
result <- power.t.test(delta = effect_size, sd = sd_outcome,
                      power = 0.80, sig.level = 0.05,
                      type = "two.sample")
cat("Cohen's d:", round(cohen_d, 3), "\n")
cat("Required N per group:", ceiling(result$n), "\n")
cat("Total N required:", ceiling(result$n) * 2, "\n")

# Power curve
n_seq <- seq(50, 1000, by = 25)
power_vals <- sapply(n_seq, function(n)
power.t.test(n = n, delta = effect_size, sd = sd_outcome,
             sig.level = 0.05)$power)

plot(n_seq * 2, power_vals, type = "l", lwd = 2, col = "blue",
   xlab = "Total Sample Size", ylab = "Power",
   main = "Power Curve for Job Training RCT")
abline(h = 0.8, col = "red", lty = 2)

Expected output:

Parameter                         Value
Effect size (dollars)             $3,000
SD of outcome                     ~$8,000
Cohen's d                         ~0.375
Required N per group (80% power)  ~112
Total N required                  ~224
Actual N in our experiment        1,000

Power at various sample sizes:

Total N  Power
100      ~0.46
200      ~0.76
224      ~0.80
500      ~0.99
1,000    ~1.00

Concept Check

You plan an RCT and your power analysis says you need 500 participants total. Your budget allows 600. A colleague suggests assigning 400 to treatment and 200 to control to 'learn more about the treatment group.' Is this a good idea?
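One way to reason about this: the variance of the difference-in-means scales with 1/n1 + 1/n0, so for a fixed total sample a 50/50 split minimizes the standard error. A quick sketch comparing the two allocations (using the ~$8,000 outcome SD from the power analysis above):

```r
# SE of the difference in means is proportional to sqrt(1/n1 + 1/n0).
sd_y <- 8000  # approximate outcome SD from the power analysis

se_unbalanced <- sd_y * sqrt(1/400 + 1/200)  # colleague's 400/200 split
se_balanced   <- sd_y * sqrt(1/300 + 1/300)  # even 300/300 split

cat("SE with 400/200 split:", round(se_unbalanced), "\n")
cat("SE with 300/300 split:", round(se_balanced), "\n")
```

The 400/200 split inflates the standard error by about 6 percent relative to the balanced design, so the colleague's proposal buys nothing statistically and costs precision.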


Step 6: Attrition and Compliance

Real experiments face attrition (participants dropping out) and non-compliance (participants not taking the treatment). Let us simulate and handle these issues.

# Simulate differential attrition
set.seed(123)
attrition_prob <- 0.08 + 0.04 * (1 - df$treat)
df$observed <- rbinom(n, 1, 1 - attrition_prob)

cat("Attrition (treatment):", 1 - mean(df$observed[df$treat == 1]), "\n")
cat("Attrition (control):", 1 - mean(df$observed[df$treat == 0]), "\n")

# Naive estimate on observed sample
df_obs <- df[df$observed == 1, ]
m_naive <- lm_robust(earnings_post ~ treat, data = df_obs, se_type = "HC2")
cat("Naive ATE:", round(coef(m_naive)["treat"]), "\n")

# For Lee bounds, see the leebounds package
# install.packages("leebounds")
# library(leebounds)
# leebounds(earnings_post ~ treat, data = df_obs)

Expected output:

Attrition rates:

Group       Attrition Rate  N Observed
Treatment   ~8%             ~460
Control     ~12%            ~440
Difference  ~4 pp

Lee (2009) bounds for the ATE:

Estimate                     Value
Naive ATE (observed sample)  ~$3,300
Lee lower bound              ~$2,600
Lee upper bound              ~$3,800

The control group has a higher attrition rate (about 12% vs. 8%), meaning the remaining control group may be positively selected. The Lee bounds trim the treatment group (the lower-attrition group) to equalize attrition rates, providing a worst-case interval for the true ATE. Because both bounds remain positive, we can be confident the treatment effect is positive even under differential attrition.
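The trimming logic just described can also be sketched by hand (a minimal version, assuming df and df_obs from the attrition code above; a packaged implementation would handle ties and inference more carefully):

```r
# Manual Lee (2009) bounds: trim the less-attrited arm (treatment) so both
# arms have equal observation rates, then bound the ATE from below and above.
q1 <- mean(df$observed[df$treat == 1])  # observation rate, treatment (~0.92)
q0 <- mean(df$observed[df$treat == 0])  # observation rate, control (~0.88)
p  <- (q1 - q0) / q1                    # share of treated outcomes to trim

y1 <- sort(df_obs$earnings_post[df_obs$treat == 1])  # ascending order
y0 <- df_obs$earnings_post[df_obs$treat == 0]
k  <- floor(p * length(y1))             # number of observations to drop

lee_lower <- mean(head(y1, length(y1) - k)) - mean(y0)  # drop top p share
lee_upper <- mean(tail(y1, length(y1) - k)) - mean(y0)  # drop bottom p share
cat("Lee bounds: [", round(lee_lower), ",", round(lee_upper), "]\n")
```

Dropping the highest treated outcomes yields the worst case (lower bound); dropping the lowest yields the best case (upper bound).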


Step 7: Exercises

Try these on your own:

  1. Stratified randomization. Modify the randomization to stratify by gender and education level. Show that stratified randomization produces better balance than simple randomization.

  2. Cluster randomization. Suppose treatment is assigned at the firm level (50 firms, 20 workers each). Simulate this design and compare the standard errors with individual-level randomization.

  3. Pre-analysis plan. Write a short pre-analysis plan specifying: (a) the primary outcome, (b) the estimand, (c) the estimation method, (d) the subgroups you will examine, and (e) how you will handle attrition.

  4. Multiple outcomes. Add two more outcome variables (employment status and hours worked) and apply a Bonferroni correction for multiple hypothesis testing.


Summary

In this lab you learned:

  • Random assignment provides a basis for unconfoundedness, but balance verification with a balance table is still important, as is addressing attrition or non-compliance
  • The simple difference in means is an unbiased estimator of the ATE in an RCT
  • Regression adjustment (especially the Lin estimator) can improve precision without introducing bias
  • Heterogeneous treatment effect analysis is best pre-specified to avoid multiple testing problems
  • Power analysis is conducted before the experiment to determine the required sample size
  • Differential attrition can reintroduce selection bias even in a well-designed RCT