Experimental Design
The benchmark for causal inference — random assignment eliminates selection bias by design.
Quick Reference
- When to Use
- When you can randomly assign treatment to units, or when a natural lottery creates as-if random assignment. The benchmark for all other causal inference methods.
- Key Assumption
- Random assignment (treatment independent of potential outcomes), SUTVA (no interference between units), and excludability (assignment affects outcomes only through treatment). With noncompliance, monotonicity is also needed for LATE.
- Common Mistake
- Conflating the intent-to-treat (ITT) effect with the treatment-on-the-treated (TOT) effect when there is noncompliance, or ignoring differential attrition between treatment and control groups.
- Estimated Time
- 2 hours
One-Line Implementation
- Stata: `reg outcome treatment, vce(robust)`
- R: `feols(outcome ~ treatment, data = df, vcov = "HC1")`
- Python: `smf.ols('outcome ~ treatment', data=df).fit(cov_type='HC1')`
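Each one-liner requests heteroskedasticity-robust (HC1) standard errors. As a reference for what that option computes, here is a minimal numpy sketch of OLS with HC1 errors on simulated data — an illustration of the formula, not the downloadable scripts:

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 (heteroskedasticity-robust) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Sandwich" variance: bread = (X'X)^-1, meat = X' diag(e^2) X, times n/(n-k)
    meat = X.T @ (X * resid[:, None] ** 2)
    V = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

# Simulated experiment with deliberately heteroskedastic noise
rng = np.random.default_rng(5)
n = 5000
treatment = rng.integers(0, 2, n)
outcome = 1.0 * treatment + rng.normal(0, 1 + treatment, n)

X = np.column_stack([np.ones(n), treatment])
beta, se = ols_hc1(X, outcome)   # beta[1] estimates the effect; se[1] is robust
```

The robust variance matters here because the noise variance differs by treatment status, which violates the homoskedasticity assumption behind classical OLS standard errors.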
Motivating Example: The Oregon Health Insurance Experiment
In 2008, Oregon expanded its Medicaid program but had far more applicants than slots. The state held a lottery — a literal random draw — to decide who would get the opportunity to enroll. The lottery created one of the most important experiments in health economics.
The researchers (Finkelstein et al., 2012) could compare lottery winners (who were offered insurance) to lottery losers. Because assignment was random, the two groups were identical in expectation on every dimension — income, health status, education, motivation, everything. Any difference in outcomes could be attributed to the insurance offer itself.
This balance is the power of experimental design. You do not need to measure and control for every confounder. Randomization handles it for you.
But here is the catch: not everyone who won the lottery actually enrolled in Medicaid. And this non-compliance creates a gap between what was randomly assigned (the offer) and what was actually received (the insurance). Understanding this gap is one of the central lessons of this page.
A. Overview: Why Experiments Are the Benchmark for Causal Inference
The average treatment effect (ATE) is defined as:

$$\tau_{ATE} = E[Y_i(1) - Y_i(0)]$$

The fundamental problem is that we never observe both potential outcomes for the same unit. But random assignment solves the comparison problem — it eliminates selection bias in expectation, which is why causal inference requires careful research design. When treatment is randomly assigned:

$$(Y_i(1), Y_i(0)) \perp\!\!\!\perp D_i$$

In plain language: the average untreated outcome is the same for the treated group and the control group, $E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0]$. The control group is a valid stand-in for the counterfactual. A simple difference in means recovers the ATE:

$$\hat{\tau} = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \tau_{ATE}$$
This estimator is exactly equivalent to running OLS with a single treatment dummy. The regression just gives you standard errors for free.
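This equivalence is easy to verify numerically. A minimal simulation (hypothetical data with a true effect of 2.0) showing that the difference in means and the OLS coefficient on the treatment dummy coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
d = rng.integers(0, 2, n)            # random assignment: 0/1 coin flip
y = 2.0 * d + rng.normal(0, 1, n)    # true treatment effect = 2.0

# Difference in means between treated and control
diff_means = y[d == 1].mean() - y[d == 0].mean()

# OLS of y on an intercept and the treatment dummy
X = np.column_stack([np.ones(n), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Identical up to floating-point error
assert np.isclose(diff_means, beta[1])
```

The regression form is usually preferred in practice only because it delivers standard errors (and robust variants) alongside the point estimate.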
The Three Pillars of a Good Experiment
- Random assignment — units are allocated to treatment and control by a mechanism the researcher controls.
- No interference — one unit's treatment does not affect another unit's outcome (the Stable Unit Treatment Value Assumption, or SUTVA).
- Excludability — the assignment mechanism affects outcomes only through the treatment itself, not through other channels.
Common Confusions
B. Identification: What Makes Randomization Work
The Mechanics of Randomization
Randomization creates comparable groups through a simple but powerful mechanism: it makes treatment assignment statistically independent of potential outcomes.
This independence means there is no selection bias — a concept explored in depth on the selection bias foundations page. The people in the treatment group are, on average, identical to those in the control group in every way — observed and unobserved.
Intent-to-Treat (ITT)
When you compare outcomes by assignment (regardless of whether subjects actually took up the treatment), you get the ITT:

$$ITT = E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]$$

where $Z_i$ is the random assignment indicator. Under intact randomization and no differential attrition, the ITT is a valid causal effect. It answers: "What is the effect of being assigned to the treatment group?"
LATE for Non-Compliance
In the Oregon experiment, assignment $Z_i$ was winning the lottery, but treatment $D_i$ (actually enrolling in Medicaid) was a choice. Some winners did not enroll (never-takers), and in principle, some non-winners might have found other ways to enroll (always-takers).
Using the lottery as an instrument for actual enrollment, you can estimate the local average treatment effect (LATE):

$$LATE = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]} = \frac{ITT_Y}{ITT_D}$$

This expression is the Wald ratio: reduced form ÷ first stage. The numerator ($ITT_Y$) is the reduced-form effect of the instrument on the outcome $Y_i$; the denominator ($ITT_D$) is the first-stage effect of $Z_i$ on treatment take-up $D_i$. The ratio gives you the causal effect of treatment for compliers — those whose treatment status was actually changed by the random assignment.
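To make the ratio concrete, here is a small simulation with hypothetical numbers chosen to echo the Oregon setting — 25% compliers and a true effect of −0.20 for takers — in which the Wald ratio recovers the LATE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z = rng.integers(0, 2, n)                # lottery: random assignment
complier = rng.random(n) < 0.25          # 25% compliers, the rest never-takers
d = (z == 1) & complier                  # take-up requires winning AND complying
y = -0.20 * d + rng.normal(0, 0.5, n)    # true effect for takers = -0.20

itt_y = y[z == 1].mean() - y[z == 0].mean()   # reduced form (ITT_Y)
itt_d = d[z == 1].mean() - d[z == 0].mean()   # first stage (ITT_D), about 0.25
late = itt_y / itt_d                          # Wald ratio, about -0.20
```

Note how the ITT on the outcome is only about a quarter of the LATE here: diluting a −0.20 effect over a population where only 25% take up the treatment yields a reduced form of roughly −0.05.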
C. Visual Intuition
Think of randomization as a shuffling machine. You take your sample of people — with all their differences in motivation, ability, health, income — and you shuffle them into two groups completely at random. Each group ends up being a miniature copy of the other, on average.
The key visual: imagine a balance scale. Before randomization, the treatment group could be heavier on one side (more motivated people, higher income, whatever). After randomization, the scale is balanced — not perfectly for any single experiment, but in expectation across repeated randomizations.
This expectation is why balance tables matter. If your randomization worked, the treatment and control groups should look similar on all observed characteristics. A balance table lets you verify this expectation.
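A balance check can be sketched in a few lines. In this simulation (hypothetical covariates), assignment ignores the covariates, so the standardized differences should be near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
# Pre-treatment covariates, drawn before assignment
age = rng.normal(40, 10, n)
income = rng.lognormal(10, 0.5, n)
d = rng.integers(0, 2, n)   # randomization is independent of the covariates

# Standardized difference for each covariate: (mean_T - mean_C) / sd
for name, x in [("age", age), ("income", income)]:
    std_diff = (x[d == 1].mean() - x[d == 0].mean()) / x.std()
    print(f"{name}: standardized difference = {std_diff:+.3f}")
```

In any single experiment the differences will not be exactly zero — that is sampling noise, not a failed randomization. Large or systematic imbalances are what should trigger concern.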
Treatment Effect Under Randomization
As sample size grows, the estimated treatment effect converges to the true effect and the p-value shrinks. Noise inflates sampling variance but randomization keeps the estimator unbiased.
Computed Results
- Estimated Effect (mean)
- 3.00
- Std. Error of Estimator
- 0.707
- t-statistic
- 4.24
Wald Ratio Builder
The Local Average Treatment Effect (LATE) equals the reduced-form effect on the outcome (ITT_Y) divided by the first-stage effect on treatment take-up (ITT_D). Drag the sliders to see how the Wald ratio changes.
Intent-to-treat effect on the outcome Y
Share of compliers (first-stage coefficient)
Wald Ratio
LATE = ITT_Y / ITT_D
2.00 / 0.70 = 2.857
Key insight: The Wald estimator rescales the reduced-form ITT effect by the compliance rate. If only 50% of people assigned to treatment actually take it (ITT_D = 0.5), the LATE is twice the ITT. This identifies the causal effect for compliers only — those whose treatment status is affected by the instrument. When the first stage is weak, the denominator is near zero, inflating both the point estimate and its variance.
Why Randomization?
DGP: Y = 2.0·D + 2·U + ε. In the randomized arm, D is coin-flip assigned. In the observational arm, P(D=1) = sigmoid(1.5·U), so U confounds the treatment-outcome relationship. N = 200 per arm.
Estimation Results
| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| OLS (randomized) | 2.219 | 0.320 | [1.59, 2.85] | +0.219 |
| OLS (self-selected) | 4.039 | 0.250 | [3.55, 4.53] | +2.039 |
| True β | 2.000 | — | — | — |
Observations in each arm
The causal effect of D on Y
How strongly U drives self-selection into treatment (0 = no confounding)
Why the difference?
With self-selection (confounding = 1.5), OLS is biased by +2.04 because people who choose to be treated have systematically different unobserved characteristics that also affect the outcome. Randomization eliminates this bias entirely: by assigning treatment randomly, it guarantees that treated and control groups are comparable on average (OLS bias = +0.22).
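The two arms can be reproduced directly from the stated DGP (Y = 2·D + 2·U + ε, self-selection via sigmoid(1.5·U)); exact estimates will differ from the table because the random draws differ:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
u = rng.normal(0, 1, n)        # unobserved confounder U
eps = rng.normal(0, 1, n)

# Randomized arm: coin-flip assignment, independent of U
d_rand = rng.integers(0, 2, n)
y_rand = 2.0 * d_rand + 2.0 * u + eps

# Observational arm: P(D=1) = sigmoid(1.5 * U), so high-U units self-select in
p_treat = 1.0 / (1.0 + np.exp(-1.5 * u))
d_self = rng.random(n) < p_treat
y_self = 2.0 * d_self + 2.0 * u + eps

b_rand = y_rand[d_rand == 1].mean() - y_rand[d_rand == 0].mean()  # near 2.0
b_self = y_self[d_self].mean() - y_self[~d_self].mean()           # biased well above 2
```

The randomized comparison is unbiased because U is balanced across arms by construction; the self-selected comparison absorbs the difference in U between choosers and non-choosers into the estimate.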
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: Random assignment makes the treated and control groups identical in expectation, so a simple comparison of group averages recovers the true causal effect.
Start with the observed difference in means:

$$\Delta = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$$

By the switching equation, the observed outcome is $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$. So:

$$\Delta = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]$$

Now add and subtract $E[Y_i(0) \mid D_i = 1]$:

$$\Delta = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{ATT} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}$$

Under random assignment, $E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0]$, so the selection bias term is zero:

$$\Delta = E[Y_i(1) - Y_i(0) \mid D_i = 1] = ATT$$

Therefore the difference in means identifies the ATT. The ATT also equals the ATE under random assignment, because treatment is independent of potential outcomes: $E[Y_i(1) - Y_i(0) \mid D_i = 1] = E[Y_i(1) - Y_i(0)]$.
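The decomposition can be checked numerically. In this sketch (a hypothetical DGP with a constant effect of 1.0 and deliberate self-selection), the observed difference in means equals ATT plus selection bias exactly, draw by draw:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
u = rng.normal(0, 1, n)
y0 = u + rng.normal(0, 1, n)    # potential outcome without treatment
y1 = y0 + 1.0                   # constant treatment effect of 1.0

# Self-selection: higher-u units are more likely to take treatment
d = rng.random(n) < 1.0 / (1.0 + np.exp(-u))
y = np.where(d, y1, y0)         # switching equation

diff = y[d].mean() - y[~d].mean()
att = (y1 - y0)[d].mean()                     # = 1.0 by construction
sel_bias = y0[d].mean() - y0[~d].mean()       # E[Y(0)|D=1] - E[Y(0)|D=0], positive here

# The identity holds exactly in the sample, not just in expectation
assert np.isclose(diff, att + sel_bias)
```

With self-selection the bias term is strictly positive, so the naive difference overstates the true effect; replacing the selection rule with a coin flip would drive the bias term to zero in expectation.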
E. Implementation
library(fixest)
library(modelsummary)
# ---- Balance table ----
# Regress each covariate on treatment to verify randomization worked
balance_vars <- c("age", "female", "income", "education")
bal <- lapply(balance_vars, function(v) {
feols(as.formula(paste(v, "~ treatment")), data = df)
})
# Display balance results; small coefficients = good randomization
modelsummary(bal, stars = TRUE)
# ---- ITT estimate ----
# Simple regression of outcome on treatment assignment
# vcov = "HC1" gives heteroskedasticity-robust standard errors
itt <- feols(outcome ~ treatment, data = df, vcov = "HC1")
summary(itt)
# ---- LATE via IV (for non-compliance) ----
# outcome ~ exogenous | fixed_effects | endogenous ~ instrument
# Uses assignment as instrument for actual takeup
late <- feols(outcome ~ 1 | 0 | takeup ~ assignment, data = df, vcov = "HC1")
summary(late)
F. Diagnostics and Robustness Checks
Balance Checks
An essential diagnostic for any experiment. Compare pre-treatment covariates across treatment and control groups. Report:
- Group means and standard deviations
- Difference and its p-value (or standardized difference)
- An F-test for joint significance of all covariates predicting treatment
Attrition Checks
Attrition (people dropping out of the study) is only a problem if it is differential — if treatment causes people to leave the sample at different rates. Check:
- Is the attrition rate similar across treatment and control?
- Among non-attritors, is balance still maintained?
- Consider Lee bounds for worst-case scenarios.
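Under a monotonicity assumption on attrition, Lee bounds trim the less-attrited arm's outcome distribution by the excess retention share. A minimal sketch of the idea (a hypothetical `lee_bounds` helper, not a packaged implementation):

```python
import numpy as np

def lee_bounds(y_treat, y_ctrl, trim_share):
    """Bound the treatment effect when the treatment arm retains an excess
    share `trim_share` of its sample relative to control."""
    y_sorted = np.sort(y_treat)
    k = int(round(trim_share * len(y_sorted)))
    # Worst case: the extra retainees were the highest-outcome units -> drop top k
    lower = y_sorted[: len(y_sorted) - k].mean() - y_ctrl.mean()
    # Best case: the extra retainees were the lowest-outcome units -> drop bottom k
    upper = y_sorted[k:].mean() - y_ctrl.mean()
    return lower, upper

rng = np.random.default_rng(6)
y_t = rng.normal(1.0, 1.0, 1000)   # treated outcomes among non-attritors
y_c = rng.normal(0.0, 1.0, 1000)   # control outcomes among non-attritors
lo, hi = lee_bounds(y_t, y_c, trim_share=0.05)   # interval around the effect
```

The width of the interval grows with the attrition gap: small differential attrition yields tight bounds, while large gaps can make the bounds uninformative.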
Compliance Checks
Report the first-stage compliance rate: what fraction of the assigned-to-treatment group actually received treatment? A first-stage below 100% means you need to decide between ITT and LATE.
Interpreting Results
- The ITT is a policy-relevant parameter: it tells you what happens when you roll out an intervention in practice, including non-compliance.
- The LATE tells you what the treatment does for people who actually take it up, but it only applies to compliers.
- If compliance is near 100%, ITT and LATE are essentially the same.
- Always report the ITT. Report the LATE as a complement, not a replacement.
G. What Can Go Wrong
| Threat | What It Does | How to Diagnose |
|---|---|---|
| Non-compliance | Creates a gap between assignment and receipt | Report compliance rates; use LATE/IV |
| Attrition | Breaks random assignment if differential | Compare attrition rates; Lee bounds |
| Spillovers (SUTVA violation) | Treatment affects control group outcomes | Look for evidence of contamination; use designs that minimize contact |
| Hawthorne effects | Subjects change behavior because they know they are observed | Use double-blind designs; compare to administrative data |
| Demand effects | Subjects figure out the hypothesis and behave accordingly | Careful framing; use deception where ethical |
| Low power | Fail to detect real effects | Pre-registration with power analysis |
Differential Attrition
Attrition is 8% in treatment and 9% in control (no significant difference), and balance is maintained among non-attritors.
ITT estimate: -0.05 ER visits (SE = 0.02). Lee bounds: [-0.08, -0.02]. Attrition does not threaten internal validity.
Non-Compliance Ignored in Analysis
Compliance is 25%. ITT is reported as the primary estimate; LATE is computed via IV using assignment as an instrument for take-up.
ITT = -0.05 ER visits. LATE (for compliers) = -0.20. Both estimates are clearly labeled and interpreted.
SUTVA Violation (Spillovers)
Treatment and control groups are in separate villages with no interaction, so one group's treatment does not affect the other's outcomes.
ITT = 0.15 SD improvement in test scores. No evidence of contamination between groups.
In the Oregon Health Insurance Experiment, about 25% of lottery winners actually enrolled in Medicaid. If the ITT estimate of the effect on emergency room visits is -0.05, what is the LATE?
H. Practice
A researcher runs an RCT but 30% of the treatment group does not take up the intervention. She drops non-compliers from the treatment group and compares the remaining treated individuals to the full control group. What is the problem?
In a cluster-randomized trial, 50 villages are assigned to treatment and 50 to control. A child in a treated village plays with untreated children from a neighboring control village, and the intervention's benefits spill over. What assumption is violated?
An experiment randomizes 500 students to tutoring (250) or control (250). After 6 months, 60 students in the treatment group and 15 in the control group have left the study. The researcher reports the ITT using only the remaining students. Should you be concerned?
A firm randomizes which customers receive a discount coupon. Customers who receive the coupon share it with their friends (who are in the control group). What is the likely effect on the ITT estimate?
You run an RCT of a tutoring program on test scores. 200 students are randomly assigned: 100 to tutoring, 100 to control. Of the 100 assigned to tutoring, 80 actually attend. The average test score in the treatment group (all 100) is 78 and in the control group is 72.
Calculate the ITT, the first-stage compliance rate, and the LATE.
Read the analysis below carefully and identify the errors.
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
The authors study whether providing information about calorie content at restaurants reduces calorie consumption. They randomize 80 restaurants in a large city: 40 display prominent calorie labels on menus, 40 serve as controls. After 6 months, they survey customers exiting each restaurant about their meal choices. They find that calorie labeling reduces average calories ordered by 45 kcal (SE = 18, p = 0.013). The first stage shows 95% compliance (38 of 40 treatment restaurants displayed labels). They report only the ITT.
Key Table
| Variable | Coefficient | SE | p-value |
|---|---|---|---|
| Assigned to labeling | -45.2 | 18.1 | 0.013 |
| Customer age | 2.1 | 0.8 | 0.009 |
| Customer female | -82.3 | 15.4 | 0.000 |
| Weekend visit | 67.8 | 14.2 | 0.000 |
| Restaurant FE | No | ||
| Clustered SEs | Restaurant | ||
| N (customers) | 12,400 |
Authors' Identification Claim
Random assignment of calorie labeling across restaurants ensures that the treatment and control groups are comparable in expectation, yielding an unbiased estimate of the effect of calorie information on ordering behavior.
I. Swap-In: When to Use Something Else
If randomization is infeasible (ethical constraints, cost, or lack of control), the closest alternatives are:
- Natural experiments — situations where nature or policy creates as-if random assignment. See IV / 2SLS and Regression Discontinuity.
- Matching — construct a comparison group that looks similar on observables.
- Difference-in-differences — exploit a policy change that affects some groups but not others.
For any of these approaches, sensitivity analysis is essential for assessing how robust your conclusions are to potential violations of identifying assumptions. The further you move from randomization, the more assumptions you need, and the less credible your causal claims become. But a well-designed quasi-experiment often beats a poorly executed RCT.