MethodAtlas
Design-Based · Established

Experimental Design

The benchmark for causal inference — random assignment eliminates selection bias by design.

Quick Reference

When to Use
When you can randomly assign treatment to units, or when a natural lottery creates as-if random assignment. The benchmark for all other causal inference methods.
Key Assumption
Random assignment (treatment independent of potential outcomes), SUTVA (no interference between units), and excludability (assignment affects outcomes only through treatment). With noncompliance, monotonicity is also needed for LATE.
Common Mistake
Conflating the intent-to-treat (ITT) effect with the treatment-on-the-treated (TOT) effect when there is noncompliance, or ignoring differential attrition between treatment and control groups.
Estimated Time
2 hours

One-Line Implementation

Stata: reg outcome treatment, vce(robust)
R: feols(outcome ~ treatment, data = df, vcov = 'HC1')
Python: smf.ols('outcome ~ treatment', data=df).fit(cov_type='HC1')


Motivating Example: The Oregon Health Insurance Experiment

In 2008, Oregon expanded its Medicaid program but had far more applicants than slots. The state held a lottery — a literal random draw — to decide who would get the opportunity to enroll. The lottery created one of the most important experiments in health economics.

(Finkelstein et al., 2012)

The researchers could compare lottery winners (who were offered insurance) to lottery losers. Because assignment was random, the two groups were identical in expectation on every dimension — income, health status, education, motivation, everything. Any difference in outcomes could be attributed to the insurance offer itself.

This balance is the power of experimental design. You do not need to measure and control for every confounder. Randomization handles it for you.

But here is the catch: not everyone who won the lottery actually enrolled in Medicaid. And this non-compliance creates a gap between what was randomly assigned (the offer) and what was actually received (the insurance). Understanding this gap is one of the central lessons of this page.


A. Overview: Why Experiments Are the Benchmark for Causal Inference

The average treatment effect (ATE) is defined as:

\text{ATE} = E[Y_i(1) - Y_i(0)]

The fundamental problem of causal inference is that we never observe both potential outcomes for the same unit. Random assignment solves this comparison problem: it eliminates selection bias in expectation, which is the core reason careful research design matters for causal inference. When treatment is randomly assigned:

E[Y_i(0) | D_i = 1] = E[Y_i(0) | D_i = 0]

In plain language: the average untreated outcome is the same for the treated group and the control group. The control group is a valid stand-in for the counterfactual. A simple difference in means recovers the ATE:

\hat{\tau} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}

This estimator is exactly equivalent to running OLS with a single treatment dummy. The regression just gives you standard errors for free.
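The equivalence is easy to verify numerically. Below is a minimal sketch with simulated data (the variable names and effect sizes are illustrative assumptions, not figures from any study): the OLS slope on a treatment dummy reproduces the difference in means.

```python
import numpy as np

# Hypothetical data: outcomes for a treated and a control group
rng = np.random.default_rng(0)
y_treated = rng.normal(5.0, 1.0, size=500)
y_control = rng.normal(2.0, 1.0, size=500)

# Difference in means
diff_means = y_treated.mean() - y_control.mean()

# OLS of Y on an intercept and a treatment dummy
y = np.concatenate([y_treated, y_control])
d = np.concatenate([np.ones(500), np.zeros(500)])
X = np.column_stack([np.ones_like(d), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # [intercept, slope]

# The slope equals the difference in means (up to floating point)
assert np.isclose(beta[1], diff_means)
print(diff_means)
```

The intercept recovers the control-group mean, so the regression contains exactly the same information as the two group averages.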

The Three Pillars of a Good Experiment

  1. Random assignment — units are allocated to treatment and control by a mechanism the researcher controls.
  2. No interference — one unit's treatment does not affect another unit's outcome (the Stable Unit Treatment Value Assumption, or SUTVA).
  3. Excludability — the assignment mechanism affects outcomes only through the treatment itself, not through other channels.



B. Identification: What Makes Randomization Work

The Mechanics of Randomization

Randomization creates comparable groups through a simple but powerful mechanism: it makes treatment assignment statistically independent of potential outcomes.

D_i \perp\!\!\!\perp (Y_i(1), Y_i(0))

This independence means there is no selection bias — a concept explored in depth on the selection bias foundations page. The people in the treatment group are, on average, identical to those in the control group in every way — observed and unobserved.

Intent-to-Treat (ITT)

When you compare outcomes by assignment (regardless of whether subjects actually took up the treatment), you get the ITT:

\text{ITT} = E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]

where Z_i is the random assignment indicator. Under intact randomization and no differential attrition, the ITT is a valid causal effect. It answers: "What is the effect of being assigned to the treatment group?"

LATE for Non-Compliance

In the Oregon experiment, Z_i was winning the lottery, but D_i (actually enrolling in Medicaid) was a choice. Some winners did not enroll (never-takers), and in principle, some non-winners might have found other ways to enroll (always-takers).

Using the lottery as an instrument for actual enrollment, you can estimate the Local Average Treatment Effect (LATE):

\text{LATE} = \frac{\text{ITT}_Y}{\text{ITT}_D} = \frac{E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]}{E[D_i | Z_i = 1] - E[D_i | Z_i = 0]}

This expression is the Wald ratio: reduced form ÷ first stage. The numerator (ITT_Y) is the reduced-form effect of the instrument Z on the outcome Y; the denominator (ITT_D) is the first-stage effect of Z on treatment take-up D. The ratio gives you the causal effect of treatment for compliers — those whose treatment status was actually changed by the random assignment.
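The Wald ratio can be computed directly from three sample means on each side. Here is a minimal sketch on toy data (the `wald_late` helper and the numbers are illustrative assumptions, not Oregon estimates):

```python
import numpy as np

def wald_late(y, d, z):
    """LATE as reduced form / first stage, using assignment z as the instrument."""
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    itt_y = y[z == 1].mean() - y[z == 0].mean()   # reduced form (ITT on outcome)
    itt_d = d[z == 1].mean() - d[z == 0].mean()   # first stage (compliance rate)
    return itt_y, itt_d, itt_y / itt_d

# Toy data: 4 units assigned (z = 1), 4 not; half of the assigned take up
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]    # 50% compliance, no always-takers
y = [10, 12, 6, 8, 6, 8, 6, 8]  # assigned group higher on average
itt_y, itt_d, late = wald_late(y, d, z)
print(itt_y, itt_d, late)  # 2.0 0.5 4.0
```

With 50% compliance the LATE (4.0) is exactly twice the ITT (2.0), previewing the rescaling logic discussed in the simulation below.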


C. Visual Intuition

Think of randomization as a shuffling machine. You take your sample of people — with all their differences in motivation, ability, health, income — and you shuffle them into two groups completely at random. Each group ends up being a miniature copy of the other, on average.

The key visual: imagine a balance scale. Before randomization, the treatment group could be heavier on one side (more motivated people, higher income, whatever). After randomization, the scale is balanced — not perfectly for any single experiment, but in expectation across repeated randomizations.

This expectation is why balance tables matter. If your randomization worked, the treatment and control groups should look similar on all observed characteristics. A balance table lets you verify this expectation.

Interactive Simulation

Treatment Effect Under Randomization

As sample size grows, the estimated treatment effect converges to the true effect and the p-value shrinks. Noise inflates sampling variance but randomization keeps the estimator unbiased.

[Figure: permutation null distribution of the test statistic (difference in means), with the observed statistic (3.176) marked and the tail (p-value region) shaded; p = 0.007.]

Computed Results

Estimated Effect (mean)
3.00
Std. Error of Estimator
0.707
t-statistic
4.24
Interactive Simulation

Wald Ratio Builder

The Local Average Treatment Effect (LATE) equals the reduced-form effect on the outcome (ITT_Y) divided by the first-stage effect on treatment take-up (ITT_D). Drag the sliders to see how the Wald ratio changes.

ITT_Y = 2.00 (intent-to-treat effect on the outcome Y)

ITT_D = 0.70 (share of compliers, first-stage coefficient)

Wald ratio: LATE = ITT_Y / ITT_D = 2.00 / 0.70 = 2.857

Key insight: The Wald estimator rescales the reduced-form ITT effect by the compliance rate. If only 50% of people assigned to treatment actually take it (ITT_D = 0.5), the LATE is twice the ITT. This identifies the causal effect for compliers only — those whose treatment status is affected by the instrument. When the first stage is weak, the denominator is near zero, inflating both the point estimate and its variance.

Interactive Simulation

Why Randomization?

DGP: Y = 2.0·D + 2·U + ε. In the randomized arm, D is coin-flip assigned. In the observational arm, P(D=1) = sigmoid(1.5·U), so U confounds the treatment-outcome relationship. N = 200 per arm.

[Figure: two scatterplots of outcome by treatment status, one for the randomized arm and one for the self-selected arm, with the true effect line, treated units, and control units marked.]

Estimation Results

| Estimator | β̂ | SE | 95% CI | Bias |
|---|---|---|---|---|
| OLS (randomized) | 2.219 | 0.320 | [1.59, 2.85] | +0.219 |
| OLS (self-selected) | 4.039 | 0.250 | [3.55, 4.53] | +2.039 |
| True β | 2.000 | | | |
Simulation parameters: N = 200 observations in each arm; true causal effect of D on Y = 2.0; confounding strength = 1.5 (how strongly U drives self-selection into treatment; 0 = no confounding).

Why the difference?

With self-selection (confounding = 1.5), OLS is biased by +2.04 because people who choose to be treated have systematically different unobserved characteristics that also affect the outcome. Randomization eliminates this bias entirely: by assigning treatment randomly, it guarantees that treated and control groups are comparable on average (OLS bias = +0.22).


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: Random assignment makes the treated and control groups identical in expectation, so a simple comparison of group averages recovers the true causal effect.

Start with the observed difference in means:

\Delta = E[Y_i | D_i = 1] - E[Y_i | D_i = 0]

By the switching equation, the observed outcome is Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0). So:

\Delta = E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 0]

Now add and subtract E[Y_i(0) | D_i = 1]:

\Delta = \underbrace{E[Y_i(1) - Y_i(0) | D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) | D_i = 1] - E[Y_i(0) | D_i = 0]}_{\text{Selection Bias}}

Under random assignment, D_i \perp\!\!\!\perp (Y_i(1), Y_i(0)), so the selection bias term is zero:

E[Y_i(0) | D_i = 1] = E[Y_i(0) | D_i = 0] = E[Y_i(0)]

Therefore \Delta = E[Y_i(1) - Y_i(0)] = \text{ATE}.

The ATT also equals the ATE under random assignment, because treatment is independent of potential outcomes.
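The derivation can also be checked by simulation. The sketch below uses an assumed data-generating process in the spirit of the "Why Randomization?" demo above (a constant treatment effect of 2.0 and an unobserved confounder U): under random assignment the selection bias term vanishes, while under self-selection it contaminates the difference in means.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
u = rng.normal(size=n)                 # unobserved confounder
y0 = 2 * u + rng.normal(size=n)        # potential outcome without treatment
y1 = y0 + 2.0                          # constant treatment effect of 2.0

# Randomized: D independent of (Y1, Y0) -> difference in means is unbiased
d_rand = rng.integers(0, 2, size=n)
y_rand = np.where(d_rand == 1, y1, y0)
est_rand = y_rand[d_rand == 1].mean() - y_rand[d_rand == 0].mean()

# Self-selected: P(D=1) rises with U -> selection bias term is nonzero
p = 1 / (1 + np.exp(-1.5 * u))
d_self = (rng.random(n) < p).astype(int)
y_self = np.where(d_self == 1, y1, y0)
est_self = y_self[d_self == 1].mean() - y_self[d_self == 0].mean()

print(est_rand)  # close to the true effect of 2.0
print(est_self)  # well above 2.0: selection bias
```

The gap between the two estimates is exactly the selection bias term E[Y_i(0) | D_i = 1] − E[Y_i(0) | D_i = 0] from the decomposition above.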


E. Implementation

library(fixest)
library(modelsummary)

# ---- Balance table ----
# Regress each covariate on treatment to verify randomization worked
balance_vars <- c("age", "female", "income", "education")
bal <- lapply(balance_vars, function(v) {
  feols(as.formula(paste(v, "~ treatment")), data = df)
})
# Display balance results; small coefficients = good randomization
modelsummary(bal, stars = TRUE)

# ---- ITT estimate ----
# Simple regression of outcome on treatment assignment
# vcov = "HC1" gives heteroskedasticity-robust standard errors
itt <- feols(outcome ~ treatment, data = df, vcov = "HC1")
summary(itt)

# ---- LATE via IV (for non-compliance) ----
# fixest IV syntax: outcome ~ exogenous | endogenous ~ instrument
# Uses assignment as instrument for actual take-up
late <- feols(outcome ~ 1 | takeup ~ assignment, data = df, vcov = "HC1")
summary(late)

F. Diagnostics and Robustness Checks

Balance Checks

An essential diagnostic for any experiment. Compare pre-treatment covariates across treatment and control groups. Report:

  • Group means and standard deviations
  • Difference and its p-value (or standardized difference)
  • An F-test for joint significance of all covariates predicting treatment
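The per-covariate comparison can be sketched with two statistics: the two-sample t-statistic and the standardized difference. The `balance_stat` helper and the data below are hypothetical, assuming numpy is available:

```python
import numpy as np

def balance_stat(x, d):
    """Two-sample t-statistic and standardized difference for covariate x."""
    x1, x0 = x[d == 1], x[d == 0]
    diff = x1.mean() - x0.mean()
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x0.var(ddof=1) / len(x0))
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return diff / se, diff / pooled_sd

rng = np.random.default_rng(1)
n = 1000
d = rng.integers(0, 2, size=n)
age = rng.normal(40, 10, size=n)   # drawn independently of d: balance holds
t_stat, std_diff = balance_stat(age, d)
print(t_stat, std_diff)  # both should be small under successful randomization
```

A standardized difference below about 0.1 is a common informal benchmark for good balance; the regression-based version in the Implementation section delivers the same comparison with robust standard errors.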

Attrition Checks

Attrition (people dropping out of the study) is only a problem if it is differential — if treatment causes people to leave the sample at different rates. Check:

  1. Is the attrition rate similar across treatment and control?
  2. Among non-attritors, is balance still maintained?
  3. Consider Lee bounds for worst-case scenarios.
(Lee, 2009)
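Lee bounds trim the group with less attrition by the excess share of observed outcomes, from the top for a lower bound and from the bottom for an upper bound. A minimal sketch, assuming the treatment group retains more subjects (the `lee_bounds` helper and the toy numbers are illustrative, not from Lee, 2009):

```python
import numpy as np

def lee_bounds(y_treat_obs, n_treat, y_ctrl_obs, n_ctrl):
    """Worst-case bounds on the treatment effect under differential attrition.
    Assumes the treatment group has the HIGHER share of observed outcomes."""
    q_t = len(y_treat_obs) / n_treat        # share observed, treatment
    q_c = len(y_ctrl_obs) / n_ctrl          # share observed, control
    p = (q_t - q_c) / q_t                   # excess share to trim
    k = int(round(p * len(y_treat_obs)))    # number of observations to trim
    y_sorted = np.sort(y_treat_obs)
    lower = y_sorted[:len(y_sorted) - k].mean() - np.mean(y_ctrl_obs)  # trim top
    upper = y_sorted[k:].mean() - np.mean(y_ctrl_obs)                  # trim bottom
    return lower, upper

# Toy example: 10 treated (all observed), 10 controls (only 8 observed)
y_t = np.array([1., 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_c = np.array([1., 2, 3, 4, 5, 6, 7, 8])
lo, hi = lee_bounds(y_t, 10, y_c, 10)
print(lo, hi)  # 0.0 2.0
```

A wide interval signals that attrition, not sampling noise, is the binding threat to the estimate.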

Compliance Checks

Report the first-stage compliance rate: what fraction of the assigned-to-treatment group actually received treatment? A first-stage below 100% means you need to decide between ITT and LATE.


Interpreting Results

  • The ITT is a policy-relevant parameter: it tells you what happens when you roll out an intervention in practice, including non-compliance.
  • The LATE tells you what the treatment does for people who actually take it up, but it only applies to compliers.
  • If compliance is near 100%, ITT and LATE are essentially the same.
  • Report the ITT as the primary estimate; report the LATE as a complement, not a replacement.

G. What Can Go Wrong

| Threat | What It Does | How to Diagnose |
|---|---|---|
| Non-compliance | Creates a gap between assignment and receipt | Report compliance rates; use LATE/IV |
| Attrition | Breaks random assignment if differential | Compare attrition rates; Lee bounds |
| Spillovers (SUTVA violation) | Treatment affects control group outcomes | Look for evidence of contamination; use designs that minimize contact |
| Hawthorne effects | Subjects change behavior because they know they are observed | Use double-blind designs; compare to administrative data |
| Demand effects | Subjects figure out the hypothesis and behave accordingly | Careful framing; use deception where ethical |
| Low power | Fail to detect real effects | Pre-registration with power analysis |
Assumption Failure Demo

Differential Attrition

Attrition is 8% in treatment and 9% in control (no significant difference), and balance is maintained among non-attritors

ITT estimate: -0.05 ER visits (SE = 0.02). Lee bounds: [-0.08, -0.02]. Attrition does not threaten internal validity.

Assumption Failure Demo

Non-Compliance Ignored in Analysis

Compliance is 25%. ITT is reported as the primary estimate; LATE is computed via IV using assignment as an instrument for take-up

ITT = -0.05 ER visits. LATE (for compliers) = -0.20. Both estimates are clearly labeled and interpreted.

Assumption Failure Demo

SUTVA Violation (Spillovers)

Treatment and control groups are in separate villages with no interaction, so one group's treatment does not affect the other's outcomes

ITT = 0.15 SD improvement in test scores. No evidence of contamination between groups.

Concept Check

In the Oregon Health Insurance Experiment, about 25% of lottery winners actually enrolled in Medicaid. If the ITT estimate of the effect on emergency room visits is -0.05, what is the LATE?


H. Practice

Concept Check

A researcher runs an RCT but 30% of the treatment group does not take up the intervention. She drops non-compliers from the treatment group and compares the remaining treated individuals to the full control group. What is the problem?

Concept Check

In a cluster-randomized trial, 50 villages are assigned to treatment and 50 to control. A child in a treated village plays with untreated children from a neighboring control village, and the intervention's benefits spill over. What assumption is violated?

Concept Check

An experiment randomizes 500 students to tutoring (250) or control (250). After 6 months, 60 students in the treatment group and 15 in the control group have left the study. The researcher reports the ITT using only the remaining students. Should you be concerned?

Concept Check

A firm randomizes which customers receive a discount coupon. Customers who receive the coupon share it with their friends (who are in the control group). What is the likely effect on the ITT estimate?

Guided Exercise

You run an RCT of a tutoring program on test scores. 200 students are randomly assigned: 100 to tutoring, 100 to control. Of the 100 assigned to tutoring, 80 actually attend. The average test score in the treatment group (all 100) is 78 and in the control group is 72.

Calculate the ITT, the first-stage compliance rate, and the LATE.

What is the ITT estimate?

What is the first-stage compliance rate?

What is the LATE?
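One way to check your answers, using only the numbers given in the exercise:

```python
# Numbers from the exercise above
mean_treat_assigned = 78.0   # average score, everyone assigned to tutoring
mean_control = 72.0
n_assigned, n_attend = 100, 80

itt = mean_treat_assigned - mean_control   # effect of being assigned
compliance = n_attend / n_assigned         # first-stage take-up rate
late = itt / compliance                    # Wald ratio: ITT / compliance

print(itt, compliance, late)  # 6.0 0.8 7.5
```

Note how the LATE (7.5 points) exceeds the ITT (6 points): diluting the treatment group with 20 non-attenders shrinks the assignment effect by exactly the compliance rate.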

Error Detective

Read the analysis below carefully and identify the errors.

A health economist runs an RCT of a job training program on employment outcomes. 500 individuals are randomized: 250 to training, 250 to control. After 12 months, 40 participants in the treatment group and 10 in the control group have dropped out of the study. The researcher analyzes only the remaining participants and reports: "The training program increased employment by 12 percentage points (p = 0.003). Because treatment was randomly assigned, this coefficient is a causal estimate free from selection bias. We find no evidence that attrition is a concern because our sample size remains large (N = 450)."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

A development economist evaluates a conditional cash transfer (CCT) program. Villages are randomly assigned to treatment (receive CCT) or control. The researcher finds that treated villages have 15% higher school enrollment. They then want to estimate the effect on test scores, but test scores are only available for enrolled students. They report: "Among enrolled students, treated villages score 2 points higher on standardized tests (p = 0.04). Combined with the enrollment effect, the CCT program improves both access to and quality of education."

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors study whether providing information about calorie content at restaurants reduces calorie consumption. They randomize 80 restaurants in a large city: 40 display prominent calorie labels on menus, 40 serve as controls. After 6 months, they survey customers exiting each restaurant about their meal choices. They find that calorie labeling reduces average calories ordered by 45 kcal (SE = 18, p = 0.013). The first stage shows 95% compliance (38 of 40 treatment restaurants displayed labels). They report only the ITT.

Key Table

| Variable | Coefficient | SE | p-value |
|---|---|---|---|
| Assigned to labeling | -45.2 | 18.1 | 0.013 |
| Customer age | 2.1 | 0.8 | 0.009 |
| Customer female | -82.3 | 15.4 | 0.000 |
| Weekend visit | 67.8 | 14.2 | 0.000 |

Restaurant FE: No. Clustered SEs: Restaurant. N (customers): 12,400.

Authors' Identification Claim

Random assignment of calorie labeling across restaurants ensures that the treatment and control groups are comparable in expectation, yielding an unbiased estimate of the effect of calorie information on ordering behavior.


I. Swap-In: When to Use Something Else

If randomization is infeasible (ethical constraints, cost, or lack of control), the closest alternatives are the quasi-experimental designs:

  • Instrumental variables / natural experiments (as-if random variation in treatment)
  • Regression discontinuity (local randomization around a cutoff)
  • Difference-in-differences (parallel trends across groups over time)
  • Selection on observables (matching or regression adjustment)

For any of these approaches, sensitivity analysis is essential for assessing how robust your conclusions are to potential violations of identifying assumptions. The further you move from randomization, the more assumptions you need, and the less credible your causal claims become. But a well-designed quasi-experiment often beats a poorly executed RCT.


J. Reviewer Checklist

Critical Reading Checklist



Tags

design-based · randomization · gold-standard