Experimental Design
The gold standard for internal validity — random assignment eliminates selection bias by design.
One-Line Implementation
R:      feols(outcome ~ treatment, data = df, vcov = "HC1")
Stata:  reg outcome treatment, vce(robust)
Python: smf.ols('outcome ~ treatment', data=df).fit(cov_type='HC1')
Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: The Oregon Health Insurance Experiment
In 2008, Oregon reopened enrollment in its Medicaid program (OHP Standard) but had far more applicants than slots. The state held a lottery — a literal random draw — to decide who would get the opportunity to enroll. The lottery created one of the most important experiments in health economics (Finkelstein et al., 2012).
The researchers could compare lottery winners (who were offered insurance) to lottery losers. Because assignment was random, the two groups were identical in expectation on every dimension — income, health status, education, motivation, everything. Any difference in outcomes could be attributed to the insurance offer itself.
This balance is the power of experimental design. You do not need to measure and control for every confounder. Randomization handles it for you.
But here is the catch: not everyone who won the lottery actually enrolled in Medicaid. And this creates a gap between what was randomly assigned (the offer) and what was actually received (the insurance). Understanding this gap is one of the central lessons of this page.
A. Overview
The average treatment effect (ATE) is defined as:

ATE = E[Y(1) − Y(0)]
The fundamental problem is that we never observe both potential outcomes for the same unit. But random assignment solves the comparison problem — it eliminates selection bias in expectation. When treatment is randomly assigned, potential outcomes are independent of treatment status:

(Y(0), Y(1)) ⊥ D, which implies E[Y(0) | D = 1] = E[Y(0) | D = 0]
In plain language: the average untreated outcome is the same for the treated group and the control group. The control group is a valid stand-in for the counterfactual. A simple difference in means recovers the ATE:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1)] − E[Y(0)] = ATE
This estimator gives an identical point estimate to OLS with a single treatment dummy. With homoscedastic errors and balanced groups, the standard errors are also identical; they differ under heteroscedasticity or imbalance, where robust SEs are preferred.
The Three Pillars of a Good Experiment
- Random assignment — units are allocated to treatment and control by a mechanism the researcher controls.
- No interference — one unit's treatment does not affect another unit's outcome (the stable unit treatment value assumption, or SUTVA).
- Excludability — the assignment mechanism affects outcomes only through the treatment itself, not through other channels.
B. Identification
The Mechanics of Randomization
Randomization creates comparability through a simple but powerful mechanism: it makes treatment assignment statistically independent of potential outcomes.
This independence means there is no selection bias — a concept explored in depth on the selection bias foundations page. The people in the treatment group are, on average, identical to those in the control group in every way — observed and unobserved.
SUTVA (Stable Unit Treatment Value Assumption)
Randomization alone is not sufficient for identification. We also need the stable unit treatment value assumption (SUTVA): no interference between units — one unit's treatment assignment does not affect another unit's outcomes — and no hidden variations of treatment. In the Oregon experiment, this means that one person winning the lottery did not change another person's health outcomes, and that Medicaid coverage was the same for all enrollees. SUTVA violations (such as spillovers) can attenuate or inflate the estimated treatment effect even when randomization is intact.
SUTVA also implicitly rules out general equilibrium effects: scaling up an intervention that works in a small experiment may change market conditions (e.g., a job training program that works for 100 people may depress wages if applied to 100,000). This limitation means that experimentally estimated treatment effects may not survive extrapolation to policy-relevant scales.
Intent-to-Treat (ITT)
When you compare outcomes by assignment (regardless of whether subjects actually took up the treatment), you get the ITT:

ITT = E[Y | Z = 1] − E[Y | Z = 0]

where Z is the random assignment indicator. Under intact randomization and no differential attrition, the ITT is a valid causal effect. It answers: "What is the effect of being assigned to the treatment group?"
LATE for Non-Compliance
In the Oregon experiment, assignment meant winning the lottery, but treatment (actually enrolling in Medicaid) was a choice. Some winners did not enroll (never-takers), and in principle, some non-winners might have found other ways to enroll (always-takers).
Using the lottery as an instrument for actual enrollment, you can estimate the local average treatment effect (LATE):

LATE = (E[Y | Z = 1] − E[Y | Z = 0]) / (E[D | Z = 1] − E[D | Z = 0])

This expression is the Wald ratio: reduced form ÷ first stage. The numerator (E[Y | Z = 1] − E[Y | Z = 0]) is the reduced-form effect of the instrument Z on the outcome Y; the denominator (E[D | Z = 1] − E[D | Z = 0]) is the first-stage effect of Z on the treatment take-up D. The ratio gives you the causal effect of treatment for compliers — those whose treatment status was actually changed by the random assignment.
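The Wald ratio can be computed by hand. A sketch on synthetic data with one-sided non-compliance (the 25% take-up rate and the true LATE of -0.20 are illustrative assumptions, chosen to echo the Oregon numbers used later on this page):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
Z = rng.integers(0, 2, size=n)          # random assignment (lottery)
complier = rng.random(n) < 0.25         # 25% of units take up if assigned
D = Z * complier                        # one-sided non-compliance: D=1 only if assigned
Y = -0.20 * D + rng.normal(size=n)      # assumed true LATE = -0.20

itt = Y[Z == 1].mean() - Y[Z == 0].mean()          # reduced form (ITT)
first_stage = D[Z == 1].mean() - D[Z == 0].mean()  # compliance rate
late = itt / first_stage                           # Wald estimator

print(round(itt, 3), round(first_stage, 3), round(late, 3))
```

With 25% compliance, the ITT is roughly a quarter of the LATE: assignment moves the average outcome only for the quarter of units it actually pushes into treatment.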
C. Visual Intuition
Think of randomization as a shuffling machine. You take your sample of people — with all their differences in motivation, ability, health, income — and you shuffle them into two groups completely at random. Each group ends up being a miniature copy of the other, on average.
The key visual: imagine a balance scale. Before randomization, the treatment group could be heavier on one side (more motivated people, higher income, whatever). After randomization, the scale is balanced — not perfectly for any single experiment, but in expectation across repeated randomizations.
This expectation is why balance tables matter. If your randomization worked, the treatment and control groups should look similar on all observed characteristics. A balance table lets you verify this expectation.
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: Random assignment makes the treated and control groups identical in expectation, so a simple comparison of group averages recovers the true causal effect.
Start with the observed difference in means:

E[Y | D = 1] − E[Y | D = 0]

By the switching equation, the observed outcome is Y = D·Y(1) + (1 − D)·Y(0). So:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1) | D = 1] − E[Y(0) | D = 0]

Now add and subtract E[Y(0) | D = 1]:

= E[Y(1) − Y(0) | D = 1] + { E[Y(0) | D = 1] − E[Y(0) | D = 0] }
= ATT + selection bias

Under random assignment, E[Y(0) | D = 1] = E[Y(0) | D = 0], so the selection bias term is zero:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1) − Y(0) | D = 1] = ATT

Therefore the simple difference in means identifies the ATT.
The ATT also equals the ATE under random assignment, because treatment is independent of potential outcomes.
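The derivation can be checked numerically: simulate both potential outcomes, randomize independently of them, and confirm the selection-bias term vanishes (synthetic data; the constant effect of 5 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Y0 = rng.normal(50, 10, size=n)     # potential outcome if untreated
Y1 = Y0 + 5.0                       # constant treatment effect: ATE = ATT = 5
D = rng.integers(0, 2, size=n)      # assigned independently of (Y0, Y1)
Y = D * Y1 + (1 - D) * Y0           # switching equation

# Selection bias term from the derivation: E[Y0|D=1] - E[Y0|D=0]
selection_bias = Y0[D == 1].mean() - Y0[D == 0].mean()
diff_means = Y[D == 1].mean() - Y[D == 0].mean()
print(round(selection_bias, 2), round(diff_means, 2))
```

The selection-bias term hovers around zero (sampling noise only), and the difference in means lands on the true effect.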
E. Implementation
# Requires: fixest, modelsummary
library(fixest) # fixest: fast estimation with robust/clustered SEs
library(modelsummary) # modelsummary: publication-quality regression tables
# --- Step 1: Balance table (verify randomization) ---
# Regress each pre-treatment covariate on the treatment indicator
# Small, insignificant coefficients = good randomization balance
balance_vars <- c("age", "female", "income", "education")
bal <- lapply(balance_vars, function(v) {
feols(as.formula(paste(v, "~ treatment")), data = df)
})
# Focus on magnitudes (standardized differences), not just p-values
modelsummary(bal, stars = TRUE)
# --- Step 2: ITT estimate (intent-to-treat) ---
# Simple difference in means: regress outcome on treatment assignment
# ITT is valid under randomization alone — no compliance assumptions needed
# vcov = "HC1": heteroskedasticity-robust (Huber-White) standard errors
itt <- feols(outcome ~ treatment, data = df, vcov = "HC1")
summary(itt)
# Coefficient on treatment: causal effect of being ASSIGNED to treatment
# --- Step 3: LATE via IV (for non-compliance) ---
# When some assigned units do not take up treatment, ITT < true effect
# feols IV syntax (no fixed effects): outcome ~ exogenous | endogenous ~ instrument
# Uses random assignment as instrument for actual treatment takeup
# LATE = effect on compliers (those who take up when assigned)
late <- feols(outcome ~ 1 | takeup ~ assignment, data = df, vcov = "HC1")
summary(late)
# Coefficient on takeup: LATE (causal effect for compliers only)
F. Diagnostics
Balance Checks
A key diagnostic for any experiment. Compare pre-treatment covariates across treatment and control groups. Report:
- Group means and standard deviations
- Difference and its p-value (or standardized difference)
- An F-test for joint significance of all covariates predicting treatment
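The joint F-test in the last item can be sketched in Python with statsmodels, paralleling the R balance check above (synthetic data; covariate names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),      # assignment, independent of covariates
    "age": rng.normal(40, 10, size=n),
    "female": rng.integers(0, 2, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# Regress treatment on all pre-treatment covariates at once;
# the regression F-statistic tests their joint significance.
fit = smf.ols("treatment ~ age + female + income", data=df).fit()
print(fit.f_pvalue)   # large p-value = covariates jointly fail to predict assignment
```

With genuinely random assignment, the F-test rejects at the 5% level only 5% of the time, so an occasional "failed" balance test in a real experiment is not by itself evidence of broken randomization.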
Attrition Checks
Attrition (people dropping out of the study) is only a problem if it is differential — if treatment causes people to leave the sample at different rates. Check:
- Is the attrition rate similar across treatment and control?
- Among non-attritors, is balance still maintained?
- Consider Lee bounds for worst-case scenarios (Lee, 2009).
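A rough sketch of the Lee (2009) trimming idea, assuming the treatment arm retains more units than control (this toy function is illustrative only; a real analysis should use a vetted implementation):

```python
import numpy as np

def lee_bounds(y_treat, y_ctrl, keep_treat, keep_ctrl):
    """Worst-case bounds on the treatment-control mean difference.
    y_treat, y_ctrl: outcomes for units still observed in each arm.
    keep_treat, keep_ctrl: retention rates; assumes keep_treat >= keep_ctrl."""
    p = (keep_treat - keep_ctrl) / keep_treat   # excess share of observed treated units
    y = np.sort(y_treat)
    k = int(round(p * len(y)))
    lower = y[:len(y) - k].mean() - y_ctrl.mean()   # trim top p share of treated outcomes
    upper = y[k:].mean() - y_ctrl.mean()            # trim bottom p share
    return lower, upper

# Example: treatment retains 90% of units, control only 67.5% -> trim 25%
lo, hi = lee_bounds(np.array([1., 2., 3., 4.]), np.array([0., 1., 2., 3.]),
                    keep_treat=0.90, keep_ctrl=0.675)
print(lo, hi)  # 0.5 1.5
```

When retention is equal in both arms, the trimming share is zero and the bounds collapse to the simple difference in means.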
Compliance Checks
Report the first-stage compliance rate: what fraction of the assigned-to-treatment group actually received treatment? A first-stage below 100% means you need to decide between ITT and LATE.
Interpreting Your Results
- The ITT is a policy-relevant parameter: it tells you what happens when you roll out an intervention in practice, including non-compliance.
- The LATE tells you what the treatment does for people who actually take it up, but it only applies to compliers.
- If compliance is near 100%, ITT and LATE are approximately the same.
- Report the ITT as your primary estimate; present the LATE as a complement, not a replacement.
G. What Can Go Wrong
| Threat | What It Does | How to Diagnose |
|---|---|---|
| Non-compliance | Creates a gap between assignment and receipt | Report compliance rates; use LATE/IV |
| Attrition | Breaks random assignment if differential | Compare attrition rates; Lee bounds |
| Spillovers (SUTVA violation) | Treatment affects control group outcomes | Look for evidence of contamination; use designs that minimize contact |
| Hawthorne effects | Subjects change behavior because they know they are observed | Use double-blind designs; compare to administrative data |
| Demand effects | Subjects figure out the hypothesis and behave accordingly | Careful framing; use deception where ethical |
| Low power | Fail to detect real effects | Pre-registration with power analysis |
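The power analysis in the table's last row can be sketched with statsmodels' power module; for example, the per-arm sample size needed to detect an assumed 0.2 SD effect in a two-group comparison:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per arm for a two-sided t-test:
# effect_size in SD units, 80% power, 5% significance level
n_per_arm = TTestIndPower().solve_power(effect_size=0.2, power=0.8, alpha=0.05)
print(round(n_per_arm))
```

Small effects are expensive: a 0.2 SD effect needs roughly 400 units per arm, while a 0.5 SD effect needs only about 64. Run this before the experiment, and pre-register the target sample size.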
Differential Attrition
Attrition is 8% in treatment and 9% in control (no significant difference), and balance is maintained among non-attritors.
ITT estimate: -0.05 ER visits (SE = 0.02). Lee bounds: [-0.08, -0.02]. Attrition does not threaten internal validity.
Non-Compliance Ignored in Analysis
Compliance is 25%. ITT is reported as the primary estimate; LATE is computed via IV using assignment as an instrument for take-up.
ITT = -0.05 ER visits. LATE (for compliers) = -0.20. Both estimates are clearly labeled and interpreted.
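The arithmetic linking the two estimates is just the Wald ratio; a one-line check (numbers taken from the scenario above):

```python
itt = -0.05          # effect of assignment on ER visits
compliance = 0.25    # first stage: share of winners who enrolled
late = itt / compliance
print(late)  # -0.2
```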
SUTVA Violation (Spillovers)
Treatment and control groups are in separate villages with no interaction, so one group's treatment does not affect the other's outcomes.
ITT = 0.15 SD improvement in test scores. No evidence of contamination between groups.
H. Practice
In a hypothetical lottery-based health insurance experiment, about 25% of lottery winners actually enrolled. If the ITT estimate on a health outcome is -0.05, what is the LATE?
A researcher runs a randomized controlled trial (RCT) but 30% of the treatment group does not take up the intervention. She drops non-compliers from the treatment group and compares the remaining treated individuals to the full control group. What is the problem?
In a cluster-randomized trial, 50 villages are assigned to treatment and 50 to control. A child in a treated village plays with untreated children from a neighboring control village, and the intervention's benefits spill over. What assumption is violated?
An experiment randomizes 500 students to tutoring (250) or control (250). After 6 months, 60 students in the treatment group and 15 in the control group have left the study. The researcher reports the ITT using only the remaining students. Should you be concerned?
A firm randomizes which customers receive a discount coupon. Customers who receive the coupon share it with their friends (who are in the control group). What is the likely effect on the ITT estimate?
You run a randomized controlled trial (RCT) of a tutoring program on test scores. 200 students are randomly assigned: 100 to tutoring, 100 to control. Of the 100 assigned to tutoring, 80 actually attend. The average test score in the treatment group (all 100) is 78 and in the control group is 72. Calculate the ITT, the first-stage compliance rate, and the LATE.
Read the analysis below carefully and identify the errors.
A health economist runs an RCT of a job training program on employment outcomes. 500 individuals are randomized: 250 to training, 250 to control. After 12 months, 40 participants in the treatment group and 10 in the control group have dropped out of the study. The researcher analyzes only the remaining participants and reports:
"The training program increased employment by 12 percentage points (p = 0.003). Because treatment was randomly assigned, this coefficient is a causal estimate free from selection bias. We find no evidence that attrition is a concern because our sample size remains large (N = 450)."
Select all errors you can find:
Read the analysis below carefully and identify the errors.
A development economist evaluates a conditional cash transfer (CCT) program. Villages are randomly assigned to treatment (receive CCT) or control. The researcher finds that treated villages have 15% higher school enrollment. They then want to estimate the effect on test scores, but test scores are only available for enrolled students. They report:
"Among enrolled students, treated villages score 2 points higher on standardized tests (p = 0.04). Combined with the enrollment effect, the CCT program improves both access to and quality of education."
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
The authors study whether providing information about calorie content at restaurants reduces calorie consumption. They randomize 80 restaurants in a large city: 40 display prominent calorie labels on menus, 40 serve as controls. After 6 months, they survey customers exiting each restaurant about their meal choices. They find that calorie labeling reduces average calories ordered by 45 kcal (SE = 18, p = 0.013). The first stage shows 95% compliance (38 of 40 treatment restaurants displayed labels). They report only the ITT.
Key Table
| Variable | Coefficient | SE | p-value |
|---|---|---|---|
| Assigned to labeling | -45.2 | 18.1 | 0.013 |
| Customer age | 2.1 | 0.8 | 0.009 |
| Customer female | -82.3 | 15.4 | 0.000 |
| Weekend visit | 67.8 | 14.2 | 0.000 |
| Restaurant FE | No | | |
| Clustered SEs | Restaurant | | |
| N (customers) | 12,400 | | |
Authors' Identification Claim
Random assignment of calorie labeling across restaurants ensures that the treatment and control groups are comparable in expectation, yielding an unbiased estimate of the effect of calorie information on ordering behavior.
I. Swap-In: When to Use Something Else
If randomization is infeasible (ethical constraints, cost, or lack of control), the closest alternatives are:
- Natural experiments — situations where nature or policy creates as-if random assignment. See IV / 2SLS and Regression Discontinuity.
- Matching — construct a comparison group that looks similar on observables.
- Difference-in-differences — exploit a policy change that affects some groups but not others.
For any of these approaches, sensitivity analysis is important for assessing how robust your conclusions are to potential violations of identifying assumptions. The further you move from randomization, the more assumptions you need, and the less credible your causal claims become. But a well-designed quasi-experiment often beats a poorly executed RCT.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (7)
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.
Angrist, Imbens, and Rubin formalize the LATE framework — originally introduced in Imbens and Angrist (1994) — within the Rubin Causal Model, providing a detailed treatment of the assumptions required for causal interpretation of IV estimates. This paper introduces the complier taxonomy (always-takers, never-takers, compliers, defiers) that is now standard in the IV literature. The practical implication is that IV estimates should be interpreted as local to the complier subpopulation, not as average effects for the entire population.
Athey, S., & Imbens, G. W. (2017). The Econometrics of Randomized Experiments.
Athey and Imbens provide a modern, rigorous treatment of the econometrics behind randomized experiments. They cover design, analysis, and inference issues such as stratification, clustering, and multiple hypothesis testing. It is an excellent reference for researchers running field experiments.
Bruhn, M., & McKenzie, D. (2009). In Pursuit of Balance: Randomization in Practice in Development Field Experiments.
Bruhn and McKenzie compare different randomization methods—simple, stratified, and pairwise—in practice and show that stratified randomization substantially improves balance on baseline covariates and increases statistical power. They provide practical recommendations for choosing among randomization procedures in field experiments.
Dunning, T. (2012). Natural Experiments in the Social Sciences: A Design-Based Approach.
Dunning provides a systematic framework for identifying and analyzing natural experiments across the social sciences. The book covers as-if random assignment, instrumental variables, regression discontinuity, and difference-in-differences through a unified design-based lens, making it essential reading for researchers exploiting natural variation for causal inference.
Fisher, R. A. (1935). The Design of Experiments.
Fisher's classic book lays the foundations of experimental design, introducing concepts like randomization, blocking, and factorial designs. The 'lady tasting tea' example from this book remains one of the most famous illustrations of hypothesis testing and the logic of controlled experiments.
Harrison, G. W., & List, J. A. (2004). Field Experiments.
Harrison and List provide an influential taxonomy of field experiments, distinguishing artefactual, framed, and natural field experiments from conventional lab experiments. The paper helps establish field experiments as a mainstream methodology in economics.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.
Rubin formalizes the 'potential outcomes' framework that is now central to causal inference. The idea is simple but powerful: each unit has a potential outcome under treatment and under control, and the causal effect is the difference. This paper is the origin of what is now called the Rubin Causal Model.
Application (18)
Acquisti, A., & Fong, C. M. (2020). An Experiment in Hiring Discrimination via Online Social Networks.
Acquisti and Fong conduct a correspondence experiment using social media profiles to study hiring discrimination based on religion and sexual orientation. They find no significant national-level discrimination against Muslim or gay candidates, but significant anti-Muslim discrimination emerges in Republican-leaning areas. The paper illustrates how online information creates new channels for employment discrimination that vary with local attitudes.
Bandiera, O., Barankay, I., & Rasul, I. (2005). Social Preferences and the Response to Incentives: Evidence from Personnel Data.
Bandiera, Barankay, and Rasul use a field experiment in a fruit-picking firm to study how switching from relative to piece-rate pay affects productivity. They demonstrate that social preferences among workers matter for incentive design, bridging experimental economics and management.
Banerjee, A., Duflo, E., Goldberg, N., Karlan, D., Osei, R., Pariente, W., Shapiro, J., Thuysbaert, B., & Udry, C. (2015). A Multifaceted Program Causes Lasting Progress for the Very Poor: Evidence from Six Countries.
Banerjee, Duflo, and colleagues conduct a large-scale RCT across six countries, demonstrating that a multifaceted anti-poverty program produces sustained economic gains for the ultra-poor. The study is notable for its multi-site design, which provides rare multi-country evidence on how the same intervention performs across diverse contexts. It demonstrates both the power of randomized evaluation at scale and the importance of bundled interventions when individual components may be insufficient.
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.
Bertrand and Mullainathan send fictitious resumes with randomly assigned names to employers and find that 'white-sounding' names receive 50% more callbacks in this famous audit study. It is one of the most widely cited field experiments in social science and a powerful example of how randomization can identify discrimination.
Bloom, N., Liang, J., Roberts, J., & Ying, Z. J. (2015). Does Working from Home Work? Evidence from a Chinese Experiment.
Bloom and colleagues conduct a large-scale randomized experiment at a Chinese travel agency, finding that working from home leads to a 13% performance increase. The study becomes a landmark reference in management and labor economics for its clean experimental design applied to a practical workplace question.
Camuffo, A., Cordova, A., Gambardella, A., & Spina, C. (2020). A Scientific Approach to Entrepreneurial Decision Making: Evidence from a Randomized Control Trial.
Camuffo and colleagues conduct a randomized controlled trial with 116 Italian startups, randomly assigning half to receive training in a 'scientific' approach to entrepreneurial decision-making (formulating and testing hypotheses before committing resources). Treated startups perform better, are more likely to pivot, and are not more likely to drop out, providing experimental evidence that structured decision-making improves entrepreneurial outcomes.
Camuffo, A., Gambardella, A., Messinese, D., Novelli, E., Paolucci, E., & Spina, C. (2024). A Scientific Approach to Entrepreneurial Decision-Making: Large-Scale Replication and Extension.
Camuffo and colleagues conduct four randomized controlled trials with 759 firms across Italy, the UK, and India, replicating and extending their earlier finding that training entrepreneurs to adopt a 'scientific' approach to decision-making improves venture performance. The multi-site, multi-country design provides strong evidence on the external validity of the original RCT findings.
Chatterji, A. K., Findley, M., Jensen, N. M., Meier, S., & Nielson, D. (2016). Field Experiments in Strategy Research.
Chatterji, Findley, Jensen, Meier, and Nielson make the case for using field experiments in strategy research and provide practical guidance for doing so. They discuss internal validity, external validity, and ethical considerations specific to strategy scholars.
Crepon, B., Duflo, E., Gurgand, M., Rathelot, R., & Zamora, P. (2013). Do Labor Market Policies Have Displacement Effects? Evidence from a Clustered Randomized Experiment.
Crepon and colleagues evaluate a job placement assistance program in France using a two-step clustered randomization design that varies treatment intensity across 235 labor markets. The paper's key contribution is identifying displacement effects: treated job seekers gain at the expense of untreated competitors, particularly in weak labor markets and among workers with similar skills. This innovative experimental design allows estimation of both direct and indirect (general equilibrium) effects of active labor market policies.
Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J. P., Allen, H., Baicker, K., & The Oregon Health Study Group (2012). The Oregon Health Insurance Experiment: Evidence from the First Year.
Finkelstein and colleagues analyze the Oregon Health Insurance Experiment, in which uninsured low-income adults are selected by lottery for the chance to apply for Medicaid. Using this randomized controlled design with IV to handle noncompliance, they estimate the local average treatment effect of Medicaid coverage on health care utilization, financial strain, and self-reported health. The study demonstrates the practical difference between intent-to-treat and LATE estimates in a real-world experiment where not all lottery winners enrolled.
Friebel, G., Heinz, M., & Zubanov, N. (2022). Middle Managers, Personnel Turnover, and Performance: A Long-Term Field Experiment in a Retail Chain.
Friebel, Heinz, and Zubanov conduct a long-term randomized field experiment in a large Eastern European retail chain, in which the CEO asked treated store managers to reduce employee quit rates. The intervention decreased the quit rate by a fifth to a quarter, lasting nine months before petering out, but reappearing after a reminder. However, there is no treatment effect on sales, illustrating that reducing turnover does not automatically translate into improved store performance.
Gornall, W., & Strebulaev, I. A. (2025). Gender, Race, and Entrepreneurship: A Randomized Field Experiment on Venture Capitalists and Angels.
Gornall and Strebulaev conduct a large-scale correspondence experiment, sending approximately 80,000 pitch emails from fictitious startups to 28,000 venture capitalists and angel investors. By randomly varying the entrepreneur's name to signal gender and race, they find that female entrepreneurs received 9% more interested replies and Asian-surname entrepreneurs received 6% more responses than White-surname entrepreneurs, indicating favorable rather than adverse bias. The paper provides large-scale experimental evidence on investor response patterns by entrepreneur demographics in entrepreneurial finance.
Grant, A. M. (2008). The Significance of Task Significance: Job Performance Effects, Relational Mechanisms, and Boundary Conditions.
Grant conducts field experiments showing that briefly exposing workers to the beneficiaries of their work significantly increased their motivation and performance. This paper is a well-known example of experimental design applied within organizational behavior research.
Hoogendoorn, S., Parker, S. C., & van Praag, M. (2017). Smart or Diverse Start-up Teams? Evidence from a Field Experiment.
Hoogendoorn, Parker, and van Praag conduct a field experiment with 573 students randomly assigned to 49 startup teams that varied in cognitive ability dispersion. They find an inverted U-shaped relationship between ability dispersion and team performance, with moderately diverse teams in ability outperforming both homogeneous and highly dispersed teams. The random assignment to teams ensures that ability composition is exogenous, providing clean experimental identification of the effect of team cognitive diversity on venture performance.
Hurst, R., Lee, S., & Frake, J. (2024). The Effect of Flatter Hierarchy on Applicant Pool Gender Diversity: Evidence from Experiments.
Hurst, Lee, and Frake conduct a reverse audit study in partnership with a U.S. healthcare startup, sending recruitment emails to approximately 8,400 job seekers with randomly varied descriptions of the firm's organizational hierarchy. Featuring a flatter hierarchy did not significantly affect applicant pool size but significantly decreased women's representation, because women perceived flatter structures as offering fewer career advancement opportunities and greater workload burdens.
Jia, N., Luo, X., Fang, Z., & Liao, C. (2024). When and How Artificial Intelligence Augments Employee Creativity.
Jia, Luo, Fang, and Liao conduct a field experiment examining how AI assistance affects creative work through a sequential division of labor. They find that AI augmentation improves average output quality but reduces the novelty of top-performing work, with effects moderated by employee skill level. The paper provides causal evidence on the productivity implications of human-AI collaboration in knowledge work.
Kang, S. K., DeCelles, K. A., Tilcsik, A., & Jun, S. (2016). Whitened Résumés: Race and Self-Presentation in the Labor Market.
Kang and colleagues conduct a résumé audit study sending fictitious applications to real employers, finding that minority applicants who 'whitened' their résumés received significantly more callbacks. The study combines a correspondence experiment with qualitative interviews, providing a powerful example of how audit studies can identify discrimination in hiring.
Pongeluppe, L. S. (2024). The Allegory of the Favela: The Multifaceted Effects of Socioeconomic Mobility.
Pongeluppe conducts a randomized controlled trial of a business training program offered to residents of Brazilian favelas, complementing the experiment with quantile regressions, field visits, and interviews. The results show that training improves economic outcomes such as income and entrepreneurship participation, but also intensifies participants' experiences of favela-related stigma, revealing that socioeconomic mobility can simultaneously generate material benefits and psychosocial costs.
Survey (5)
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.
Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.
Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.
Duflo, Glennerster, and Kremer write a comprehensive practical guide to running randomized experiments in development economics. The chapter covers all stages from design to analysis, including power calculations, stratification, dealing with attrition, and estimating treatment effects with imperfect compliance. It has become required reading for anyone designing a field experiment.
Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation.
Gerber and Green write a comprehensive textbook on field experiments covering randomization, blocking, clustering, noncompliance, and attrition. The book provides rigorous treatment of experimental design principles with practical guidance drawn from political science and public policy applications. It is particularly valuable for its coverage of complications that arise in real-world experiments, including how to handle noncompliance through intent-to-treat analysis and instrumental variables.
List, J. A., Sadoff, S., & Wagner, M. (2011). So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design.
List, Sadoff, and Wagner provide rules of thumb for sample size, treatment assignment, and other design decisions in field experiments in this practical guide. It is a useful starting point for researchers planning their first experiment.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Wooldridge's graduate textbook is the standard reference for cross-section and panel data econometrics. Chapters 10-11 provide a thorough treatment of fixed effects, random effects, and related panel data methods, while later chapters cover general estimation methodology (MLE, GMM, M-estimation) with panel data applications throughout. The book covers both linear and nonlinear models with careful attention to assumptions.