MethodAtlas
Method · beginner · 11 min read

Experimental Design

The gold standard for internal validity — random assignment eliminates selection bias by design.

When to Use: When you can randomly assign treatment to units, or when a natural lottery creates as-if random assignment. The benchmark for all other causal inference methods.
Assumptions: Random assignment (treatment independent of potential outcomes), SUTVA (no interference between units), and excludability (assignment affects outcomes only through treatment). With noncompliance, monotonicity is also needed for LATE.
Common Mistake: Conflating the intent-to-treat (ITT) effect with the treatment-on-the-treated (TOT) effect when there is noncompliance, or ignoring differential attrition between treatment and control groups.

One-Line Implementation

R:      feols(outcome ~ treatment, data = df, vcov = 'HC1')
Stata:  reg outcome treatment, vce(robust)
Python: smf.ols('outcome ~ treatment', data=df).fit(cov_type='HC1')


Motivating Example: The Oregon Health Insurance Experiment

In 2008, Oregon reopened enrollment in its Medicaid program (OHP Standard) but had far more applicants than slots. The state held a lottery — a literal random draw — to decide who would get the opportunity to enroll. The lottery created one of the most important experiments in health economics (Finkelstein et al., 2012).

The researchers could compare lottery winners (who were offered insurance) to lottery losers. Because assignment was random, the two groups were identical in expectation on every dimension — income, health status, education, motivation, everything. Any difference in outcomes could be attributed to the insurance offer itself.

This balance is the power of experimental design. You do not need to measure and control for every confounder. Randomization handles it for you.

But here is the catch: not everyone who won the lottery actually enrolled in Medicaid. And this creates a gap between what was randomly assigned (the offer) and what was actually received (the insurance). Understanding this gap is one of the central lessons of this page.


A. Overview

The average treatment effect (ATE) is defined as:

\text{ATE} = E[Y(1) - Y(0)]

The fundamental problem is that we never observe both potential outcomes for the same unit. But random assignment solves the comparison problem — it eliminates selection bias in expectation, which is why it serves as the benchmark for careful research design. When treatment is randomly assigned:

E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0]

In plain language: the average untreated outcome is the same for the treated group and the control group. The control group is a valid stand-in for the counterfactual. A simple difference in means recovers the ATE:

\hat{\tau} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}

This estimator gives an identical point estimate to OLS with a single treatment dummy. With homoscedastic errors and balanced groups, the standard errors are also identical; they differ under heteroscedasticity or imbalance, where robust SEs are preferred.
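This equivalence can be checked directly. A minimal simulation (hypothetical data, true ATE set to 1.0) comparing the difference in means with the OLS slope on a treatment dummy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)        # random treatment assignment
y = 1.0 * d + rng.normal(size=n)      # simulated outcome, true ATE = 1.0

# Difference in means between treated and control units
diff_means = y[d == 1].mean() - y[d == 0].mean()

# OLS slope from regressing y on an intercept and the treatment dummy
X = np.column_stack([np.ones(n), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
```

The two estimates agree up to floating-point error, illustrating that the difference in means and the OLS coefficient are the same estimator.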

The Three Pillars of a Good Experiment

  1. Random assignment — units are allocated to treatment and control by a mechanism the researcher controls.
  2. No interference — one unit's treatment does not affect another unit's outcome (the stable unit treatment value assumption, or SUTVA).
  3. Excludability — the assignment mechanism affects outcomes only through the treatment itself, not through other channels.
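The first pillar is mechanical enough to sketch in code. A minimal example (hypothetical sample of 100 units) of complete randomization, where exactly half the units are treated via a random permutation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
treatment = np.zeros(n, dtype=int)
# Randomly permute unit indices and treat the first half
treatment[rng.permutation(n)[: n // 2]] = 1

print(treatment.sum())  # 50
```

Because the permutation is independent of every unit characteristic, assignment is independent of potential outcomes by construction.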



B. Identification

The Mechanics of Randomization

Randomization creates comparability between groups through a simple but powerful mechanism: it makes treatment assignment statistically independent of potential outcomes.

D_i \perp\!\!\!\perp (Y_i(1), Y_i(0))

This independence means there is no selection bias — a concept explored in depth on the selection bias foundations page. The people in the treatment group are, on average, identical to those in the control group in every way — observed and unobserved.

SUTVA (Stable Unit Treatment Value Assumption)

Randomization alone is not sufficient for identification. We also need the stable unit treatment value assumption (SUTVA): no interference between units — one unit's treatment assignment does not affect another unit's outcomes — and no hidden variations of treatment. In the Oregon experiment, this means that one person winning the lottery did not change another person's health outcomes, and that Medicaid coverage was the same for all enrollees. SUTVA violations (such as spillovers) can attenuate or inflate the estimated treatment effect even when randomization is intact.

SUTVA also implicitly rules out general equilibrium effects: scaling up an intervention that works in a small experiment may change market conditions (e.g., a job training program that works for 100 people may depress wages if applied to 100,000). This limitation means that experimentally estimated treatment effects may not survive extrapolation to policy-relevant scales.

When you compare outcomes by assignment (regardless of whether subjects actually took up the treatment), you get the ITT:

\text{ITT} = E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]

where Z_i is the random assignment indicator. Under intact randomization and no differential attrition, the ITT is a valid causal effect. It answers: "What is the effect of being assigned to the treatment group?"

LATE for Non-Compliance

In the Oregon experiment, Z_i was winning the lottery, but D_i (actually enrolling in Medicaid) was a choice. Some winners did not enroll (never-takers), and in principle, some non-winners might have found other ways to enroll (always-takers).

Using the lottery as an instrument for actual enrollment, you can estimate the local average treatment effect (LATE):

\text{LATE} = \frac{\text{ITT}_Y}{\text{ITT}_D} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}

This expression is the Wald ratio: reduced form ÷ first stage. The numerator (\text{ITT}_Y) is the reduced-form effect of the instrument Z on the outcome Y; the denominator (\text{ITT}_D) is the first-stage effect of Z on the treatment take-up D. The ratio gives you the causal effect of treatment for compliers — those whose treatment status was actually changed by the random assignment.
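The Wald ratio can be verified numerically. A minimal simulation (hypothetical numbers: one-sided noncompliance, 25% compliers, a true effect of 2.0 for takers):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z = rng.integers(0, 2, size=n)            # lottery (instrument)
complier = rng.random(n) < 0.25           # 25% of units are compliers
d = ((z == 1) & complier).astype(float)   # take-up only when assigned
y = 2.0 * d + rng.normal(size=n)          # true effect for takers = 2.0

itt_y = y[z == 1].mean() - y[z == 0].mean()   # reduced form
itt_d = d[z == 1].mean() - d[z == 0].mean()   # first stage, ~0.25
late = itt_y / itt_d                          # Wald ratio, ~2.0
```

The reduced form is diluted to roughly a quarter of the true effect, and dividing by the first stage recovers the effect for compliers.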


C. Visual Intuition

Think of randomization as a shuffling machine. You take your sample of people — with all their differences in motivation, ability, health, income — and you shuffle them into two groups completely at random. Each group ends up being a miniature copy of the other, on average.

The key visual: imagine a balance scale. Before randomization, the treatment group could be heavier on one side (more motivated people, higher income, whatever). After randomization, the scale is balanced — not perfectly for any single experiment, but in expectation across repeated randomizations.

This expectation is why balance tables matter. If your randomization worked, the treatment and control groups should look similar on all observed characteristics. A balance table lets you verify this expectation.
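A balance table of this kind can be sketched as follows (hypothetical covariates on simulated data; with true randomization, every standardized difference should hover near zero):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
d = rng.integers(0, 2, size=n)            # randomized treatment
covariates = {
    "age": rng.normal(40, 10, n),
    "income": rng.normal(30_000, 8_000, n),
}

# Standardized difference: gap in means over the pooled standard deviation
std_diffs = {}
for name, x in covariates.items():
    m1, m0 = x[d == 1].mean(), x[d == 0].mean()
    pooled_sd = np.sqrt((x[d == 1].var() + x[d == 0].var()) / 2)
    std_diffs[name] = (m1 - m0) / pooled_sd
```

Focusing on standardized differences (rather than p-values alone) keeps the check comparable across covariates measured in different units.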


D. Mathematical Derivation

Don't worry about the notation yet — here's what this means in words: Random assignment makes the treated and control groups identical in expectation, so a simple comparison of group averages recovers the true causal effect.

Start with the observed difference in means:

\Delta = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]

By the switching equation, the observed outcome is Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0). So:

\Delta = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]

Now add and subtract E[Y_i(0) \mid D_i = 1]:

\Delta = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{Selection bias}}

Under random assignment, D_i \perp\!\!\!\perp (Y_i(1), Y_i(0)), so the selection bias term is zero:

E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0] = E[Y_i(0)]

Therefore \Delta = E[Y_i(1) - Y_i(0)] = \text{ATE}.

The ATT also equals the ATE under random assignment, because treatment is independent of potential outcomes.


E. Implementation

# Requires: fixest, modelsummary
library(fixest)        # fixest: fast estimation with robust/clustered SEs
library(modelsummary)  # modelsummary: publication-quality regression tables

# --- Step 1: Balance table (verify randomization) ---
# Regress each pre-treatment covariate on the treatment indicator
# Small, insignificant coefficients = good randomization balance
balance_vars <- c("age", "female", "income", "education")
bal <- lapply(balance_vars, function(v) {
  feols(as.formula(paste(v, "~ treatment")), data = df)
})
# Focus on magnitudes (standardized differences), not just p-values
modelsummary(bal, stars = TRUE)

# --- Step 2: ITT estimate (intent-to-treat) ---
# Simple difference in means: regress outcome on treatment assignment
# ITT is valid under randomization alone — no compliance assumptions needed
# vcov = "HC1": heteroskedasticity-robust (Huber-White) standard errors
itt <- feols(outcome ~ treatment, data = df, vcov = "HC1")
summary(itt)
# Coefficient on treatment: causal effect of being ASSIGNED to treatment

# --- Step 3: LATE via IV (for non-compliance) ---
# When some assigned units do not take up treatment, ITT < true effect
# feols IV syntax: outcome ~ exogenous | FE | endogenous ~ instrument
# Uses random assignment as instrument for actual treatment takeup
# LATE = effect on compliers (those who take up when assigned)
late <- feols(outcome ~ 1 | 0 | takeup ~ assignment, data = df, vcov = "HC1")
summary(late)
# Coefficient on takeup: LATE (causal effect for compliers only)
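A Python counterpart to Steps 2-3 above, sketched on simulated data (the variable names `assignment`, `takeup`, and `outcome` mirror the R snippet; the LATE is computed as a Wald ratio of two OLS regressions rather than via a dedicated IV routine):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
assignment = rng.integers(0, 2, size=n)
complier = rng.random(n) < 0.8                  # 80% compliance
df = pd.DataFrame({
    "assignment": assignment,
    "takeup": ((assignment == 1) & complier).astype(int),
})
df["outcome"] = 1.5 * df["takeup"] + rng.normal(size=n)

# Step 2: ITT with HC1 (heteroskedasticity-robust) standard errors
itt = smf.ols("outcome ~ assignment", data=df).fit(cov_type="HC1")

# Step 3: LATE as a Wald ratio = reduced form / first stage
first = smf.ols("takeup ~ assignment", data=df).fit(cov_type="HC1")
late = itt.params["assignment"] / first.params["assignment"]
```

With 80% compliance, the ITT is about 80% of the complier effect, and dividing by the first stage recovers it.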

F. Diagnostics

Balance Checks

A key diagnostic for any experiment. Compare pre-treatment covariates across treatment and control groups. Report:

  • Group means and standard deviations
  • Difference and its p-value (or standardized difference)
  • An F-test for joint significance of all covariates predicting treatment

Attrition Checks

Attrition (people dropping out of the study) is only a problem if it is differential — if treatment causes people to leave the sample at different rates. Check:

  1. Is the attrition rate similar across treatment and control?
  2. Among non-attritors, is balance still maintained?
  3. Consider Lee bounds for worst-case scenarios (Lee, 2009).
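The trimming logic behind Lee bounds can be sketched with hypothetical numbers (1,000 units per arm, 95% retained in treatment versus 90% in control; Lee bounds additionally assume selection into the sample is monotone in treatment):

```python
import numpy as np

rng = np.random.default_rng(5)
y_treat = np.sort(rng.normal(1.0, 1.0, 950))  # treated non-attritors (sorted)
y_ctrl = rng.normal(0.0, 1.0, 900)            # control non-attritors

# Share of the treated sample to trim = excess retention in treatment
p = (0.95 - 0.90) / 0.95
k = int(round(p * len(y_treat)))              # observations to trim

lower = y_treat[:-k].mean() - y_ctrl.mean()   # trim the top    -> lower bound
upper = y_treat[k:].mean() - y_ctrl.mean()    # trim the bottom -> upper bound
```

Trimming the best (worst) treated outcomes gives a worst-case lower (upper) bound on the treatment effect among units that would remain in the sample either way.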

Compliance Checks

Report the first-stage compliance rate: what fraction of the assigned-to-treatment group actually received treatment? A first-stage below 100% means you need to decide between ITT and LATE.


Interpreting Your Results

  • The ITT is a policy-relevant parameter: it tells you what happens when you roll out an intervention in practice, including non-compliance.
  • The LATE tells you what the treatment does for people who actually take it up, but it only applies to compliers.
  • If compliance is near 100%, ITT and LATE are approximately the same.
  • Always report the ITT as the primary estimate; present the LATE as a complement, not a replacement.

G. What Can Go Wrong

| Threat | What It Does | How to Diagnose |
| --- | --- | --- |
| Non-compliance | Creates a gap between assignment and receipt | Report compliance rates; use LATE/IV |
| Attrition | Breaks random assignment if differential | Compare attrition rates; Lee bounds |
| Spillovers (SUTVA violation) | Treatment affects control group outcomes | Look for evidence of contamination; use designs that minimize contact |
| Hawthorne effects | Subjects change behavior because they know they are observed | Use double-blind designs; compare to administrative data |
| Demand effects | Subjects figure out the hypothesis and behave accordingly | Careful framing; use deception where ethical |
| Low power | Fails to detect real effects | Pre-registration with power analysis |
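The power calculation behind a pre-registration can be sketched with the standard two-sample normal approximation (hypothetical targets: 80% power, 5% two-sided significance, an effect of 0.2 outcome standard deviations):

```python
from statistics import NormalDist

alpha, power = 0.05, 0.80
delta, sigma = 0.2, 1.0                         # effect in outcome SDs

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
z_beta = NormalDist().inv_cdf(power)            # ~0.84
n_per_arm = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

print(round(n_per_arm))  # ~392 per arm (round up in practice)
```

The formula makes the trade-off explicit: halving the detectable effect quadruples the required sample size.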

Differential Attrition

Attrition is 8% in treatment and 9% in control (no significant difference), and balance is maintained among non-attritors.

ITT estimate: -0.05 ER visits (SE = 0.02). Lee bounds: [-0.08, -0.02]. Attrition does not threaten internal validity.


Non-Compliance Ignored in Analysis

Compliance is 25%. ITT is reported as the primary estimate; LATE is computed via IV using assignment as an instrument for take-up.

ITT = -0.05 ER visits. LATE (for compliers) = -0.20. Both estimates are clearly labeled and interpreted.


SUTVA Violation (Spillovers)

Treatment and control groups are in separate villages with no interaction, so one group's treatment does not affect the other's outcomes.

ITT = 0.15 SD improvement in test scores. No evidence of contamination between groups.

Concept Check

In a hypothetical lottery-based health insurance experiment, about 25% of lottery winners actually enrolled. If the ITT estimate on a health outcome is -0.05, what is the LATE?


H. Practice

Concept Check

A researcher runs a randomized controlled trial (RCT) but 30% of the treatment group does not take up the intervention. She drops non-compliers from the treatment group and compares the remaining treated individuals to the full control group. What is the problem?

Concept Check

In a cluster-randomized trial, 50 villages are assigned to treatment and 50 to control. A child in a treated village plays with untreated children from a neighboring control village, and the intervention's benefits spill over. What assumption is violated?

Concept Check

An experiment randomizes 500 students to tutoring (250) or control (250). After 6 months, 60 students in the treatment group and 15 in the control group have left the study. The researcher reports the ITT using only the remaining students. Should you be concerned?

Concept Check

A firm randomizes which customers receive a discount coupon. Customers who receive the coupon share it with their friends (who are in the control group). What is the likely effect on the ITT estimate?

Guided Exercise

Calculate the ITT, the first-stage compliance rate, and the LATE.

You run a randomized controlled trial (RCT) of a tutoring program on test scores. 200 students are randomly assigned: 100 to tutoring, 100 to control. Of the 100 assigned to tutoring, 80 actually attend. The average test score in the treatment group (all 100) is 78 and in the control group is 72.

What is the ITT estimate?

What is the first-stage compliance rate?

What is the LATE?
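After attempting the three questions, you can verify your arithmetic directly from the definitions, using the numbers stated in the exercise:

```python
# ITT: difference in mean test scores by assignment (all 100 vs all 100)
itt = 78 - 72

# First-stage compliance rate: 80 of 100 assigned students attended
compliance = 80 / 100

# LATE: Wald ratio = ITT / first stage
late = itt / compliance

print(itt, compliance, late)  # 6 0.8 7.5
```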

Error Detective

Read the analysis below carefully and identify the errors.

A health economist runs an RCT of a job training program on employment outcomes. 500 individuals are randomized: 250 to training, 250 to control. After 12 months, 40 participants in the treatment group and 10 in the control group have dropped out of the study. The researcher analyzes only the remaining participants and reports:

"The training program increased employment by 12 percentage points (p = 0.003). Because treatment was randomly assigned, this coefficient is a causal estimate free from selection bias. We find no evidence that attrition is a concern because our sample size remains large (N = 450)."

Select all errors you can find:

Error Detective

Read the analysis below carefully and identify the errors.

A development economist evaluates a conditional cash transfer (CCT) program. Villages are randomly assigned to treatment (receive CCT) or control. The researcher finds that treated villages have 15% higher school enrollment. They then want to estimate the effect on test scores, but test scores are only available for enrolled students. They report:

"Among enrolled students, treated villages score 2 points higher on standardized tests (p = 0.04). Combined with the enrollment effect, the CCT program improves both access to and quality of education."

Select all errors you can find:

Referee Exercise

Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.

Paper Summary

The authors study whether providing information about calorie content at restaurants reduces calorie consumption. They randomize 80 restaurants in a large city: 40 display prominent calorie labels on menus, 40 serve as controls. After 6 months, they survey customers exiting each restaurant about their meal choices. They find that calorie labeling reduces average calories ordered by 45 kcal (SE = 18, p = 0.013). The first stage shows 95% compliance (38 of 40 treatment restaurants displayed labels). They report only the ITT.

Key Table

| Variable | Coefficient | SE | p-value |
| --- | --- | --- | --- |
| Assigned to labeling | -45.2 | 18.1 | 0.013 |
| Customer age | 2.1 | 0.8 | 0.009 |
| Customer female | -82.3 | 15.4 | 0.000 |
| Weekend visit | 67.8 | 14.2 | 0.000 |
| Restaurant FE | No | | |
| Clustered SEs | Restaurant | | |
| N (customers) | 12,400 | | |

Authors' Identification Claim

Random assignment of calorie labeling across restaurants ensures that the treatment and control groups are comparable in expectation, yielding an unbiased estimate of the effect of calorie information on ordering behavior.


I. Swap-In: When to Use Something Else

If randomization is infeasible (ethical constraints, cost, or lack of control), the closest alternatives are natural experiments with as-if random assignment, instrumental variables, regression discontinuity, and difference-in-differences.

For any of these approaches, sensitivity analysis is important for assessing how robust your conclusions are to potential violations of identifying assumptions. The further you move from randomization, the more assumptions you need, and the less credible your causal claims become. But a well-designed quasi-experiment often beats a poorly executed RCT.


J. Reviewer Checklist

Critical Reading Checklist


Paper Library

Foundational (7)

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.

Journal of the American Statistical Association. DOI: 10.1080/01621459.1996.10476902

Angrist, Imbens, and Rubin formalize the LATE framework — originally introduced in Imbens and Angrist (1994) — within the Rubin Causal Model, providing a detailed treatment of the assumptions required for causal interpretation of IV estimates. This paper introduces the complier taxonomy (always-takers, never-takers, compliers, defiers) that is now standard in the IV literature. The practical implication is that IV estimates should be interpreted as local to the complier subpopulation, not as average effects for the entire population.

Athey, S., & Imbens, G. W. (2017). The Econometrics of Randomized Experiments.

Handbook of Economic Field Experiments. DOI: 10.1016/bs.hefe.2016.10.003

Athey and Imbens provide a modern, rigorous treatment of the econometrics behind randomized experiments. They cover design, analysis, and inference issues such as stratification, clustering, and multiple hypothesis testing. It is an excellent reference for researchers running field experiments.

Bruhn, M., & McKenzie, D. (2009). In Pursuit of Balance: Randomization in Practice in Development Field Experiments.

American Economic Journal: Applied Economics. DOI: 10.1257/app.1.4.200

Bruhn and McKenzie compare different randomization methods—simple, stratified, and pairwise—in practice and show that stratified randomization substantially improves balance on baseline covariates and increases statistical power. They provide practical recommendations for choosing among randomization procedures in field experiments.

Dunning, T. (2012). Natural Experiments in the Social Sciences: A Design-Based Approach.

Cambridge University Press. DOI: 10.1017/CBO9781139084444

Dunning provides a systematic framework for identifying and analyzing natural experiments across the social sciences. The book covers as-if random assignment, instrumental variables, regression discontinuity, and difference-in-differences through a unified design-based lens, making it essential reading for researchers exploiting natural variation for causal inference.

Fisher, R. A. (1935). The Design of Experiments.

Oliver & Boyd

Fisher's classic book lays the foundations of experimental design, introducing concepts like randomization, blocking, and factorial designs. The 'lady tasting tea' example from this book remains one of the most famous illustrations of hypothesis testing and the logic of controlled experiments.

Harrison, G. W., & List, J. A. (2004). Field Experiments.

Journal of Economic Literature. DOI: 10.1257/0022051043004577

Harrison and List provide an influential taxonomy of field experiments, distinguishing artefactual, framed, and natural field experiments from conventional lab experiments. The paper helps establish field experiments as a mainstream methodology in economics.

Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.

Journal of Educational Psychology. DOI: 10.1037/h0037350

Rubin formalizes the 'potential outcomes' framework that is now central to causal inference. The idea is simple but powerful: each unit has a potential outcome under treatment and under control, and the causal effect is the difference. This paper is the origin of what is now called the Rubin Causal Model.

Application (18)

Acquisti, A., & Fong, C. M. (2020). An Experiment in Hiring Discrimination via Online Social Networks.

Management Science. DOI: 10.1287/mnsc.2018.3269

Acquisti and Fong conduct a correspondence experiment using social media profiles to study hiring discrimination based on religion and sexual orientation. They find no significant national-level discrimination against Muslim or gay candidates, but significant anti-Muslim discrimination emerges in Republican-leaning areas. The paper illustrates how online information creates new channels for employment discrimination that vary with local attitudes.

Bandiera, O., Barankay, I., & Rasul, I. (2005). Social Preferences and the Response to Incentives: Evidence from Personnel Data.

Quarterly Journal of Economics. DOI: 10.1093/qje/120.3.917

Bandiera, Barankay, and Rasul use a field experiment in a fruit-picking firm to study how switching from relative to piece-rate pay affects productivity. They demonstrate that social preferences among workers matter for incentive design, bridging experimental economics and management.

Banerjee, A., Duflo, E., Goldberg, N., Karlan, D., Osei, R., Pariente, W., Shapiro, J., Thuysbaert, B., & Udry, C. (2015). A Multifaceted Program Causes Lasting Progress for the Very Poor: Evidence from Six Countries.

Banerjee, Duflo, and colleagues conduct a large-scale RCT across six countries, demonstrating that a multifaceted anti-poverty program produces sustained economic gains for the ultra-poor. The study is notable for its multi-site design, which provides rare multi-country evidence on how the same intervention performs across diverse contexts. It demonstrates both the power of randomized evaluation at scale and the importance of bundled interventions when individual components may be insufficient.

Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.

American Economic Review. DOI: 10.1257/0002828042002561

Bertrand and Mullainathan send fictitious resumes with randomly assigned names to employers and find that 'white-sounding' names receive 50% more callbacks in this famous audit study. It is one of the most widely cited field experiments in social science and a powerful example of how randomization can identify discrimination.

Bloom, N., Liang, J., Roberts, J., & Ying, Z. J. (2015). Does Working from Home Work? Evidence from a Chinese Experiment.

Quarterly Journal of Economics. DOI: 10.1093/qje/qju032

Bloom and colleagues conduct a large-scale randomized experiment at a Chinese travel agency, finding that working from home leads to a 13% performance increase. The study becomes a landmark reference in management and labor economics for its clean experimental design applied to a practical workplace question.

Camuffo, A., Cordova, A., Gambardella, A., & Spina, C. (2020). A Scientific Approach to Entrepreneurial Decision Making: Evidence from a Randomized Control Trial.

Management Science. DOI: 10.1287/mnsc.2018.3249

Camuffo and colleagues conduct a randomized controlled trial with 116 Italian startups, randomly assigning half to receive training in a 'scientific' approach to entrepreneurial decision-making (formulating and testing hypotheses before committing resources). Treated startups perform better, are more likely to pivot, and are not more likely to drop out, providing experimental evidence that structured decision-making improves entrepreneurial outcomes.

Camuffo, A., Gambardella, A., Messinese, D., Novelli, E., Paolucci, E., & Spina, C. (2024). A Scientific Approach to Entrepreneurial Decision-Making: Large-Scale Replication and Extension.

Strategic Management Journal. DOI: 10.1002/smj.3580

Camuffo and colleagues conduct four randomized controlled trials with 759 firms across Italy, the UK, and India, replicating and extending their earlier finding that training entrepreneurs to adopt a 'scientific' approach to decision-making improves venture performance. The multi-site, multi-country design provides strong evidence on the external validity of the original RCT findings.

Chatterji, A. K., Findley, M., Jensen, N. M., Meier, S., & Nielson, D. (2016). Field Experiments in Strategy Research.

Strategic Management Journal. DOI: 10.1002/smj.2449

Chatterji, Findley, Jensen, Meier, and Nielson make the case for using field experiments in strategy research and provide practical guidance for doing so. They discuss internal validity, external validity, and ethical considerations specific to strategy scholars.

Crepon, B., Duflo, E., Gurgand, M., Rathelot, R., & Zamora, P. (2013). Do Labor Market Policies Have Displacement Effects? Evidence from a Clustered Randomized Experiment.

Quarterly Journal of Economics. DOI: 10.1093/qje/qjt001

Crepon and colleagues evaluate a job placement assistance program in France using a two-step clustered randomization design that varies treatment intensity across 235 labor markets. The paper's key contribution is identifying displacement effects: treated job seekers gain at the expense of untreated competitors, particularly in weak labor markets and among workers with similar skills. This innovative experimental design allows estimation of both direct and indirect (general equilibrium) effects of active labor market policies.

Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J. P., Allen, H., Baicker, K., & The Oregon Health Study Group (2012). The Oregon Health Insurance Experiment: Evidence from the First Year.

Quarterly Journal of Economics. DOI: 10.1093/qje/qjs020

Finkelstein and colleagues analyze the Oregon Health Insurance Experiment, in which uninsured low-income adults are selected by lottery for the chance to apply for Medicaid. Using this randomized controlled design with IV to handle noncompliance, they estimate the local average treatment effect of Medicaid coverage on health care utilization, financial strain, and self-reported health. The study demonstrates the practical difference between intent-to-treat and LATE estimates in a real-world experiment where not all lottery winners enrolled.

Friebel, G., Heinz, M., & Zubanov, N. (2022). Middle Managers, Personnel Turnover, and Performance: A Long-Term Field Experiment in a Retail Chain.

Management Science. DOI: 10.1287/mnsc.2020.3905

Friebel, Heinz, and Zubanov conduct a long-term randomized field experiment in a large Eastern European retail chain, in which the CEO asked treated store managers to reduce employee quit rates. The intervention decreased the quit rate by a fifth to a quarter, lasting nine months before petering out, but reappearing after a reminder. However, there is no treatment effect on sales, illustrating that reducing turnover does not automatically translate into improved store performance.

Gornall, W., & Strebulaev, I. A. (2025). Gender, Race, and Entrepreneurship: A Randomized Field Experiment on Venture Capitalists and Angels.

Management Science. DOI: 10.1287/mnsc.2024.4990

Gornall and Strebulaev conduct a large-scale correspondence experiment, sending approximately 80,000 pitch emails from fictitious startups to 28,000 venture capitalists and angel investors. By randomly varying the entrepreneur's name to signal gender and race, they find that female entrepreneurs received 9% more interested replies and Asian-surname entrepreneurs received 6% more responses than White-surname entrepreneurs, indicating favorable rather than adverse bias. The paper provides large-scale experimental evidence on investor response patterns by entrepreneur demographics in entrepreneurial finance.

Grant, A. M. (2008). The Significance of Task Significance: Job Performance Effects, Relational Mechanisms, and Boundary Conditions.

Journal of Applied Psychology. DOI: 10.1037/0021-9010.93.1.108

Grant conducts field experiments showing that briefly exposing workers to the beneficiaries of their work significantly increased their motivation and performance. This paper is a well-known example of experimental design applied within organizational behavior research.

Hoogendoorn, S., Parker, S. C., & van Praag, M. (2017). Smart or Diverse Start-up Teams? Evidence from a Field Experiment.

Organization Science. DOI: 10.1287/orsc.2017.1158

Hoogendoorn, Parker, and van Praag conduct a field experiment with 573 students randomly assigned to 49 startup teams that varied in cognitive ability dispersion. They find an inverted U-shaped relationship between ability dispersion and team performance, with moderately diverse teams in ability outperforming both homogeneous and highly dispersed teams. The random assignment to teams ensures that ability composition is exogenous, providing clean experimental identification of the effect of team cognitive diversity on venture performance.

Hurst, R., Lee, S., & Frake, J. (2024). The Effect of Flatter Hierarchy on Applicant Pool Gender Diversity: Evidence from Experiments.

Strategic Management Journal. DOI: 10.1002/smj.3590

Hurst, Lee, and Frake conduct a reverse audit study in partnership with a U.S. healthcare startup, sending recruitment emails to approximately 8,400 job seekers with randomly varied descriptions of the firm's organizational hierarchy. Featuring a flatter hierarchy did not significantly affect applicant pool size but significantly decreased women's representation, because women perceived flatter structures as offering fewer career advancement opportunities and greater workload burdens.

Jia, N., Luo, X., Fang, Z., & Liao, C. (2024). When and How Artificial Intelligence Augments Employee Creativity.

Academy of Management Journal. DOI: 10.5465/amj.2022.0426

Jia, Luo, Fang, and Liao conduct a field experiment examining how AI assistance affects creative work through a sequential division of labor. They find that AI augmentation improves average output quality but reduces the novelty of top-performing work, with effects moderated by employee skill level. The paper provides causal evidence on the productivity implications of human-AI collaboration in knowledge work.

Kang, S. K., DeCelles, K. A., Tilcsik, A., & Jun, S. (2016). Whitened Résumés: Race and Self-Presentation in the Labor Market.

Administrative Science Quarterly. DOI: 10.1177/0001839216639577

Kang and colleagues conduct a résumé audit study sending fictitious applications to real employers, finding that minority applicants who 'whitened' their résumés received significantly more callbacks. The study combines a correspondence experiment with qualitative interviews, providing a powerful example of how audit studies can identify discrimination in hiring.

Pongeluppe, L. S. (2024). The Allegory of the Favela: The Multifaceted Effects of Socioeconomic Mobility.

Administrative Science Quarterly. DOI: 10.1177/00018392241240469

Pongeluppe conducts a randomized controlled trial of a business training program offered to residents of Brazilian favelas, complementing the experiment with quantile regressions, field visits, and interviews. The results show that training improves economic outcomes such as income and entrepreneurship participation, but also intensifies participants' experiences of favela-related stigma, revealing that socioeconomic mobility can simultaneously generate material benefits and psychosocial costs.

Survey (5)

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.

Princeton University Press. DOI: 10.1515/9781400829828

Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.

Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.

Handbook of Development Economics. DOI: 10.1016/S1573-4471(07)04061-2

Duflo, Glennerster, and Kremer write a comprehensive practical guide to running randomized experiments in development economics. The chapter covers all stages from design to analysis, including power calculations, stratification, dealing with attrition, and estimating treatment effects with imperfect compliance. It has become required reading for anyone designing a field experiment.

Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation.

W. W. Norton

Gerber and Green write a comprehensive textbook on field experiments covering randomization, blocking, clustering, noncompliance, and attrition. The book provides rigorous treatment of experimental design principles with practical guidance drawn from political science and public policy applications. It is particularly valuable for its coverage of complications that arise in real-world experiments, including how to handle noncompliance through intent-to-treat analysis and instrumental variables.

List, J. A., Sadoff, S., & Wagner, M. (2011). So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design.

Experimental Economics. DOI: 10.1007/s10683-011-9275-7

List, Sadoff, and Wagner provide rules of thumb for sample size, treatment assignment, and other design decisions in field experiments in this practical guide. It is a useful starting point for researchers planning their first experiment.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.

MIT Press

Wooldridge's graduate textbook is the standard reference for cross-section and panel data econometrics. Chapters 10-11 provide a thorough treatment of fixed effects, random effects, and related panel data methods, while later chapters cover general estimation methodology (MLE, GMM, M-estimation) with panel data applications throughout. The book covers both linear and nonlinear models with careful attention to assumptions.

Tags

design-based · randomization · gold-standard