Experimental Design
The gold standard for internal validity — random assignment eliminates selection bias by design.
One-Line Implementation
R:      feols(outcome ~ treatment, data = df, vcov = "HC1")
Stata:  reg outcome treatment, vce(robust)
Python: smf.ols('outcome ~ treatment', data=df).fit(cov_type='HC1')
Download Full Analysis Code
Complete scripts with diagnostics, robustness checks, and result export.
Motivating Example: The Oregon Health Insurance Experiment
In 2008, Oregon reopened enrollment in its Medicaid program (OHP Standard) but had far more applicants than slots. The state held a lottery — a literal random draw — to decide who would get the opportunity to enroll. The lottery created one of the most important experiments in health economics (Finkelstein et al., 2012).
The researchers could compare lottery winners (who were offered insurance) to lottery losers. Because assignment was random, the two groups were identical in expectation on every dimension — income, health status, education, motivation, everything. Any difference in outcomes could be attributed to the insurance offer itself.
This balance is the power of experimental design. You do not need to measure and control for every confounder. Randomization handles it for you.
But here is the catch: not everyone who won the lottery actually enrolled in Medicaid. And this creates a gap between what was randomly assigned (the offer) and what was actually received (the insurance). Understanding this gap is one of the central lessons of this page.
A. Overview
The average treatment effect (ATE) is defined as:

ATE = E[Y(1) − Y(0)]
The fundamental problem is that we never observe both potential outcomes for the same unit. But random assignment solves the comparison problem — it eliminates selection bias in expectation. When treatment is randomly assigned, potential outcomes are independent of treatment status:

(Y(0), Y(1)) ⊥ D, which implies E[Y(0) | D = 1] = E[Y(0) | D = 0]
In plain language: the average untreated outcome is the same for the treated group and the control group. The control group is a valid stand-in for the counterfactual. A simple difference in means recovers the ATE:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1)] − E[Y(0)] = ATE
This estimator gives an identical point estimate to OLS with a single treatment dummy. With homoscedastic errors and balanced groups, the standard errors are also identical; they differ under heteroscedasticity or imbalance, where robust SEs are preferred.
The Three Pillars of a Good Experiment
- Random assignment — units are allocated to treatment and control by a mechanism the researcher controls.
- No interference — one unit's treatment does not affect another unit's outcome (the stable unit treatment value assumption, or SUTVA).
- Excludability — the assignment mechanism affects outcomes only through the treatment itself, not through other channels.
B. Identification
The Mechanics of Randomization
Randomization creates comparability through a simple but powerful mechanism: it makes treatment assignment statistically independent of potential outcomes.
This independence means there is no selection bias — a concept explored in depth on the selection bias foundations page. The people in the treatment group are, on average, identical to those in the control group in every way — observed and unobserved.
SUTVA (Stable Unit Treatment Value Assumption)
Randomization alone is not sufficient for identification. We also need the stable unit treatment value assumption (SUTVA): no interference between units — one unit's treatment assignment does not affect another unit's outcomes — and no hidden variations of treatment. In the Oregon experiment, this means that one person winning the lottery did not change another person's health outcomes, and that Medicaid coverage was the same for all enrollees. SUTVA violations (such as spillovers) can attenuate or inflate the estimated treatment effect even when randomization is intact.
SUTVA also implicitly rules out general equilibrium effects: scaling up an intervention that works in a small experiment may change market conditions (e.g., a job training program that works for 100 people may depress wages if applied to 100,000). This limitation means that experimentally estimated treatment effects may not survive extrapolation to policy-relevant scales.
Intent-to-Treat (ITT)
When you compare outcomes by assignment (regardless of whether subjects actually took up the treatment), you get the ITT:

ITT = E[Y | Z = 1] − E[Y | Z = 0]

where Z is the random assignment indicator. Under intact randomization and no differential attrition, the ITT is a valid causal effect. It answers: "What is the effect of being assigned to the treatment group?"
LATE for Non-Compliance
In the Oregon experiment, assignment meant winning the lottery, but treatment (actually enrolling in Medicaid) was a choice. Some winners did not enroll (never-takers), and in principle, some non-winners might have found other ways to enroll (always-takers).
Using the lottery as an instrument for actual enrollment, you can estimate the local average treatment effect (LATE):

LATE = (E[Y | Z = 1] − E[Y | Z = 0]) / (E[D | Z = 1] − E[D | Z = 0])

This expression is the Wald ratio: reduced form ÷ first stage. The numerator (E[Y | Z = 1] − E[Y | Z = 0]) is the reduced-form effect of the instrument Z on the outcome Y; the denominator (E[D | Z = 1] − E[D | Z = 0]) is the first-stage effect of Z on the treatment take-up D. The ratio gives you the causal effect of treatment for compliers — those whose treatment status was actually changed by the random assignment.
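The Wald ratio can be computed by hand. A sketch on synthetic data with one-sided non-compliance (the 25% take-up rate and the true LATE of -0.20 are illustrative assumptions, chosen to echo the Oregon numbers used later on this page):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
Z = rng.integers(0, 2, size=n)          # random assignment (lottery)
complier = rng.random(n) < 0.25         # 25% of units take up if assigned
D = Z * complier                        # one-sided non-compliance: D=1 only if assigned
Y = -0.20 * D + rng.normal(size=n)      # assumed true LATE = -0.20

itt = Y[Z == 1].mean() - Y[Z == 0].mean()          # reduced form (ITT)
first_stage = D[Z == 1].mean() - D[Z == 0].mean()  # compliance rate
late = itt / first_stage                           # Wald estimator

print(round(itt, 3), round(first_stage, 3), round(late, 3))
```

With 25% compliance, the ITT is roughly a quarter of the LATE: assignment moves the average outcome only for the quarter of units it actually pushes into treatment.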
C. Visual Intuition
Think of randomization as a shuffling machine. You take your sample of people — with all their differences in motivation, ability, health, income — and you shuffle them into two groups completely at random. Each group ends up being a miniature copy of the other, on average.
The key visual: imagine a balance scale. Before randomization, the treatment group could be heavier on one side (more motivated people, higher income, whatever). After randomization, the scale is balanced — not perfectly for any single experiment, but in expectation across repeated randomizations.
This expectation is why balance tables matter. If your randomization worked, the treatment and control groups should look similar on all observed characteristics. A balance table lets you verify this expectation.
D. Mathematical Derivation
Don't worry about the notation yet — here's what this means in words: Random assignment makes the treated and control groups identical in expectation, so a simple comparison of group averages recovers the true causal effect.
Start with the observed difference in means:

E[Y | D = 1] − E[Y | D = 0]

By the switching equation, the observed outcome is Y = D·Y(1) + (1 − D)·Y(0). So:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1) | D = 1] − E[Y(0) | D = 0]

Now add and subtract E[Y(0) | D = 1]:

= E[Y(1) − Y(0) | D = 1] + { E[Y(0) | D = 1] − E[Y(0) | D = 0] }
= ATT + selection bias

Under random assignment, E[Y(0) | D = 1] = E[Y(0) | D = 0], so the selection bias term is zero:

E[Y | D = 1] − E[Y | D = 0] = E[Y(1) − Y(0) | D = 1] = ATT

Therefore the simple difference in means identifies the ATT.
The ATT also equals the ATE under random assignment, because treatment is independent of potential outcomes.
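The derivation can be checked numerically: simulate both potential outcomes, randomize independently of them, and confirm the selection-bias term vanishes (synthetic data; the constant effect of 5 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Y0 = rng.normal(50, 10, size=n)     # potential outcome if untreated
Y1 = Y0 + 5.0                       # constant treatment effect: ATE = ATT = 5
D = rng.integers(0, 2, size=n)      # assigned independently of (Y0, Y1)
Y = D * Y1 + (1 - D) * Y0           # switching equation

# Selection bias term from the derivation: E[Y0|D=1] - E[Y0|D=0]
selection_bias = Y0[D == 1].mean() - Y0[D == 0].mean()
diff_means = Y[D == 1].mean() - Y[D == 0].mean()
print(round(selection_bias, 2), round(diff_means, 2))
```

The selection-bias term hovers around zero (sampling noise only), and the difference in means lands on the true effect.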
E. Implementation
# Requires: fixest, modelsummary
library(fixest) # fixest: fast estimation with robust/clustered SEs
library(modelsummary) # modelsummary: publication-quality regression tables
# --- Step 1: Balance table (verify randomization) ---
# Regress each pre-treatment covariate on the treatment indicator
# Small, insignificant coefficients = good randomization balance
balance_vars <- c("age", "female", "income", "education")
bal <- lapply(balance_vars, function(v) {
feols(as.formula(paste(v, "~ treatment")), data = df)
})
# Focus on magnitudes (standardized differences), not just p-values
modelsummary(bal, stars = TRUE)
# --- Step 2: ITT estimate (intent-to-treat) ---
# Simple difference in means: regress outcome on treatment assignment
# ITT is valid under randomization alone — no compliance assumptions needed
# vcov = "HC1": heteroskedasticity-robust (Huber-White) standard errors
itt <- feols(outcome ~ treatment, data = df, vcov = "HC1")
summary(itt)
# Coefficient on treatment: causal effect of being ASSIGNED to treatment
# --- Step 3: LATE via IV (for non-compliance) ---
# When some assigned units do not take up treatment, ITT < true effect
# feols IV syntax (no fixed effects): outcome ~ exogenous | endogenous ~ instrument
# Uses random assignment as instrument for actual treatment takeup
# LATE = effect on compliers (those who take up when assigned)
late <- feols(outcome ~ 1 | takeup ~ assignment, data = df, vcov = "HC1")
summary(late)
# Coefficient on takeup: LATE (causal effect for compliers only)
F. Diagnostics
Balance Checks
A key diagnostic for any experiment. Compare pre-treatment covariates across treatment and control groups. Report:
- Group means and standard deviations
- Difference and its p-value (or standardized difference)
- An F-test for joint significance of all covariates predicting treatment
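The joint F-test in the last item can be sketched in Python with statsmodels, paralleling the R balance check above (synthetic data; covariate names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),      # assignment, independent of covariates
    "age": rng.normal(40, 10, size=n),
    "female": rng.integers(0, 2, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# Regress treatment on all pre-treatment covariates at once;
# the regression F-statistic tests their joint significance.
fit = smf.ols("treatment ~ age + female + income", data=df).fit()
print(fit.f_pvalue)   # large p-value = covariates jointly fail to predict assignment
```

With genuinely random assignment, the F-test rejects at the 5% level only 5% of the time, so an occasional "failed" balance test in a real experiment is not by itself evidence of broken randomization.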
Attrition Checks
Attrition (people dropping out of the study) is only a problem if it is differential — if treatment causes people to leave the sample at different rates. Check:
- Is the attrition rate similar across treatment and control?
- Among non-attritors, is balance still maintained?
- Consider Lee bounds for worst-case scenarios (Lee, 2009).
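A rough sketch of the Lee (2009) trimming idea, assuming the treatment arm retains more units than control (this toy function is illustrative only; a real analysis should use a vetted implementation):

```python
import numpy as np

def lee_bounds(y_treat, y_ctrl, keep_treat, keep_ctrl):
    """Worst-case bounds on the treatment-control mean difference.
    y_treat, y_ctrl: outcomes for units still observed in each arm.
    keep_treat, keep_ctrl: retention rates; assumes keep_treat >= keep_ctrl."""
    p = (keep_treat - keep_ctrl) / keep_treat   # excess share of observed treated units
    y = np.sort(y_treat)
    k = int(round(p * len(y)))
    lower = y[:len(y) - k].mean() - y_ctrl.mean()   # trim top p share of treated outcomes
    upper = y[k:].mean() - y_ctrl.mean()            # trim bottom p share
    return lower, upper

# Example: treatment retains 90% of units, control only 67.5% -> trim 25%
lo, hi = lee_bounds(np.array([1., 2., 3., 4.]), np.array([0., 1., 2., 3.]),
                    keep_treat=0.90, keep_ctrl=0.675)
print(lo, hi)  # 0.5 1.5
```

When retention is equal in both arms, the trimming share is zero and the bounds collapse to the simple difference in means.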
Compliance Checks
Report the first-stage compliance rate: what fraction of the assigned-to-treatment group actually received treatment? A first-stage below 100% means you need to decide between ITT and LATE.
Interpreting Your Results
- The ITT is a policy-relevant parameter: it tells you what happens when you roll out an intervention in practice, including non-compliance.
- The LATE tells you what the treatment does for people who actually take it up, but it only applies to compliers.
- If compliance is near 100%, ITT and LATE are approximately the same.
- Report the ITT as your primary estimate; present the LATE as a complement, not a replacement.
G. What Can Go Wrong
| Threat | What It Does | How to Diagnose |
|---|---|---|
| Non-compliance | Creates a gap between assignment and receipt | Report compliance rates; use LATE/IV |
| Attrition | Breaks random assignment if differential | Compare attrition rates; Lee bounds |
| Spillovers (SUTVA violation) | Treatment affects control group outcomes | Look for evidence of contamination; use designs that minimize contact |
| Hawthorne effects | Subjects change behavior because they know they are observed | Use double-blind designs; compare to administrative data |
| Demand effects | Subjects figure out the hypothesis and behave accordingly | Careful framing; use deception where ethical |
| Low power | Fail to detect real effects | Pre-registration with power analysis |
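The power analysis in the table's last row can be sketched with statsmodels' power module; for example, the per-arm sample size needed to detect an assumed 0.2 SD effect in a two-group comparison:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per arm for a two-sided t-test:
# effect_size in SD units, 80% power, 5% significance level
n_per_arm = TTestIndPower().solve_power(effect_size=0.2, power=0.8, alpha=0.05)
print(round(n_per_arm))
```

Small effects are expensive: a 0.2 SD effect needs roughly 400 units per arm, while a 0.5 SD effect needs only about 64. Run this before the experiment, and pre-register the target sample size.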
Differential Attrition
Attrition is 8% in treatment and 9% in control (no significant difference), and balance is maintained among non-attritors.
ITT estimate: -0.05 ER visits (SE = 0.02). Lee bounds: [-0.08, -0.02]. Attrition does not threaten internal validity.
Non-Compliance Ignored in Analysis
Compliance is 25%. ITT is reported as the primary estimate; LATE is computed via IV using assignment as an instrument for take-up.
ITT = -0.05 ER visits. LATE (for compliers) = -0.20. Both estimates are clearly labeled and interpreted.
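The arithmetic linking the two estimates is just the Wald ratio; a one-line check (numbers taken from the scenario above):

```python
itt = -0.05          # effect of assignment on ER visits
compliance = 0.25    # first stage: share of winners who enrolled
late = itt / compliance
print(late)  # -0.2
```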
SUTVA Violation (Spillovers)
Treatment and control groups are in separate villages with no interaction, so one group's treatment does not affect the other's outcomes.
ITT = 0.15 SD improvement in test scores. No evidence of contamination between groups.
H. Practice
In a hypothetical lottery-based health insurance experiment, about 25% of lottery winners actually enrolled. If the ITT estimate on a health outcome is -0.05, what is the LATE?
A researcher runs a randomized controlled trial (RCT) but 30% of the treatment group does not take up the intervention. She drops non-compliers from the treatment group and compares the remaining treated individuals to the full control group. What is the problem?
In a cluster-randomized trial, 50 villages are assigned to treatment and 50 to control. A child in a treated village plays with untreated children from a neighboring control village, and the intervention's benefits spill over. What assumption is violated?
An experiment randomizes 500 students to tutoring (250) or control (250). After 6 months, 60 students in the treatment group and 15 in the control group have left the study. The researcher reports the ITT using only the remaining students. Should you be concerned?
A firm randomizes which customers receive a discount coupon. Customers who receive the coupon share it with their friends (who are in the control group). What is the likely effect on the ITT estimate?
You run a randomized controlled trial (RCT) of a tutoring program on test scores. 200 students are randomly assigned: 100 to tutoring, 100 to control. Of the 100 assigned to tutoring, 80 actually attend. The average test score in the treatment group (all 100) is 78 and in the control group is 72. Calculate the ITT, the first-stage compliance rate, and the LATE.
Read the analysis below carefully and identify the errors.
A health economist runs an RCT of a job training program on employment outcomes. 500 individuals are randomized: 250 to training, 250 to control. After 12 months, 40 participants in the treatment group and 10 in the control group have dropped out of the study. The researcher analyzes only the remaining participants and reports:
"The training program increased employment by 12 percentage points (p = 0.003). Because treatment was randomly assigned, this coefficient is a causal estimate free from selection bias. We find no evidence that attrition is a concern because our sample size remains large (N = 450)."
Select all errors you can find:
Read the analysis below carefully and identify the errors.
A development economist evaluates a conditional cash transfer (CCT) program. Villages are randomly assigned to treatment (receive CCT) or control. The researcher finds that treated villages have 15% higher school enrollment. They then want to estimate the effect on test scores, but test scores are only available for enrolled students. They report:
"Among enrolled students, treated villages score 2 points higher on standardized tests (p = 0.04). Combined with the enrollment effect, the CCT program improves both access to and quality of education."
Select all errors you can find:
Read the paper summary below and write a brief referee critique (2-3 sentences) of the identification strategy.
Paper Summary
The authors study whether providing information about calorie content at restaurants reduces calorie consumption. They randomize 80 restaurants in a large city: 40 display prominent calorie labels on menus, 40 serve as controls. After 6 months, they survey customers exiting each restaurant about their meal choices. They find that calorie labeling reduces average calories ordered by 45 kcal (SE = 18, p = 0.013). The first stage shows 95% compliance (38 of 40 treatment restaurants displayed labels). They report only the ITT.
Key Table
| Variable | Coefficient | SE | p-value |
|---|---|---|---|
| Assigned to labeling | -45.2 | 18.1 | 0.013 |
| Customer age | 2.1 | 0.8 | 0.009 |
| Customer female | -82.3 | 15.4 | 0.000 |
| Weekend visit | 67.8 | 14.2 | 0.000 |
| Restaurant FE | No | | |
| Clustered SEs | Restaurant | | |
| N (customers) | 12,400 | | |
Authors' Identification Claim
Random assignment of calorie labeling across restaurants ensures that the treatment and control groups are comparable in expectation, yielding an unbiased estimate of the effect of calorie information on ordering behavior.
I. Swap-In: When to Use Something Else
If randomization is infeasible (ethical constraints, cost, or lack of control), the closest alternatives are:
- Natural experiments — situations where nature or policy creates as-if random assignment. See IV / 2SLS and Regression Discontinuity.
- Matching — construct a comparison group that looks similar on observables.
- Difference-in-differences — exploit a policy change that affects some groups but not others.
For any of these approaches, sensitivity analysis is important for assessing how robust your conclusions are to potential violations of identifying assumptions. The further you move from randomization, the more assumptions you need, and the less credible your causal claims become. But a well-designed quasi-experiment often beats a poorly executed RCT.
J. Reviewer Checklist
Critical Reading Checklist
Paper Library
Foundational (7)
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables.
Angrist, Imbens, and Rubin formalize the LATE framework — originally introduced in Imbens and Angrist (1994) — within the Rubin Causal Model, providing a detailed treatment of the assumptions required for causal interpretation of IV estimates. This paper introduces the complier taxonomy (always-takers, never-takers, compliers, defiers) that is now standard in the IV literature. The practical implication is that IV estimates should be interpreted as local to the complier subpopulation, not as average effects for the entire population.
Athey, S., & Imbens, G. W. (2017). The Econometrics of Randomized Experiments.
Athey and Imbens provide a modern, rigorous treatment of the econometrics behind randomized experiments. They cover design, analysis, and inference issues such as stratification, clustering, and multiple hypothesis testing. It is an excellent reference for researchers running field experiments.
Bruhn, M., & McKenzie, D. (2009). In Pursuit of Balance: Randomization in Practice in Development Field Experiments.
Bruhn and McKenzie compare different randomization methods—simple, stratified, and pairwise—in practice and show that stratified randomization substantially improves balance on baseline covariates and increases statistical power. They provide practical recommendations for choosing among randomization procedures in field experiments.
Dunning, T. (2012). Natural Experiments in the Social Sciences: A Design-Based Approach.
Dunning provides a systematic framework for identifying and analyzing natural experiments across the social sciences. The book covers as-if random assignment, instrumental variables, regression discontinuity, and difference-in-differences through a unified design-based lens, making it essential reading for researchers exploiting natural variation for causal inference.
Fisher, R. A. (1935). The Design of Experiments.
Fisher's classic book lays the foundations of experimental design, introducing concepts like randomization, blocking, and factorial designs. The 'lady tasting tea' example from this book remains one of the most famous illustrations of hypothesis testing and the logic of controlled experiments.
Harrison, G. W., & List, J. A. (2004). Field Experiments.
Harrison and List provide an influential taxonomy of field experiments, distinguishing artefactual, framed, and natural field experiments from conventional lab experiments. The paper helps establish field experiments as a mainstream methodology in economics.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.
Rubin formalizes the 'potential outcomes' framework that is now central to causal inference. The idea is simple but powerful: each unit has a potential outcome under treatment and under control, and the causal effect is the difference. This paper is the origin of what is now called the Rubin Causal Model.
Application (18)
Acquisti, A., & Fong, C. M. (2020). An Experiment in Hiring Discrimination via Online Social Networks.
Acquisti and Fong conduct a correspondence experiment using social media profiles to study hiring discrimination based on religion and sexual orientation. They find no significant national-level discrimination against Muslim or gay candidates, but significant anti-Muslim discrimination emerges in Republican-leaning areas. The paper illustrates how online information creates new channels for employment discrimination that vary with local attitudes.
Bandiera, O., Barankay, I., & Rasul, I. (2005). Social Preferences and the Response to Incentives: Evidence from Personnel Data.
Bandiera, Barankay, and Rasul use a field experiment in a fruit-picking firm to study how switching from relative to piece-rate pay affects productivity. They demonstrate that social preferences among workers matter for incentive design, bridging experimental economics and management.
Banerjee, A., Duflo, E., Goldberg, N., Karlan, D., Osei, R., Pariente, W., Shapiro, J., Thuysbaert, B., & Udry, C. (2015). A Multifaceted Program Causes Lasting Progress for the Very Poor: Evidence from Six Countries.
Banerjee, Duflo, and colleagues conduct a large-scale RCT across six countries, demonstrating that a multifaceted anti-poverty program produces sustained economic gains for the ultra-poor. The study is notable for its multi-site design, which provides rare multi-country evidence on how the same intervention performs across diverse contexts. It demonstrates both the power of randomized evaluation at scale and the importance of bundled interventions when individual components may be insufficient.
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.
Bertrand and Mullainathan send fictitious resumes with randomly assigned names to employers and find that 'white-sounding' names receive 50% more callbacks in this famous audit study. It is one of the most widely cited field experiments in social science and a powerful example of how randomization can identify discrimination.
Bloom, N., Liang, J., Roberts, J., & Ying, Z. J. (2015). Does Working from Home Work? Evidence from a Chinese Experiment.
Bloom and colleagues conduct a large-scale randomized experiment at a Chinese travel agency, finding that working from home leads to a 13% performance increase. The study becomes a landmark reference in management and labor economics for its clean experimental design applied to a practical workplace question.
Camuffo, A., Cordova, A., Gambardella, A., & Spina, C. (2020). A Scientific Approach to Entrepreneurial Decision Making: Evidence from a Randomized Control Trial.
Camuffo and colleagues conduct a randomized controlled trial with 116 Italian startups, randomly assigning half to receive training in a 'scientific' approach to entrepreneurial decision-making (formulating and testing hypotheses before committing resources). Treated startups perform better, are more likely to pivot, and are not more likely to drop out, providing experimental evidence that structured decision-making improves entrepreneurial outcomes.
Camuffo, A., Gambardella, A., Messinese, D., Novelli, E., Paolucci, E., & Spina, C. (2024). A Scientific Approach to Entrepreneurial Decision-Making: Large-Scale Replication and Extension.
Camuffo and colleagues conduct four randomized controlled trials with 759 firms across Italy, the UK, and India, replicating and extending their earlier finding that training entrepreneurs to adopt a 'scientific' approach to decision-making improves venture performance. The multi-site, multi-country design provides strong evidence on the external validity of the original RCT findings.
Chatterji, A. K., Findley, M., Jensen, N. M., Meier, S., & Nielson, D. (2016). Field Experiments in Strategy Research.
Chatterji, Findley, Jensen, Meier, and Nielson make the case for using field experiments in strategy research and provide practical guidance for doing so. They discuss internal validity, external validity, and ethical considerations specific to strategy scholars.
Crepon, B., Duflo, E., Gurgand, M., Rathelot, R., & Zamora, P. (2013). Do Labor Market Policies Have Displacement Effects? Evidence from a Clustered Randomized Experiment.
Crepon and colleagues evaluate a job placement assistance program in France using a two-step clustered randomization design that varies treatment intensity across 235 labor markets. The paper's key contribution is identifying displacement effects: treated job seekers gain at the expense of untreated competitors, particularly in weak labor markets and among workers with similar skills. This innovative experimental design allows estimation of both direct and indirect (general equilibrium) effects of active labor market policies.
Finkelstein, A., Taubman, S., Wright, B., Bernstein, M., Gruber, J., Newhouse, J. P., Allen, H., Baicker, K., & The Oregon Health Study Group (2012). The Oregon Health Insurance Experiment: Evidence from the First Year.
Finkelstein and colleagues analyze the Oregon Health Insurance Experiment, in which uninsured low-income adults are selected by lottery for the chance to apply for Medicaid. Using this randomized controlled design with IV to handle noncompliance, they estimate the local average treatment effect of Medicaid coverage on health care utilization, financial strain, and self-reported health. The study demonstrates the practical difference between intent-to-treat and LATE estimates in a real-world experiment where not all lottery winners enrolled.
Friebel, G., Heinz, M., & Zubanov, N. (2022). Middle Managers, Personnel Turnover, and Performance: A Long-Term Field Experiment in a Retail Chain.
Friebel, Heinz, and Zubanov conduct a long-term randomized field experiment in a large Eastern European retail chain, in which the CEO asked treated store managers to reduce employee quit rates. The intervention decreased the quit rate by a fifth to a quarter, lasting nine months before petering out, but reappearing after a reminder. However, there is no treatment effect on sales, illustrating that reducing turnover does not automatically translate into improved store performance.
Gornall, W., & Strebulaev, I. A. (2025). Gender, Race, and Entrepreneurship: A Randomized Field Experiment on Venture Capitalists and Angels.
Gornall and Strebulaev conduct a large-scale correspondence experiment, sending approximately 80,000 pitch emails from fictitious startups to 28,000 venture capitalists and angel investors. By randomly varying the entrepreneur's name to signal gender and race, they find that female entrepreneurs received 9% more interested replies and Asian-surname entrepreneurs received 6% more responses than White-surname entrepreneurs, indicating favorable rather than adverse bias. The paper provides large-scale experimental evidence on investor response patterns by entrepreneur demographics in entrepreneurial finance.
Grant, A. M. (2008). The Significance of Task Significance: Job Performance Effects, Relational Mechanisms, and Boundary Conditions.
Grant conducts field experiments showing that briefly exposing workers to the beneficiaries of their work significantly increased their motivation and performance. This paper is a well-known example of experimental design applied within organizational behavior research.
Hoogendoorn, S., Parker, S. C., & van Praag, M. (2017). Smart or Diverse Start-up Teams? Evidence from a Field Experiment.
Hoogendoorn, Parker, and van Praag conduct a field experiment with 573 students randomly assigned to 49 startup teams that varied in cognitive ability dispersion. They find an inverted U-shaped relationship between ability dispersion and team performance, with moderately diverse teams in ability outperforming both homogeneous and highly dispersed teams. The random assignment to teams ensures that ability composition is exogenous, providing clean experimental identification of the effect of team cognitive diversity on venture performance.
Hurst, R., Lee, S., & Frake, J. (2024). The Effect of Flatter Hierarchy on Applicant Pool Gender Diversity: Evidence from Experiments.
Hurst, Lee, and Frake conduct a reverse audit study in partnership with a U.S. healthcare startup, sending recruitment emails to approximately 8,400 job seekers with randomly varied descriptions of the firm's organizational hierarchy. Featuring a flatter hierarchy did not significantly affect applicant pool size but significantly decreased women's representation, because women perceived flatter structures as offering fewer career advancement opportunities and greater workload burdens.
Jia, N., Luo, X., Fang, Z., & Liao, C. (2024). When and How Artificial Intelligence Augments Employee Creativity.
Jia, Luo, Fang, and Liao conduct a field experiment examining how AI assistance affects creative work through a sequential division of labor. They find that AI augmentation improves average output quality but reduces the novelty of top-performing work, with effects moderated by employee skill level. The paper provides causal evidence on the productivity implications of human-AI collaboration in knowledge work.
Kang, S. K., DeCelles, K. A., Tilcsik, A., & Jun, S. (2016). Whitened Résumés: Race and Self-Presentation in the Labor Market.
Kang and colleagues conduct a résumé audit study sending fictitious applications to real employers, finding that minority applicants who 'whitened' their résumés received significantly more callbacks. The study combines a correspondence experiment with qualitative interviews, providing a powerful example of how audit studies can identify discrimination in hiring.
Pongeluppe, L. S. (2024). The Allegory of the Favela: The Multifaceted Effects of Socioeconomic Mobility.
Pongeluppe conducts a randomized controlled trial of a business training program offered to residents of Brazilian favelas, complementing the experiment with quantile regressions, field visits, and interviews. The results show that training improves economic outcomes such as income and entrepreneurship participation, but also intensifies participants' experiences of favela-related stigma, revealing that socioeconomic mobility can simultaneously generate material benefits and psychosocial costs.
Survey (5)
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion.
Angrist and Pischke write one of the most influential modern textbooks on applied econometrics, organizing the field around a design-based approach to causal inference. The book provides essential treatments of instrumental variables, difference-in-differences, and regression discontinuity, each grounded in the potential outcomes framework. It remains the standard reference for graduate students learning to evaluate and implement identification strategies.
Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit.
Duflo, Glennerster, and Kremer write a comprehensive practical guide to running randomized experiments in development economics. The chapter covers all stages from design to analysis, including power calculations, stratification, dealing with attrition, and estimating treatment effects with imperfect compliance. It has become required reading for anyone designing a field experiment.
Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation.
Gerber and Green write a comprehensive textbook on field experiments covering randomization, blocking, clustering, noncompliance, and attrition. The book provides rigorous treatment of experimental design principles with practical guidance drawn from political science and public policy applications. It is particularly valuable for its coverage of complications that arise in real-world experiments, including how to handle noncompliance through intent-to-treat analysis and instrumental variables.
List, J. A., Sadoff, S., & Wagner, M. (2011). So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design.
List, Sadoff, and Wagner provide rules of thumb for sample size, treatment assignment, and other design decisions in field experiments in this practical guide. It is a useful starting point for researchers planning their first experiment.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Wooldridge's graduate textbook is the standard reference for cross-section and panel data econometrics. Chapters 10-11 provide a thorough treatment of fixed effects, random effects, and related panel data methods, while later chapters cover general estimation methodology (MLE, GMM, M-estimation) with panel data applications throughout. The book covers both linear and nonlinear models with careful attention to assumptions.