External Validity and Generalization

When and how to generalize causal estimates beyond the study population. PATE vs SATE vs LATE, site-selection bias, reweighting for target populations, and the limits of extrapolation.

The Generalization Problem

You have a credible causal estimate -- strong internal validity, tight confidence intervals, a plausible identification strategy. But does the estimate apply to the population you actually care about? That is the question of external validity, and it is arguably the most neglected dimension of empirical work.

Internal validity asks whether the estimated effect is causal within the study. External validity asks whether that effect generalizes -- to other populations, settings, time periods, or treatment implementations. A randomized trial in Oregon may tell you precisely what Medicaid expansion did for low-income Oregonians in 2008, but it does not automatically tell you what Medicaid expansion would do in Texas in 2026.

The tension is real: the designs that provide the strongest internal validity (randomized experiments, sharp regression discontinuities, narrow natural experiments) often have the weakest claims to external validity because they exploit highly specific variation. Meanwhile, designs that draw on broader populations (large-scale observational studies) face greater threats to internal validity. There is no free lunch.

This guide provides a framework for thinking carefully about what your estimate generalizes to, when generalization is defensible, and what tools exist to extend causal findings beyond the original study population.

PATE vs SATE vs LATE

The first step in any external validity discussion is to be precise about the estimand -- the causal quantity your design actually identifies. Three estimands dominate the literature, and they differ fundamentally in the population they describe.

Population Average Treatment Effect (PATE)

The PATE is the average treatment effect across the entire population of interest:

\text{PATE} = E[Y_i(1) - Y_i(0)]

where the expectation is taken over the full population. This estimand is usually what policymakers want: if we roll out this program to everyone, what is the average effect?

A well-designed experiment with a random sample from the target population identifies the PATE directly. In practice, this is rare. Most experiments recruit convenience samples, draw from specific geographic areas, or study populations that are not representative of the policy-relevant target.

Sample Average Treatment Effect (SATE)

The SATE is the average treatment effect for the specific units in the study:

\text{SATE} = \frac{1}{n} \sum_{i=1}^{n} [Y_i(1) - Y_i(0)]

A randomized experiment with perfect compliance identifies the SATE by design, regardless of how the sample was selected. The question is whether the SATE equals the PATE -- and it does only if the sample is representative of the target population, or if treatment effects are homogeneous.

The gap between the SATE and the PATE is a function of two things: (a) how different the study sample is from the population on characteristics that moderate the treatment effect, and (b) how much the treatment effect actually varies across units. If either is zero -- the sample is perfectly representative, or effects are homogeneous -- the SATE equals the PATE. In practice, neither condition holds exactly, and the question is how much the gap matters.
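
To see the gap concretely, here is a minimal simulation (all numbers are illustrative) in which recruitment favors units with a covariate that also moderates the treatment effect, so the SATE overstates the PATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population of interest: a covariate x moderates the treatment effect.
N = 100_000
x = rng.normal(0.0, 1.0, N)      # effect modifier
tau = 1.0 + 0.5 * x              # heterogeneous unit-level treatment effects
pate = tau.mean()                # population average effect (~1.0 by construction)

# Non-representative recruitment: high-x units are more likely to enroll.
p_enroll = 1.0 / (1.0 + np.exp(-(x - 1.0)))
in_study = rng.random(N) < p_enroll
sate = tau[in_study].mean()      # average effect among study participants

print(f"PATE: {pate:.3f}  SATE: {sate:.3f}  gap: {sate - pate:.3f}")
# If tau were constant (homogeneous effects), the same selection mechanism
# would produce no gap: selection matters only when it correlates with
# effect heterogeneity.
```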

Local Average Treatment Effect (LATE)

The LATE, identified by instrumental variables, is the average treatment effect for compliers -- units whose treatment status is shifted by the instrument:

\text{LATE} = E[Y_i(1) - Y_i(0) \mid \text{complier}]

The complier subpopulation is instrument-specific and generally unobservable. A draft-lottery IV identifies the LATE for men who served because they were drafted but would not have volunteered. A financial aid IV identifies the LATE for students who enrolled because they received aid but would not have enrolled otherwise. These are different subpopulations, and their treatment effects can differ.
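
For reference, under the standard IV assumptions the LATE is estimated by the Wald ratio: the reduced-form effect of the instrument on the outcome divided by the first-stage effect of the instrument on treatment. A minimal simulated sketch (the compliance shares and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

z = rng.integers(0, 2, n)                            # binary instrument (e.g., lottery offer)
is_complier = rng.random(n) < 0.4                    # 40% compliers
is_always = (~is_complier) & (rng.random(n) < 0.3)   # some always-takers, no defiers
d = np.where(is_complier, z, is_always.astype(int))  # treatment take-up (monotone in z)

# Complier effect is 2.0; always-taker effect is 0.5; never-takers untreated.
y = 2.0 * d * is_complier + 0.5 * d * is_always + rng.normal(0, 1, n)

# Wald ratio: reduced form over first stage.
late_hat = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"Wald LATE estimate: {late_hat:.2f}  (true complier effect: 2.0)")
```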

Comparison Table

| Estimand | Population | Identified by | Key assumption for generalization |
|---|---|---|---|
| PATE | Entire target population | Random sample + random assignment | Sample is representative of the target |
| SATE | Study participants only | Random assignment within sample | None (but may not generalize beyond sample) |
| LATE | Compliers (instrument-specific) | Instrument + monotonicity | Complier effects representative of broader effects |
| CATE | Conditional subgroup | Heterogeneity analysis | Subgroup is policy-relevant |
| ATT | Treated units | Selection-on-observables, DiD | Treated units are the policy target |

Site-Selection Bias

Site-selection bias arises when the sites or contexts chosen for a study are systematically unrepresentative of the broader population of sites where a policy might be implemented. It is one of the most important and least discussed threats to external validity.

Why It Occurs

Studies are rarely conducted in randomly selected locations. Researchers study places where data are available, where natural experiments occur, where cooperating agencies operate, or where programs were first adopted. Each of these selection mechanisms can introduce bias:

  • Early adopters are different. States or districts that adopt a policy early may have stronger institutional capacity, more political will, or populations that benefit more from the intervention. Estimating effects from early adopters and extrapolating to later adopters (or to populations that resist adoption) can be misleading.

  • Natural experiments occur in specific places. The settings where a policy discontinuity, a court ruling, or a natural disaster creates useful variation are not randomly selected from the universe of possible settings. The effect of a minimum wage increase estimated from a border county in New Jersey may not apply to rural Mississippi.

  • Cooperating sites are positively selected. In multi-site trials, the sites that agree to participate may be those with the most motivated staff, the most resources, or the most favorable conditions for the intervention to succeed.

  • Data availability is non-random. High-quality administrative data exist in some countries, states, or institutions but not others. The Scandinavian registry studies that dominate some literatures reflect a very specific institutional context.

Allcott (2015): The Canonical Example

Allcott's study of the Opower energy conservation program provides a vivid example. Opower initially partnered with utilities that were the most enthusiastic about conservation -- utilities in liberal-leaning areas with environmentally conscious customers. The early treatment effects were large. But as the program scaled to less enthusiastic utilities, treatment effects shrank substantially. The initial estimates overstated the effect that would obtain at scale by roughly 50%.

This pattern -- declining effects as a program scales from enthusiastic early adopters to reluctant later adopters -- is likely widespread but rarely documented because most interventions are studied only in their initial implementation.

Diagnosing Site-Selection Bias

You cannot eliminate site-selection bias through econometric technique alone, but you can assess its likely severity:

  1. Compare study sites to the target population. Tabulate observable characteristics (demographics, institutional features, baseline outcomes) of study sites versus the full population of potential sites. Large differences are a warning sign (see the sketch after this list).

  2. Examine treatment effect heterogeneity. If effects vary substantially across sites within the study, they likely also vary across the boundary between study sites and non-study sites.

  3. Look for dose-response in adoption timing. If early-adopting sites show larger effects than later-adopting sites, this suggests the full-scale effect would be smaller than the initial estimate.

  4. Consider the selection mechanism. Ask why these particular sites were studied. Is it because they had favorable conditions for the intervention? Because the intervention was first implemented where it was expected to work best?
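
For the first diagnostic, a simple implementation is to compute standardized mean differences between study sites and the full population of potential sites. A sketch, assuming site characteristics live in two pandas DataFrames (the function name and the 0.25 threshold are conventions, not requirements):

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(study: pd.DataFrame, target: pd.DataFrame) -> pd.Series:
    """SMD per shared numeric covariate; |SMD| > 0.25 is a common warning flag."""
    cols = study.columns.intersection(target.columns)
    diff = study[cols].mean() - target[cols].mean()
    pooled_sd = np.sqrt((study[cols].var() + target[cols].var()) / 2.0)
    return (diff / pooled_sd).sort_values(key=np.abs, ascending=False)

# Hypothetical usage: each row is a site, each column a site characteristic.
# smd = standardized_mean_diff(study_sites_df, potential_sites_df)
# print(smd[smd.abs() > 0.25])   # covariates on which study sites look unusual
```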

A Taxonomy of Selection Mechanisms

| Selection mechanism | Example | Likely direction of bias |
|---|---|---|
| Early adopter | States that first expanded Medicaid | Overstates effects (strongest capacity, most political support) |
| Cooperating site | Schools that agreed to participate in RCT | Overstates effects (most motivated, best implementation) |
| Data availability | Scandinavian registry data | Direction unclear; reflects specific institutional context |
| Natural experiment location | Counties near state borders for minimum wage studies | Direction unclear; border counties may differ from interior |
| Crisis-driven variation | Financial crisis as a shock | May overstate or understate; crisis conditions differ from normal |

The key insight is that site-selection bias is not always upward. In some cases, studies are conducted in settings where the intervention faces unusually harsh conditions (e.g., a program tested during a recession, when labor markets are slack). The direction depends on the specific selection mechanism.

Reweighting for External Validity

When the study sample differs from the target population on observable characteristics, you can reweight the study sample to match the target population -- just as inverse probability weighting reweights treated and control groups for internal validity.

The Core Idea

Suppose you have experimental estimates from a non-representative sample. If you know (a) the distribution of effect-modifying covariates in the study sample and (b) the distribution of those same covariates in the target population, you can reweight the study estimates to approximate the PATE.

Formally, let S = 1 indicate membership in the study sample and S = 0 membership in the broader target population. The reweighting estimator constructs weights:

w_i = \frac{P(S = 0 \mid X_i)}{P(S = 1 \mid X_i)}

These weights upweight study observations that look like the target population and downweight those that are overrepresented relative to the target.
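
A minimal sketch of this estimator, assuming an experimental study sample and a covariate matrix for the target population (the function names are illustrative, and scikit-learn's logistic regression stands in for any model of P(S = 1 | X)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def generalization_weights(X_study, X_target):
    """Odds weights P(S=0 | X) / P(S=1 | X) for each study unit."""
    X = np.vstack([X_study, X_target])
    s = np.r_[np.ones(len(X_study)), np.zeros(len(X_target))]
    p = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X_study)[:, 1]
    return (1.0 - p) / p

def reweighted_effect(y, d, w):
    """Weighted difference in means within the randomized study sample."""
    return (np.average(y[d == 1], weights=w[d == 1])
            - np.average(y[d == 0], weights=w[d == 0]))

# Hypothetical usage:
# w = generalization_weights(X_study, X_target)
# pate_hat = reweighted_effect(y_study, d_study, w)
```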

When Reweighting Works

Reweighting requires a strong assumption: treatment effect heterogeneity depends only on observables. That is, conditional on the covariates X used for reweighting, treatment effects are the same in the study population and the target population:

E[Y(1) - Y(0) \mid X, S = 1] = E[Y(1) - Y(0) \mid X, S = 0]

This condition is sometimes called the generalizability or transportability assumption. It fails when unobserved factors that vary across populations also modify treatment effects. For example, if the study was conducted in a high-capacity school district and treatment effects depend on institutional capacity (which you cannot observe or measure well), reweighting on student demographics will not fix the problem.

When Reweighting Fails

  • Unobserved effect modifiers. If treatment effects depend on variables you cannot measure, no amount of reweighting on observables will close the gap.
  • No overlap. If the target population includes covariate values not present in the study sample (regions of the covariate space with no study participants), extrapolation is required and reweighting breaks down.
  • Extreme weights. When the study sample and target population are very different, some observations receive very large weights, inflating variance and making the reweighted estimate unstable. This instability is directly analogous to the extreme-weight problem in propensity score methods. A quick diagnostic appears below.
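
One quick diagnostic for the extreme-weight problem is the Kish effective sample size. A sketch (trimming at the 99th percentile is an ad hoc convention, not a recommendation):

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size; values far below len(w) signal instability."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Hypothetical usage, continuing from the weights above:
# print(f"n = {len(w)}, ESS = {effective_sample_size(w):.0f}")
# An ad hoc response is to trim: w = np.minimum(w, np.quantile(w, 0.99))
```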

Key Papers on Reweighting for Generalization

Several recent papers have formalized the reweighting approach:

  • Stuart et al. (2011) develop methods for generalizing randomized trial results to a target population using propensity scores for study participation. They show that the method works well when the study sample and target population overlap substantially but produces large variance when they do not.

  • Buchanan et al. (2018) extend these methods to settings where the target population is defined by a survey (e.g., the American Community Survey) and the study population is an RCT sample. They provide practical guidance on selecting covariates for the reweighting model.

  • Dahabreh et al. (2020) develop a comprehensive framework for transporting causal inferences from a randomized trial to a new target population, with formal conditions for identification and semiparametric efficient estimators.

The common thread is that reweighting is only as good as the assumption that the covariates used for reweighting capture all relevant effect modification. This assumption is fundamentally untestable, just as the unconfoundedness assumption in matching is untestable.

When LATE Is Policy-Relevant

A common critique of IV estimation is that the LATE -- the effect for compliers -- is not "policy-relevant" because it applies to a subpopulation that is defined by the instrument and cannot be directly identified. Angrist and others have pushed back on this view, arguing that in many settings the LATE is precisely the effect a policymaker needs.

The Angrist Argument

The core argument runs as follows: when a policy is the instrument, the LATE for that policy is exactly the treatment effect for the people whose behavior the policy actually changes. Consider these examples:

  • Draft lottery and military service. The LATE from the draft-lottery IV is the effect of military service for men who were induced to serve by the draft. If a policymaker is considering reinstating the draft, this is the relevant subpopulation -- those who would serve under compulsion but not voluntarily.

  • Financial aid and college enrollment. The LATE from a financial aid IV is the effect of college for students who enroll because of aid but would not enroll otherwise. If a policymaker is considering expanding financial aid, these are exactly the students who would be affected.

  • Compulsory schooling and education. The LATE from compulsory schooling laws is the effect of additional education for people who stay in school only because the law compels them. If a policymaker is considering raising the compulsory schooling age, this is the relevant margin.

In each case, the LATE answers the question: what happens to the people whose behavior the policy changes? This estimand is often more relevant than the PATE, which averages over people who would take the treatment regardless (always-takers) and people who would never take it (never-takers).

When LATE Is Not Enough

The Angrist argument is compelling when the proposed policy closely mirrors the instrument. It breaks down when:

  • The policy differs from the instrument. A draft-lottery LATE tells you about conscription, not about the effect of military service for people who enlist in response to better pay or benefits. Different policy levers induce different complier populations, and their treatment effects can differ.

  • Scaling changes the effect. Even if the LATE correctly captures the effect for the marginal individual, the effect may change when the policy is implemented at scale. General equilibrium effects (e.g., flooding the labor market with newly educated workers) can alter the returns to treatment.

  • You need the effect for a different subpopulation. If the policy target is a different group than the compliers (e.g., you want to know the effect for always-takers, or for a population that was not exposed to the instrument), the LATE is not informative without additional assumptions.

Marginal Treatment Effects (MTE)

The MTE framework, developed by Heckman and Vytlacil, provides a bridge between LATE and more general estimands. The MTE defines the treatment effect as a function of the unobserved propensity to select into treatment:

\text{MTE}(x, u_D) = E[Y(1) - Y(0) \mid X = x, U_D = u_D]

where U_D is the unobserved resistance to treatment (higher values mean less likely to be treated). Different instruments identify different portions of the MTE curve -- they weight different points along the U_D distribution. The LATE is a weighted average of MTEs over the complier range.

If you can trace out enough of the MTE curve (using multiple instruments or a continuous instrument with wide support), you can in principle recover the ATE, ATT, or treatment effects for any policy-relevant subpopulation by integrating the MTE with appropriate weights. This recovery is demanding in practice but provides a conceptual framework for understanding how different estimands relate to each other.
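
A toy illustration of this integration logic, with a made-up MTE curve (the functional form and the instrument's propensity range are purely illustrative):

```python
import numpy as np

def mte(u_d):
    """Hypothetical MTE curve: effects decline in the resistance to treatment."""
    return 2.0 - 1.5 * u_d

u = np.linspace(0.0, 1.0, 100_001)   # fine grid over the U_D distribution

# ATE: average the MTE over all of [0, 1] (a uniform-grid integral).
ate = mte(u).mean()

# An instrument that moves the treatment propensity from 0.2 to 0.6
# identifies the LATE for compliers with U_D in [0.2, 0.6]:
compliers = (u >= 0.2) & (u <= 0.6)
late = mte(u[compliers]).mean()

print(f"ATE = {ate:.2f}, LATE for U_D in [0.2, 0.6] = {late:.2f}")
```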

The key insight from the MTE framework is that different policy instruments -- even those targeting the same treatment -- can produce different average effects because they move different people into treatment. A draft-lottery IV and a patriotic-appeal IV would identify different LATEs even though both instrument for military service, because they induce different subpopulations to serve. Understanding this point is essential for interpreting and comparing IV estimates across studies.

Concept Check

A state government is considering mandating health insurance coverage. A researcher estimates the effect of health insurance on health outcomes using Medicaid lottery data. The IV estimate (LATE) shows a positive but modest effect. A critic argues that the LATE is uninformative for policy because it only applies to the complier subpopulation. How should you evaluate this critique?

Multi-Site Experiments and Systematic Replication

The most direct approach to external validity is to run the same study in multiple contexts. Multi-site experiments and systematic replications address site-selection bias head-on by varying the settings in which an intervention is tested.

Multi-Site Randomized Trials

In a multi-site trial, the same intervention is randomly assigned within each of several sites (schools, clinics, cities). This design provides:

  • Site-specific treatment effects. You can estimate the effect at each site and examine heterogeneity directly.
  • An overall average effect that is more representative than any single-site estimate, provided the sites are reasonably diverse.
  • A variance estimate for the distribution of effects across sites, which quantifies the uncertainty about generalization.

The key question in multi-site trials is how the sites were selected. If sites were purposively chosen (e.g., the most cooperative schools), the multi-site average is still subject to site-selection bias -- just less so than a single-site study.

Cross-Study Heterogeneity

When multiple independent studies estimate the effect of the same treatment, comparing their results provides evidence on external validity. Consistent effects across diverse settings strengthen the case for generalizability. Widely varying effects suggest that context matters and extrapolation is risky.

Meta-analysis formalizes this comparison, but a simple and underrated approach is to tabulate the study populations, designs, and effect sizes side by side and ask: do the effects vary in ways that correlate with observable features of the study context?

The Replication Crisis and External Validity

The replication crisis in social science has reinforced the importance of external validity. Many celebrated findings failed to replicate not because the original study was fraudulent or poorly designed, but because the effect was context-dependent and did not generalize beyond the original sample and setting. This pattern is precisely an external validity failure.

Systematic replication -- pre-registered replications in new populations and settings -- is the most credible way to assess whether a finding generalizes. Single studies, no matter how well-designed, provide limited evidence about external validity.

Bayesian Approaches to Generalization

An alternative to frequentist reweighting is a Bayesian hierarchical model that treats site-specific effects as draws from a distribution. This approach:

  • Estimates the distribution of effects across sites, not just the average.
  • Provides a predictive distribution for the effect in a new site, which naturally incorporates uncertainty about cross-site heterogeneity.
  • Can incorporate site-level covariates (population size, baseline outcome levels, implementation characteristics) as predictors of effect heterogeneity.

The Bayesian prediction for a new site is typically wider (more uncertain) than a simple average across sites, which honestly reflects the additional uncertainty involved in generalization.
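
A full Bayesian fit requires an MCMC library, but the core logic can be sketched with a DerSimonian-Laird random-effects approximation to the hierarchical normal model: the predictive interval for a new site adds the between-site variance to the uncertainty in the pooled mean (function and argument names are illustrative):

```python
import numpy as np

def new_site_prediction(effects, ses):
    """Pooled mean and 95% predictive interval for the effect in a NEW site,
    via a DerSimonian-Laird random-effects approximation."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    w = 1.0 / ses**2
    mu_fe = np.sum(w * effects) / w.sum()            # fixed-effect pooled mean
    q = np.sum(w * (effects - mu_fe) ** 2)           # Cochran's Q
    c = w.sum() - np.sum(w**2) / w.sum()
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)    # between-site variance
    w_re = 1.0 / (ses**2 + tau2)
    mu = np.sum(w_re * effects) / w_re.sum()         # random-effects pooled mean
    pred_sd = np.sqrt(tau2 + 1.0 / w_re.sum())       # new-site predictive sd
    return mu, (mu - 1.96 * pred_sd, mu + 1.96 * pred_sd)

# Hypothetical usage with site-level effect estimates and standard errors:
# mu, (lo, hi) = new_site_prediction(site_effects, site_ses)
```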

Designing for Generalizability

Several strategies can strengthen the external validity of a study at the design stage:

  1. Sample from the target population. If you want to know the effect of a job training program for all unemployed workers in a state, sample from the state's unemployment rolls rather than recruiting from a single office.

  2. Stratify across contexts. If you suspect effects vary across urban/rural settings or across demographic groups, stratify your sample to ensure representation of each context.

  3. Measure potential effect modifiers. Even if you cannot control which sites are in your study, measuring characteristics that plausibly moderate treatment effects allows ex-post heterogeneity analysis and reweighting.

  4. Pre-register heterogeneity analyses. Specify in advance which subgroups and moderators you will examine, to avoid the appearance of data mining.

  5. Consider adaptive designs. In sequential multi-site trials, use information from early sites to guide site selection and sample allocation in later sites. This can improve efficiency for estimating the treatment effect distribution.

Concept Check

A multi-site job training experiment is conducted in 12 cities. The overall average effect is a $1,500 increase in annual earnings (p = 0.003). But the site-specific effects range from -$500 to +$4,200. A policymaker asks whether the program would work in a 13th city that was not in the study. What is the most appropriate response?

Limits of Extrapolation

External validity ultimately requires extrapolation -- projecting a causal relationship beyond the conditions under which it was observed. Extrapolation is unavoidable in applied work, but it is important to be honest about its limits.

What Cannot Be Resolved by Data

Some external validity questions cannot be settled with observational or experimental data from the study context:

  • General equilibrium effects. The effect of a small-scale program may differ from the effect at scale because scaling changes market conditions, peer effects, or institutional responses. A job training program that works well for 1,000 participants may be less effective for 100,000 if the labor market cannot absorb that many newly trained workers.

  • Temporal instability. Treatment effects can change over time as institutions, norms, and populations evolve. The effect of a college degree on earnings in 1980 may not apply in 2026 because the labor market has changed.

  • Implementation fidelity. A program that works in a carefully monitored trial may fail when implemented by overburdened agencies with less training, fewer resources, and less motivation. The gap between efficacy (effect under ideal conditions) and effectiveness (effect under real-world conditions) is an external validity concern.

  • Hawthorne and experimenter effects. Study participants who know they are being observed may behave differently from people in a scaled-up program. This bias is a specific form of the efficacy-effectiveness gap.

The Role of Theory

Theory -- even informal theory -- is essential for disciplined extrapolation. Without a model of why the treatment works, you have no basis for predicting whether it will work in a new setting. Consider two cases:

  • A tutoring program increases test scores. If the mechanism is one-on-one attention from trained tutors, you might predict smaller effects in a setting where tutor quality is lower or student-tutor ratios are higher. Theory about the mechanism guides the extrapolation.

  • A tax credit increases small business formation. If the mechanism is relaxing credit constraints, you might predict larger effects in areas with less access to capital and smaller effects where credit is already abundant. The mechanism provides a basis for conditional prediction.

Without mechanistic understanding, extrapolation is pure guesswork. This limitation is one reason why mediation analysis and mechanism investigation, while difficult to do credibly, have value for external validity.

Structural vs Reduced-Form Approaches

Structural models provide one framework for extrapolation. By estimating the parameters of an economic model (preferences, technology, constraints), you can simulate counterfactuals in new settings by changing the model's inputs. This is the approach taken in industrial organization, trade, and some areas of public finance.

The limitation is that structural models require strong assumptions about functional forms and the stability of deep parameters. If these assumptions are wrong, the extrapolation is wrong.

Reduced-form causal inference takes a more agnostic approach: estimate the effect of the intervention as implemented and be transparent about the limits of generalization. The tradeoff is that reduced-form estimates are more credible for the study setting but less portable across settings.

There is no universally correct approach. The best practice is to combine credible reduced-form evidence on internal validity with explicit analysis of the assumptions required for generalization, and to be transparent about which of those assumptions are testable and which are not.

Extrapolation vs Interpolation

A useful distinction is between interpolation (predicting effects within the range of observed conditions) and extrapolation (predicting effects outside that range). Multi-site experiments that cover diverse contexts enable interpolation to new sites with similar characteristics. Extrapolation to truly novel contexts -- a different country, a different era, a fundamentally different institutional setting -- is inherently more speculative.

When presenting results, be explicit about whether your generalization claim involves interpolation or extrapolation. Interpolation within the observed range of site characteristics is defensible with appropriate uncertainty quantification. Extrapolation to settings outside the observed range requires stronger assumptions and should be flagged as such.
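
A crude first check on whether a claim involves interpolation or extrapolation is whether the new setting falls inside the covariate ranges spanned by the study sites. A sketch (marginal range checks are necessary but not sufficient, since a point can sit inside every marginal range yet outside the joint support):

```python
import pandas as pd

def covariates_outside_support(new_site: pd.Series, study_sites: pd.DataFrame) -> pd.Index:
    """Covariates on which a candidate site falls outside the min-max range
    spanned by the study sites."""
    lo, hi = study_sites.min(), study_sites.max()
    outside = (new_site < lo) | (new_site > hi)
    return new_site.index[outside]

# Hypothetical usage: rows of study_sites are the study cities,
# new_site holds the candidate city's characteristics.
# print(covariates_outside_support(city_13, study_cities))
```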

Guided Exercise

Assessing External Validity of an IV Estimate

Angrist and Evans (1998) estimate the effect of having a third child on mothers' labor supply using the sex composition of the first two children as an instrument. Parents whose first two children are the same sex are more likely to have a third child (they prefer a mixed-sex sibship). The IV estimate shows that having a third child reduces labor supply. You are asked to evaluate the external validity of this finding for modern family policy.

What estimand does this IV design identify?

Describe the complier population in one sentence.

Name one reason this estimate might not generalize to a child subsidy policy that encourages third births.

Name one reason external validity might be weaker if this study were applied to a different decade.

Decision Flowchart: Can You Generalize Your Estimate?
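
One way to make the flowchart's logic concrete is as a short decision function. The questions and their ordering below are a sketch distilled from this guide's framework, not a complete or authoritative algorithm:

```python
def can_generalize(a: dict) -> str:
    """Sketch of the decision logic implied by this guide; the questions and
    their ordering are illustrative, not a complete algorithm."""
    if not a.get("estimand_matches_policy_question"):
        return "Stop: restate the estimand (PATE/SATE/LATE/ATT) the policy needs."
    if a.get("sample_representative_of_target"):
        return "Generalize directly: the study estimate approximates the PATE."
    if not a.get("effect_modifiers_observed"):
        return "Caution: unobserved effect modifiers; reweighting cannot close the gap."
    if not a.get("target_within_study_support"):
        return "Extrapolation: outside the study's covariate support; flag the strong assumptions required."
    return "Reweight to the target population and state the transportability assumption."
```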

Summary

External validity is not a binary property -- it is a continuum that depends on how similar the target population is to the study population, how heterogeneous treatment effects are, and how much the implementation context matters. No single study can establish external validity definitively. The strongest evidence for generalizability comes from systematic replication across diverse settings.

The practical takeaways:

  1. Be precise about your estimand. State whether you are estimating a PATE, SATE, LATE, or ATT, and describe the population it applies to.

  2. Confront site-selection bias directly. Ask why your study was conducted where it was, and whether the study setting is representative of the policy-relevant context.

  3. Reweight when possible, but respect its limits. Reweighting can adjust for observable differences between study and target populations, but it cannot fix unobserved heterogeneity or extrapolation beyond the support of the data.

  4. LATE can be policy-relevant. When the policy mirrors the instrument, the LATE is the effect for the people whose behavior the policy changes. But verify that the proposed policy and the instrument induce the same complier population.

  5. Replicate in new settings. Multi-site experiments and systematic replications provide the most direct evidence on external validity. A finding that holds across diverse contexts is more credible than one observed in a single study.

  6. Be honest about what you cannot know. General equilibrium effects, temporal instability, and implementation fidelity are real concerns that cannot be resolved by econometric technique. Name them, discuss their likely direction, and let the reader judge.