Chapter 3 of 8
Selection Bias and Confounding
The single biggest threat to your research — and the reason every method on this site exists.
Let us return to our mystery.
A state government runs a job training program. You compare the earnings of people who participated to people who did not, and you find a $7,500 difference. But from the first page of this sequence, you already know the true causal effect is only $2,000. The remaining $5,500 is selection bias — the systematic difference between the kinds of people who chose training and those who did not.
On this page, we are going to dissect selection bias until you understand it thoroughly. Selection bias is not just one topic among many. Selection bias is the single biggest threat to your research. Every method you will learn on this site exists to fight it.
Who Signs Up?
Imagine you are administering this training program. You set up a sign-up table at a community center. Who walks through the door?
- People who are motivated to change their employment situation
- People who have heard about the program (more connected, more informed)
- People who can afford the time — they are not working three jobs or caring for a sick relative
- People with more education — they are more comfortable in a classroom setting
- People with higher prior earnings — they have the financial cushion to take time off for training
Now think about these characteristics. Every single one of them — motivation, social connections, time availability, education, prior earnings — also independently predicts future earnings. A motivated person with a college degree and a social network will likely earn more next year regardless of whether she takes a training course.
This pattern is the anatomy of selection bias: a variable (call it "motivation" or "ability" or "background") affects both the decision to get treated and the outcome you are measuring. It is a common cause of treatment and outcome. In the causal inference literature, such a variable is called a confounder. Later, when you learn to draw DAGs, you will be able to see exactly how confounders create spurious associations and what you need to do to block them.
Selection Into Treatment
People vary in motivation, education, and ability.
The Decomposition: What Your Estimate Actually Captures
Let us be precise about what happens when you compute the difference in average outcomes between the treated and untreated groups. As Holland (1986) showed in his foundational discussion of the "fundamental problem of causal inference," we can never observe both potential outcomes for the same unit. We can, however, decompose the naive comparison into two pieces:

$$\underbrace{E[Y \mid D = 1] - E[Y \mid D = 0]}_{\text{naive comparison}} = \underbrace{E[Y_1 - Y_0 \mid D = 1]}_{\text{ATT}} + \underbrace{E[Y_0 \mid D = 1] - E[Y_0 \mid D = 0]}_{\text{selection bias}}$$
Do not let the notation intimidate you. Here is what each piece means in plain language:
The left side is what you can compute: the average earnings of trainees minus the average earnings of non-trainees. This quantity is the $7,500 from our example.
The first term on the right is what you want: the actual causal effect of training on the people who were trained. The first term is the average treatment effect on the treated (ATT).
The second term on the right is the problem: the difference in what trainees and non-trainees would have earned even without any training. The difference is selection bias. It reflects pre-existing differences between the groups. In our example, this difference is the $5,500 that comes from trainees being more motivated, educated, and connected.
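The decomposition is easy to verify in a simulation where, unlike in real data, we can see both potential outcomes. The sketch below uses a hypothetical data-generating process in which "motivation" drives both training take-up and earnings; every parameter value is illustrative, chosen only so that the true effect is $2,000:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: "motivation" raises both the chance of enrolling
# in training and earnings. All numbers are illustrative.
motivation = rng.normal(0, 1, n)
trained = (motivation + rng.normal(0, 1, n) > 0.5).astype(int)

true_effect = 2_000
y0 = 30_000 + 5_000 * motivation + rng.normal(0, 2_000, n)  # earnings without training
y1 = y0 + true_effect                                       # earnings with training
y = np.where(trained == 1, y1, y0)                          # what we actually observe

naive = y[trained == 1].mean() - y[trained == 0].mean()
att = (y1 - y0)[trained == 1].mean()                        # causal effect on the trained
selection_bias = y0[trained == 1].mean() - y0[trained == 0].mean()

print(f"naive difference: {naive:,.0f}")
print(f"ATT:              {att:,.0f}")
print(f"selection bias:   {selection_bias:,.0f}")
# The identity  naive = ATT + selection bias  holds exactly in the sample.
```

Because the simulation generates both potential outcomes, we can compute the counterfactual quantity (what trainees would have earned without training) that is invisible in real data, and confirm that the naive difference splits exactly into the ATT plus the selection bias.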
Feeling It: The Selection Bias Simulator
Reading about selection bias is one thing. Seeing it is another. The simulation below generates data from a world where you know the truth. You control how strongly a confounder (think: "motivation") affects both the probability of treatment and the outcome.
Selection Bias Decomposition
Adjust the degree of selection and watch the naive estimate diverge from the true treatment effect. The gap between the two lines IS the selection bias. Notice: even a moderate amount of selection can make your estimate wildly wrong.
Try the following experiments:
- Set both selection sliders to zero. The naive estimate should cluster around the true effect. When there is no selection, naive comparison works.
- Increase selection into enrollment while keeping confounding strength at zero. The estimate stays accurate. Selection alone does not cause bias — it only matters if the characteristic driving selection also affects the outcome.
- Increase both sliders. Now watch the naive estimate climb far above the true effect. This scenario is the classic case: motivated people both seek training and earn more.
- Set the true effect to zero and increase both sliders. You will "find" a large positive effect of a program that does nothing. This outcome is how bad research happens.
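If you want to rerun these experiments outside the interactive widget, the simulator's logic can be sketched in a few lines. This is not the site's actual implementation — just a minimal stand-in with the same two dials: how strongly the confounder pushes people into treatment (`selection`) and how strongly it raises the outcome (`confounding`):

```python
import numpy as np

def naive_estimate(true_effect, selection, confounding, n=200_000, seed=0):
    """Naive treated-vs-untreated difference under a hypothetical DGP."""
    rng = np.random.default_rng(seed)
    c = rng.normal(0, 1, n)                              # confounder ("motivation")
    d = (selection * c + rng.normal(0, 1, n) > 0).astype(int)  # selection into treatment
    y = true_effect * d + confounding * c + rng.normal(0, 1, n)
    return y[d == 1].mean() - y[d == 0].mean()

print(naive_estimate(2.0, 0.0, 0.0))  # no selection: close to the true 2.0
print(naive_estimate(2.0, 1.5, 0.0))  # selection but no confounding: still close to 2.0
print(naive_estimate(2.0, 1.5, 3.0))  # both dials up: far above 2.0
print(naive_estimate(0.0, 1.5, 3.0))  # true effect zero, yet a large "effect" appears
```

The four calls reproduce the four experiments above: bias appears only when both dials are nonzero, and with a zero true effect you still "find" a large one.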
Omitted Variable Bias: The Formal Name
When you run a regression of earnings on a training indicator, leaving out the confounder (motivation), the coefficient on training will be biased. In econometrics, this distortion is called omitted variable bias (OVB). Heckman (1979) formalized this problem in a landmark paper, showing that selection bias can be understood as a specification error in the regression equation — and proposed a correction based on modeling the selection process itself.
Here is the intuition in plain language, before any formula:
If you leave out a variable that is correlated with both your treatment and your outcome, your estimate of the treatment effect will absorb some (or all) of the omitted variable's influence. Your coefficient on treatment is too big or too small — and you have no way of knowing by how much, without knowing the omitted variable.
Let us make this concrete. Suppose the true model of earnings is:

$$Y_i = \alpha + \tau D_i + \gamma A_i + \varepsilon_i$$

where $\tau$ is the true causal effect of training $D_i$ and $\gamma$ captures the effect of motivation $A_i$. But you cannot measure motivation, so you estimate:

$$Y_i = \alpha_s + \tau_s D_i + u_i$$

The coefficient $\tau_s$ does not equal $\tau$. It equals $\tau$ plus a bias term. The bias depends on two things: (1) how strongly motivation affects earnings ($\gamma$), and (2) how strongly motivation is correlated with training assignment.
Don't worry about the notation yet — here's what this means in words: When you leave out a variable that correlates with both your treatment and outcome, your treatment coefficient is biased by the product of two things: the omitted variable's effect on the outcome, and its correlation with the treatment.
Consider the true data-generating process:

$$Y_i = \alpha + \tau D_i + \gamma A_i + \varepsilon_i$$

where $D_i$ is the treatment indicator and $A_i$ is an omitted variable (confounder). If we estimate the short regression omitting $A_i$:

$$Y_i = \alpha_s + \tau_s D_i + u_i$$

then the OLS estimator converges to:

$$\hat{\tau}_s \xrightarrow{p} \tau + \gamma\delta$$

where $\delta$ is the coefficient from the auxiliary regression of $A_i$ on $D_i$:

$$A_i = \delta_0 + \delta D_i + v_i$$

The bias is $\gamma\delta$. It has two components:

- $\gamma$: How much the omitted variable affects the outcome (if motivation does not affect earnings, there is no bias regardless of selection)
- $\delta$: How strongly the omitted variable correlates with treatment (if motivated people are no more likely to train, there is no bias regardless of motivation's effect on earnings)

The bias is zero if either $\gamma = 0$ (the omitted variable does not affect the outcome) or $\delta = 0$ (the omitted variable is uncorrelated with treatment). It is large when both are large.

The sign of the bias is the product of the signs of $\gamma$ and $\delta$:
- Positive (motivation helps earnings) × positive (motivated people train more) = positive bias (you overestimate the treatment effect)
- Positive × negative = negative bias (you underestimate the treatment effect)
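You can confirm the bias formula numerically. The sketch below simulates a hypothetical long model — the coefficient values 2.0, 3.0, and 0.8 are illustrative, with gamma the confounder's effect on the outcome and delta its coefficient in the auxiliary regression of the confounder on treatment — and checks that the short-regression coefficient equals the long-regression coefficient plus the product of the two estimated terms:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
tau, gamma, delta = 2.0, 3.0, 0.8     # illustrative: true effect, confounder effect, selection

d = rng.binomial(1, 0.5, n)                 # treatment indicator
a = delta * d + rng.normal(0, 1, n)         # omitted variable, correlated with treatment
y = tau * d + gamma * a + rng.normal(0, 1, n)

def ols(y, X):
    """OLS coefficients, intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

_, tau_long, gamma_hat = ols(y, np.column_stack([d, a]))  # long regression includes a
_, tau_short = ols(y, d)                                  # short regression omits a
_, delta_hat = ols(a, d)                                  # auxiliary regression of a on d

print(f"long tau:  {tau_long:.3f}")   # close to 2.0 (unbiased)
print(f"short tau: {tau_short:.3f}")  # close to 2.0 + 3.0 * 0.8 = 4.4 (biased)
print(f"check:     {tau_long + gamma_hat * delta_hat:.3f}")  # matches short tau
```

The last check is not approximate: in-sample OLS algebra makes the short coefficient exactly equal to the long coefficient plus the product of the estimated confounder effect and the estimated auxiliary coefficient.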
Direction of Bias: Signing the Problem
One of the most useful skills you will develop is the ability to sign the bias — to reason about whether omitting a variable makes your estimate too large or too small. This skill is called "signing the OVB."
For our training example:
| Omitted Variable | Effect on Earnings ($\gamma$) | Correlation with Training ($\delta$) | Direction of Bias |
|---|---|---|---|
| Motivation | Positive (motivated people earn more) | Positive (motivated people train more) | Upward (overestimate) |
| Health problems | Negative (illness reduces earnings) | Positive (unhealthy people may seek help) | Downward (underestimate) |
| Prior earnings | Positive (past success predicts future) | Positive (higher earners have resources to train) | Upward (overestimate) |
Notice that different omitted variables can bias your estimate in different directions. In practice, you need to think about which confounders are most important and which direction they push. This informal exercise — sometimes called a "threat assessment" — is a standard part of any empirical paper's discussion section.
A researcher studies whether attending an elite university increases earnings. She compares the earnings of elite-university graduates to non-elite graduates, controlling for SAT scores. She omits family wealth from her regression. In which direction is her estimate of the elite-university effect biased?
Confounding, Reverse Causation, and Measurement Error
Selection bias is one form of the broader problem called endogeneity. There are three main sources of endogeneity:
Confounding (what we have been discussing): a third variable affects both treatment and outcome, creating a spurious association.
Reverse causation: the outcome affects the treatment, rather than the other way around. For example, maybe people who expect to earn more next year (perhaps they have a job offer) are less likely to enroll in training (they do not need it). In this case, higher expected future earnings cause non-participation, and we would underestimate the effect of training.
Measurement error: the treatment variable is measured with error. If you cannot observe who actually received training and instead use a noisy proxy, your estimate will be biased — typically toward zero (this pattern is called attenuation bias). Classical measurement error in the treatment variable is a well-known source of inconsistency in OLS.
All three problems create a correlation between treatment and the error term that prevents a naive regression from recovering the causal effect. The methods on this site address all three.
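Attenuation bias, in particular, is easy to see in a simulation. The sketch below is hypothetical (a continuous treatment for simplicity); it adds classical measurement error with the same variance as the true treatment, which should cut the estimated slope roughly in half:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
tau = 2.0                                  # true effect (illustrative)

d_true = rng.normal(0, 1, n)               # true treatment intensity
y = tau * d_true + rng.normal(0, 1, n)
d_noisy = d_true + rng.normal(0, 1, n)     # classical measurement error, variance 1

slope = lambda x: np.polyfit(x, y, 1)[0]   # simple bivariate OLS slope
print(f"true treatment:  {slope(d_true):.3f}")   # close to 2.0
print(f"noisy treatment: {slope(d_noisy):.3f}")  # close to 2.0 * 1/(1+1) = 1.0
```

The attenuation factor is var(D) / (var(D) + var(error)); with equal variances, as here, the estimated effect is pulled halfway toward zero.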
Why This Matters for Your Research
You might be thinking: "I get it. Selection bias is bad. But my study is not about job training."
It does not matter. Selection bias is everywhere:
- Does CEO overconfidence reduce firm value? Overconfident CEOs may also take over firms in specific situations (selection into the "treatment" of overconfidence is non-random).
- Does remote work reduce productivity? Workers who choose remote work may be more or less productive to begin with.
- Does social media use cause depression? Depressed individuals may use social media differently.
- Does foreign aid promote economic growth? Aid flows are directed toward specific types of countries (often the poorest and most politically aligned).
In every case, the fundamental structure is the same: the people/firms/countries that receive "treatment" are selected, and that selection is correlated with the outcome. If you cannot account for this selection, you cannot draw causal conclusions.
Selection bias is not a niche topic. It is the central challenge of your career as an empirical researcher (Angrist & Pischke, 2009).
Key Takeaways

- A naive comparison of treated and untreated groups equals the causal effect on the treated (ATT) plus selection bias: pre-existing differences in what the groups would have earned anyway.
- Selection into treatment causes bias only when the characteristic driving selection also affects the outcome — that is, when it is a confounder.
- Omitted variable bias is the product of two terms: the confounder's effect on the outcome and its correlation with treatment. If either is zero, there is no bias.
- You can often sign the bias by reasoning about the signs of those two terms.
- Confounding, reverse causation, and measurement error are the three main sources of endogeneity; all three correlate treatment with the error term.
What Comes Next
Now that you feel the problem — you have seen how selection bias distorts estimates, you understand the OVB formula, you can sign the direction of bias — it is time to equip you with the precise vocabulary that researchers use to talk about causal inference.
What exactly do we mean by the "causal effect"? Are we talking about the effect on everyone, or just the people who were treated? What does it mean for a research design to be "identified"? What is "exogenous variation," and why does everyone keep talking about it?
The next page gives you the language of identification — the conceptual vocabulary that makes the rest of this site (and every empirical seminar you attend) comprehensible.
Next Step: The Language of Identification — Estimand, estimator, estimate. ATE, ATT, LATE. The precise vocabulary you need.