Chapter 3 of 8
Selection Bias and Confounding
The single biggest threat to your research — and the reason every method on this site exists.
Let us return to our mystery.
A state government runs a job training program. You compare the earnings of people who participated to people who did not, and you find a $7,500 difference. But from the first page of this sequence, you already know the true causal effect is only $2,000. The remaining $5,500 is selection bias — the systematic difference between the kinds of people who chose training and those who did not.
On this page, we are going to dissect selection bias until you understand it thoroughly. Selection bias is not just one topic among many. Selection bias is the single biggest threat to your research. Every method you will learn on this site exists to fight it.
Who Signs Up?
Imagine you are administering this training program. You set up a sign-up table at a community center. Who walks through the door?
- People who are motivated to change their employment situation
- People who have heard about the program (more connected, more informed)
- People who can afford the time — they are not working three jobs or caring for a sick relative
- People with more education — they are more comfortable in a classroom setting
- People with higher prior earnings — they have the financial cushion to take time off for training
Now think about these characteristics. Every single one of them — motivation, social connections, time availability, education, prior earnings — also independently predicts future earnings. A motivated person with a college degree and a social network will likely earn more next year regardless of whether she takes a training course.
This pattern is the anatomy of selection bias: a variable (call it "motivation" or "ability" or "background") affects both the decision to get treated and the outcome you are measuring. It is a common cause of treatment and outcome. In the causal inference literature, such a variable is called a confounder. Later, when you learn to draw DAGs, you will be able to see exactly how confounders create spurious associations and what you need to do to block them.
Selection Into Treatment
People vary in motivation, education, and ability.
The Decomposition: What Your Estimate Actually Captures
Let us be precise about what happens when you compute the difference in average outcomes between the treated and untreated groups. As Holland (1986) showed in his foundational discussion of the "fundamental problem of causal inference," we can never observe both potential outcomes for the same unit. We can, however, decompose the naive comparison into two pieces:

$$\underbrace{E[Y \mid D = 1] - E[Y \mid D = 0]}_{\text{naive comparison}} = \underbrace{E[Y_1 - Y_0 \mid D = 1]}_{\text{ATT}} + \underbrace{E[Y_0 \mid D = 1] - E[Y_0 \mid D = 0]}_{\text{selection bias}}$$
Do not let the notation intimidate you. Here is what each piece means in plain language:
The left side is what you can compute: the average earnings of trainees minus the average earnings of non-trainees. This quantity is the $7,500 from our example.
The first term on the right is what you want: the actual causal effect of training on the people who were trained. The first term is the average treatment effect on the treated (ATT).
The second term on the right is the problem: the difference in what trainees and non-trainees would have earned even without any training. The difference is selection bias. It reflects pre-existing differences between the groups. In our example, this difference is the $5,500 that comes from trainees being more motivated, educated, and connected.
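The decomposition is easy to verify in a simulation where, unlike in real data, we can see both potential outcomes. The sketch below uses a hypothetical data-generating process in which "motivation" drives both training take-up and earnings; every parameter value is illustrative, chosen only so that the true effect is $2,000:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: "motivation" raises both the chance of enrolling
# in training and earnings. All numbers are illustrative.
motivation = rng.normal(0, 1, n)
trained = (motivation + rng.normal(0, 1, n) > 0.5).astype(int)

true_effect = 2_000
y0 = 30_000 + 5_000 * motivation + rng.normal(0, 2_000, n)  # earnings without training
y1 = y0 + true_effect                                       # earnings with training
y = np.where(trained == 1, y1, y0)                          # what we actually observe

naive = y[trained == 1].mean() - y[trained == 0].mean()
att = (y1 - y0)[trained == 1].mean()                        # causal effect on the trained
selection_bias = y0[trained == 1].mean() - y0[trained == 0].mean()

print(f"naive difference: {naive:,.0f}")
print(f"ATT:              {att:,.0f}")
print(f"selection bias:   {selection_bias:,.0f}")
# The identity  naive = ATT + selection bias  holds exactly in the sample.
```

Because the simulation generates both potential outcomes, we can compute the counterfactual quantity (what trainees would have earned without training) that is invisible in real data, and confirm that the naive difference splits exactly into the ATT plus the selection bias.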
Feeling It: The Selection Bias Simulator
Reading about selection bias is one thing. Seeing it is another. The simulation below generates data from a world where you know the truth. You control how strongly a confounder (think: "motivation") affects both the probability of treatment and the outcome.
Selection Bias Decomposition
Adjust the degree of selection and watch the naive estimate diverge from the true treatment effect. The gap between the two lines IS the selection bias. Notice: even a moderate amount of selection can make your estimate wildly wrong.
Try the following experiments:
- Set both selection sliders to zero. The naive estimate should cluster around the true effect. When there is no selection, naive comparison works.
- Increase selection into enrollment while keeping confounding strength at zero. The estimate stays accurate. Selection alone does not cause bias — it only matters if the characteristic driving selection also affects the outcome.
- Increase both sliders. Now watch the naive estimate climb far above the true effect. This scenario is the classic case: motivated people both seek training and earn more.
- Set the true effect to zero and increase both sliders. You will "find" a large positive effect of a program that does nothing. This outcome is how bad research happens.
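If you want to rerun these experiments outside the interactive widget, the simulator's logic can be sketched in a few lines. This is not the site's actual implementation — just a minimal stand-in with the same two dials: how strongly the confounder pushes people into treatment (`selection`) and how strongly it raises the outcome (`confounding`):

```python
import numpy as np

def naive_estimate(true_effect, selection, confounding, n=200_000, seed=0):
    """Naive treated-vs-untreated difference under a hypothetical DGP."""
    rng = np.random.default_rng(seed)
    c = rng.normal(0, 1, n)                              # confounder ("motivation")
    d = (selection * c + rng.normal(0, 1, n) > 0).astype(int)  # selection into treatment
    y = true_effect * d + confounding * c + rng.normal(0, 1, n)
    return y[d == 1].mean() - y[d == 0].mean()

print(naive_estimate(2.0, 0.0, 0.0))  # no selection: close to the true 2.0
print(naive_estimate(2.0, 1.5, 0.0))  # selection but no confounding: still close to 2.0
print(naive_estimate(2.0, 1.5, 3.0))  # both dials up: far above 2.0
print(naive_estimate(0.0, 1.5, 3.0))  # true effect zero, yet a large "effect" appears
```

The four calls reproduce the four experiments above: bias appears only when both dials are nonzero, and with a zero true effect you still "find" a large one.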
Omitted Variable Bias: The Formal Name
When you run a regression of earnings on a training indicator, leaving out the confounder (motivation), the coefficient on training will be biased. In econometrics, this distortion is called omitted variable bias (OVB). Heckman (1979) formalized this problem in a landmark paper, showing that selection bias can be understood as a specification error in the regression equation — and proposed a correction based on modeling the selection process itself.
Here is the intuition in plain language, before any formula:
If you leave out a variable that is correlated with both your treatment and your outcome, your estimate of the treatment effect will absorb some (or all) of the omitted variable's influence. Your coefficient on treatment is too big or too small — and you have no way of knowing by how much, without knowing the omitted variable.
Let us make this concrete. Suppose the true model of earnings is:

$$Y_i = \alpha + \tau D_i + \gamma A_i + \varepsilon_i$$

where $\tau$ is the true causal effect of training $D_i$ and $\gamma$ captures the effect of motivation $A_i$. But you cannot measure motivation, so you estimate:

$$Y_i = \alpha_s + \tau_s D_i + u_i$$

The coefficient $\tau_s$ does not equal $\tau$. It equals $\tau$ plus a bias term. The bias depends on two things: (1) how strongly motivation affects earnings ($\gamma$), and (2) how strongly motivation is correlated with training assignment.
Don't worry about the notation yet — here's what this means in words: When you leave out a variable that correlates with both your treatment and outcome, your treatment coefficient is biased by the product of two things: the omitted variable's effect on the outcome, and its correlation with the treatment.
Consider the true data-generating process:

$$Y_i = \alpha + \tau D_i + \gamma A_i + \varepsilon_i$$

where $D_i$ is the treatment indicator and $A_i$ is an omitted variable (confounder). If we estimate the short regression omitting $A_i$:

$$Y_i = \alpha_s + \tau_s D_i + u_i$$

then the OLS estimator converges to:

$$\hat{\tau}_s \xrightarrow{p} \tau + \gamma\delta$$

where $\delta$ is the coefficient from the auxiliary regression of $A_i$ on $D_i$:

$$A_i = \delta_0 + \delta D_i + v_i$$

The bias is $\gamma\delta$. It has two components:

- $\gamma$: How much the omitted variable affects the outcome (if motivation does not affect earnings, there is no bias regardless of selection)
- $\delta$: How strongly the omitted variable correlates with treatment (if motivated people are no more likely to train, there is no bias regardless of motivation's effect on earnings)

The bias is zero if either $\gamma = 0$ (the omitted variable does not affect the outcome) or $\delta = 0$ (the omitted variable is uncorrelated with treatment). It is large when both are large.

The sign of the bias is the product of the signs of $\gamma$ and $\delta$:
- Positive (motivation helps earnings) × positive (motivated people train more) = positive bias (you overestimate the treatment effect)
- Positive × negative = negative bias (you underestimate the treatment effect)
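You can confirm the bias formula numerically. The sketch below simulates a hypothetical long model — the coefficient values 2.0, 3.0, and 0.8 are illustrative, with gamma the confounder's effect on the outcome and delta its coefficient in the auxiliary regression of the confounder on treatment — and checks that the short-regression coefficient equals the long-regression coefficient plus the product of the two estimated terms:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
tau, gamma, delta = 2.0, 3.0, 0.8     # illustrative: true effect, confounder effect, selection

d = rng.binomial(1, 0.5, n)                 # treatment indicator
a = delta * d + rng.normal(0, 1, n)         # omitted variable, correlated with treatment
y = tau * d + gamma * a + rng.normal(0, 1, n)

def ols(y, X):
    """OLS coefficients, intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

_, tau_long, gamma_hat = ols(y, np.column_stack([d, a]))  # long regression includes a
_, tau_short = ols(y, d)                                  # short regression omits a
_, delta_hat = ols(a, d)                                  # auxiliary regression of a on d

print(f"long tau:  {tau_long:.3f}")   # close to 2.0 (unbiased)
print(f"short tau: {tau_short:.3f}")  # close to 2.0 + 3.0 * 0.8 = 4.4 (biased)
print(f"check:     {tau_long + gamma_hat * delta_hat:.3f}")  # matches short tau
```

The last check is not approximate: in-sample OLS algebra makes the short coefficient exactly equal to the long coefficient plus the product of the estimated confounder effect and the estimated auxiliary coefficient.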
Direction of Bias: Signing the Problem
One of the most useful skills you will develop is the ability to sign the bias — to reason about whether omitting a variable makes your estimate too large or too small. This skill is called "signing the OVB."
For our training example:
| Omitted Variable | Effect on Earnings ($\gamma$) | Correlation with Training ($\delta$) | Direction of Bias |
|---|---|---|---|
| Motivation | Positive (motivated people earn more) | Positive (motivated people train more) | Upward (overestimate) |
| Health problems | Negative (illness reduces earnings) | Positive (unhealthy people may seek help) | Downward (underestimate) |
| Prior earnings | Positive (past success predicts future) | Positive (higher earners have resources to train) | Upward (overestimate) |
Notice that different omitted variables can bias your estimate in different directions. In practice, you need to think about which confounders are most important and which direction they push. This informal exercise — sometimes called a "threat assessment" — is a standard part of any empirical paper's discussion section.
A researcher studies whether attending an elite university increases earnings. She compares the earnings of elite-university graduates to non-elite graduates, controlling for SAT scores. She omits family wealth from her regression. In which direction is her estimate of the elite-university effect biased?
Confounding, Reverse Causation, and Measurement Error
Selection bias is one form of the broader problem called endogeneity. There are three main sources of endogeneity:
Confounding (what we have been discussing): a third variable affects both treatment and outcome, creating a spurious association.
Reverse causation: the outcome affects the treatment, rather than the other way around. For example, maybe people who expect to earn more next year (perhaps they have a job offer) are less likely to enroll in training (they do not need it). In this case, higher expected future earnings cause non-participation, and we would underestimate the effect of training.
Measurement error: the treatment variable is measured with error. If you cannot observe who actually received training and instead use a noisy proxy, your estimate will be biased — typically toward zero (this pattern is called attenuation bias). Classical measurement error in the treatment variable is a well-known source of inconsistency in OLS.
All three problems create a correlation between treatment and the error term that prevents a naive regression from recovering the causal effect. The methods on this site address all three.
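Attenuation bias, in particular, is easy to see in a simulation. The sketch below is hypothetical (a continuous treatment for simplicity); it adds classical measurement error with the same variance as the true treatment, which should cut the estimated slope roughly in half:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
tau = 2.0                                  # true effect (illustrative)

d_true = rng.normal(0, 1, n)               # true treatment intensity
y = tau * d_true + rng.normal(0, 1, n)
d_noisy = d_true + rng.normal(0, 1, n)     # classical measurement error, variance 1

slope = lambda x: np.polyfit(x, y, 1)[0]   # simple bivariate OLS slope
print(f"true treatment:  {slope(d_true):.3f}")   # close to 2.0
print(f"noisy treatment: {slope(d_noisy):.3f}")  # close to 2.0 * 1/(1+1) = 1.0
```

The attenuation factor is var(D) / (var(D) + var(error)); with equal variances, as here, the estimated effect is pulled halfway toward zero.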
Why This Matters for Your Research
You might be thinking: "I get it. Selection bias is bad. But my study is not about job training."
It does not matter. Selection bias is everywhere:
- Does CEO overconfidence reduce firm value? Overconfident CEOs may also take over firms in specific situations (selection into the "treatment" of overconfidence is non-random).
- Does remote work reduce productivity? Workers who choose remote work may be more or less productive to begin with.
- Does social media use cause depression? Depressed individuals may use social media differently.
- Does foreign aid promote economic growth? Aid flows are directed toward specific types of countries (often the poorest and most politically aligned).
In every case, the fundamental structure is the same: the people/firms/countries that receive "treatment" are selected, and that selection is correlated with the outcome. If you cannot account for this selection, you cannot draw causal conclusions.
Selection bias is not a niche topic. It is the central challenge of your career as an empirical researcher (Angrist & Pischke, 2009).
Key Takeaways

- A naive comparison of treated and untreated groups equals the causal effect on the treated (ATT) plus selection bias: pre-existing differences in what the groups would have earned anyway.
- Selection into treatment causes bias only when the characteristic driving selection also affects the outcome — that is, when it is a confounder.
- Omitted variable bias is the product of two terms: the confounder's effect on the outcome and its correlation with treatment. If either is zero, there is no bias.
- You can often sign the bias by reasoning about the signs of those two terms.
- Confounding, reverse causation, and measurement error are the three main sources of endogeneity; all three correlate treatment with the error term.
What Comes Next
Now that you feel the problem — you have seen how selection bias distorts estimates, you understand the OVB formula, you can sign the direction of bias — it is time to equip you with the precise vocabulary that researchers use to talk about causal inference.
What exactly do we mean by the "causal effect"? Are we talking about the effect on everyone, or just the people who were treated? What does it mean for a research design to be "identified"? What is "exogenous variation," and why does everyone keep talking about it?
The next page gives you the language of identification — the conceptual vocabulary that makes the rest of this site (and every empirical seminar you attend) comprehensible.
Next Step: The Language of Identification — Estimand, estimator, estimate. ATE, ATT, LATE. The precise vocabulary you need.