MethodAtlas

Chapter 5 of 8

DAGs for Beginners

A visual tool for thinking about causal relationships — draw the problem, see the solution.

The Mystery: What if we could draw a picture of the training problem and see exactly what's going wrong?

The Map Before the Territory

In F3, we felt the pain of selection bias. In F4, we learned the precise language for what we are trying to estimate. But there is still a gap: when you sit down with a research question, how do you figure out what could go wrong? How do you decide which variables to control for and which to leave alone?

You need a map. That map is called a DAG — a Directed Acyclic Graph.

A DAG is deceptively simple. It is a picture with arrows. But it will become one of the most useful tools in your research toolkit, because it forces you to be explicit about what you believe causes what. And once your beliefs are drawn on the page, there are precise rules — formalized through the do-calculus (Pearl, 2009) — that tell you whether your research design can identify a causal effect, or whether it is doomed.

What Is a DAG?

Let us unpack the name:

  • Directed: Every arrow points one way. A → B means "A causes B" (or at least: A affects B). The direction matters.
  • Acyclic: No loops. You cannot follow the arrows and end up back where you started. (The restriction is a simplification — feedback loops exist in reality — but for any given moment in time, DAGs are a powerful framework.)
  • Graph: It is a network of nodes (variables) and edges (arrows).

Here is the simplest possible DAG for our training mystery:

Training → Earnings

The DAG says: "Training causes Earnings." If this relationship were the whole story — if nothing else affected both training and earnings — we could simply compare trainees to non-trainees and be done. But we already know from F3 that this simple picture is not the whole story.

Building the Training Mystery DAG

Let us add what we know. People who enroll in job training tend to be more motivated. Motivation also affects earnings directly (motivated people work harder, network more, negotiate raises). So:

Motivation → Training
Motivation → Earnings
Training  → Earnings

Now we have three variables and three arrows. Motivation is a confounder — it causes both the treatment (Training) and the outcome (Earnings). The structure is exactly the selection bias story from F3, but now we can see it.

What about education? People with more education are more likely to enroll in training and earn more:

Education  → Training
Education  → Earnings
Motivation → Training
Motivation → Earnings
Training   → Earnings

Every arrow you add is a claim about how the world works. DAGs do not tell you what the arrows are — you have to decide that based on theory, institutional knowledge, and prior research. The DAG is a tool for reasoning through the consequences of your assumptions.
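The five-arrow DAG above can be written down in a few lines of Python. This is a teaching sketch (the dict encoding and helper name are ours, not a standard API); each entry is one of the causal claims listed above, and the check confirms the "acyclic" requirement:

```python
# The training DAG as a dict mapping each variable to what it causes.
# Every entry is a causal claim, justified by theory, not by data.
dag = {
    "Education":  ["Training", "Earnings"],
    "Motivation": ["Training", "Earnings"],
    "Training":   ["Earnings"],
    "Earnings":   [],
}

def is_acyclic(graph):
    """Check that no variable can reach itself by following arrows."""
    def reaches(target, node, seen):
        for child in graph.get(node, []):
            if child == target:
                return True
            if child not in seen and reaches(target, child, seen | {child}):
                return True
        return False
    return not any(reaches(v, v, set()) for v in graph)

print(is_acyclic(dag))  # True: this graph qualifies as a DAG
```

If you accidentally added a feedback arrow (say, Earnings → Motivation), the check would return False, flagging that the picture is no longer a DAG.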

Try It: Build the Training DAG

Use the interactive DAG builder below to construct the DAG for our training mystery. Start with Training and Earnings, then add Motivation, Education, and Prior Earnings as potential confounders. Draw the arrows that you believe reflect the true causal structure.

DAG Builder

Draw arrows between nodes to represent causal relationships. Click a node, then click another to draw an arrow from the first to the second.

Paths: How Variables Connect

In a DAG, two variables can be connected by a path — a sequence of nodes and edges linking them, regardless of which direction the arrows point. Consider Training and Earnings in our DAG. There are multiple paths:

  1. Direct path: Training → Earnings (the causal effect we want)
  2. Backdoor path: Training ← Motivation → Earnings (goes "backward" through Motivation)
  3. Backdoor path: Training ← Education → Earnings

The direct path is what we care about. The backdoor paths are the problem — they create spurious associations between Training and Earnings that have nothing to do with the causal effect of training.
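The three paths can also be enumerated programmatically. The snippet below is a minimal pure-Python sketch (the helper name is invented for illustration): it walks the graph ignoring arrow direction, then flags any path whose first arrow points into Training. In this small graph, every remaining path is the direct causal one, so the shortcut classification is safe:

```python
# Edges of the training DAG, written as (cause, effect) pairs.
edges = [("Motivation", "Training"), ("Motivation", "Earnings"),
         ("Education", "Training"), ("Education", "Earnings"),
         ("Training", "Earnings")]

def all_paths(edges, start, end, path=None):
    """Every simple path between start and end, ignoring direction."""
    path = path or [start]
    if start == end:
        yield path
        return
    for a, b in edges:
        nxt = b if a == start else a if b == start else None
        if nxt is not None and nxt not in path:
            yield from all_paths(edges, nxt, end, path + [nxt])

for p in all_paths(edges, "Training", "Earnings"):
    # Backdoor: the first edge on the path points INTO the treatment.
    kind = "backdoor" if (p[1], p[0]) in edges else "causal"
    print(kind, " -> ".join(p))
```

Running this prints one causal path (Training → Earnings) and the two backdoor paths through Motivation and Education, matching the list above.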

Backdoor Paths: The Enemy

A backdoor path is any path from treatment to outcome that starts with an arrow pointing into the treatment. In other words, it enters the treatment node through the "back door."

Why do backdoor paths matter? Because they transmit statistical association that is not causal. If Motivation causes both Training and Earnings, then Training and Earnings will be correlated even if Training has zero causal effect on Earnings. The correlation flows through Motivation.

The core insight of causal inference is the following: to identify a causal effect, you must block all backdoor paths while leaving the causal (directed) path open.

How do you "block" a path? By conditioning on (controlling for) a variable that sits on the backdoor path. If you control for Motivation, the backdoor path Training ← Motivation → Earnings is blocked, because you have held Motivation constant. This blocking is exactly what matching methods do: they compare treated and untreated units that look alike on the confounders, effectively closing the backdoor paths.
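A quick simulation makes the blocking concrete. All numbers below are invented, and the true effect of Training on Earnings is set to exactly zero. The naive comparison still shows a large gap, because association flows through the back door; stratifying on Motivation (the conditioning operation) recovers roughly zero:

```python
import random
random.seed(0)

# Toy data-generating process (assumed numbers): Motivation raises both
# the chance of Training and Earnings; Training itself has ZERO effect.
n = 50_000
rows = []
for _ in range(n):
    motivated = random.random() < 0.5
    trained   = random.random() < (0.7 if motivated else 0.2)
    earnings  = 30_000 + (10_000 if motivated else 0) + random.gauss(0, 1_000)
    rows.append((motivated, trained, earnings))

def mean_gap(rows):
    """Mean earnings of trained minus untrained."""
    t = [e for m, tr, e in rows if tr]
    c = [e for m, tr, e in rows if not tr]
    return sum(t) / len(t) - sum(c) / len(c)

naive = mean_gap(rows)  # association flowing through the back door
blocked = 0.5 * (mean_gap([r for r in rows if r[0]])       # motivated stratum
                 + mean_gap([r for r in rows if not r[0]]))  # unmotivated stratum
print(f"naive gap:   {naive:8.0f}")    # thousands of dollars, all spurious
print(f"blocked gap: {blocked:8.0f}")  # close to the true effect of 0
```

The stratified comparison holds Motivation constant, which is exactly what "blocking the backdoor path" means in practice.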

d-Separation: When Are Variables Independent?

Here is the key question a DAG answers: given what I am conditioning on, are Treatment and Outcome connected only through the causal path?

The formal tool for answering this question is called d-separation. In plain language:

Two variables are d-separated given a set of conditioning variables if there is no open path between them, once you account for what you are conditioning on. If two variables are d-separated, they are statistically independent (given your DAG assumptions).

A path is open if every variable along it transmits association. A path is blocked if at least one variable along it stops the flow.

There are three building blocks to learn:

1. Chains (Mediation): A → B → C

If you do not condition on B, the path is open: A and C are associated. If you do condition on B, the path is blocked: A and C are independent given B.

Example: Education → Job Quality → Earnings. If you control for Job Quality, you block the path from Education to Earnings that runs through jobs.

2. Forks (Confounding): A ← B → C

If you do not condition on B, the path is open: A and C are associated (spuriously). If you do condition on B, the path is blocked.

Example: Training ← Motivation → Earnings. Control for Motivation, and the spurious association disappears.

3. Colliders: A → B ← C

The third building block is the surprise. If you do not condition on B, the path is blocked — A and C are independent. But if you do condition on B, the path opens up, and A and C become associated.

The collider behavior is the opposite of what happens with chains and forks, and it trips up even experienced researchers.

Don't worry about the notation yet — here's what this means in words: Two nodes X and Y are d-separated by a set Z if every path between them is blocked — meaning the path contains either a chain or fork whose middle node is in Z, or a collider whose middle node (and its descendants) is NOT in Z.

Definition (d-separation). Let G be a DAG, and let X, Y be distinct nodes. Let Z be a (possibly empty) set of nodes not including X or Y. Then X and Y are d-separated by Z in G if and only if every path between X and Y is blocked by Z.

A path is blocked by Z if it contains at least one of:

  1. A chain A → M → B or a fork A ← M → B where M ∈ Z
  2. A collider A → M ← B where M ∉ Z and no descendant of M is in Z

If X and Y are not d-separated by Z, they are d-connected given Z.
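The two blocking rules can be turned into a short checker. This is a pure-Python teaching sketch for small graphs, not a production implementation (real analyses use a dedicated library): it enumerates every path between two nodes and applies rules 1 and 2 to each, then verifies the chain and collider behavior described above:

```python
def all_paths(edges, start, end, path=None):
    """Every simple path between start and end, ignoring direction."""
    path = path or [start]
    if start == end:
        yield path
        return
    for a, b in edges:
        nxt = b if a == start else a if b == start else None
        if nxt is not None and nxt not in path:
            yield from all_paths(edges, nxt, end, path + [nxt])

def descendants(edges, node):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in found:
                found.add(b)
                stack.append(b)
    return found

def blocked(edges, path, Z):
    """Apply rules 1 and 2 to each middle node M on the path."""
    for A, M, B in zip(path, path[1:], path[2:]):
        if (A, M) in edges and (B, M) in edges:       # collider A -> M <- B
            if M not in Z and not (descendants(edges, M) & Z):
                return True                            # rule 2 blocks it
        elif M in Z:                                   # chain or fork
            return True                                # rule 1 blocks it
    return False

def d_separated(edges, X, Y, Z):
    return all(blocked(edges, p, Z) for p in all_paths(edges, X, Y))

chain    = [("A", "B"), ("B", "C")]   # A -> B -> C
collider = [("A", "B"), ("C", "B")]   # A -> B <- C
print(d_separated(chain, "A", "C", set()))     # False: open chain
print(d_separated(chain, "A", "C", {"B"}))     # True: conditioning blocks it
print(d_separated(collider, "A", "C", set()))  # True: collider blocks by default
print(d_separated(collider, "A", "C", {"B"}))  # False: conditioning opens it
```

Note the symmetry in the output: conditioning on the middle node blocks a chain but opens a collider, which is exactly the surprise from building block 3.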

The Causal Markov Condition states that if the DAG correctly represents the data-generating process, then d-separation in the graph implies conditional independence in the probability distribution:

X ⫫ Y | Z in the DAG  ⟹  X ⫫ Y | Z in P

The foundational result, developed by Pearl (2009), is what gives DAGs their power: graphical structure determines statistical properties.

Collider Bias: The Trap That Catches Everyone

Collider bias is so important — and so counterintuitive — that it deserves its own section. Let us use a famous example.

The "Hollywood Paradox." Suppose that among Hollywood actors, talent and attractiveness appear to be negatively correlated: the most talented actors are not the most attractive, and the most attractive actors are not the most talented. Does this mean that in the general population, being talented makes you less attractive?

No. Here is the DAG:

Talent → Hollywood Success ← Attractiveness

Hollywood Success is a collider — both Talent and Attractiveness point into it. In the general population, Talent and Attractiveness are probably independent (or only weakly correlated). But when you condition on Hollywood Success — by only looking at people who made it in Hollywood — you open the collider path. Among successful actors, those who are less attractive must have made it on talent, and vice versa. You see a negative correlation that does not exist in the population.

Back to the training mystery. Suppose we have:

Training → Job Placement ← Ability
Ability  → Earnings
Training → Earnings

If we control for Job Placement (a collider), we would introduce bias rather than remove it. The collider problem is why the reflex "control for everything" is dangerous. Some variables should be left alone.
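A toy simulation (all numbers invented) reproduces the Hollywood paradox. Talent and Attractiveness are drawn independently, success is a collider caused by both, and the negative correlation appears only after conditioning on success:

```python
import random
random.seed(1)

# Talent and attractiveness: independent standard normals.
n = 100_000
people = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Conditioning on the collider: keep only those who "made it".
stars = [(t, a) for t, a in people if t + a > 2.0]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(f"population corr: {corr(people):+.2f}")  # roughly zero
print(f"stars corr:      {corr(stars):+.2f}")   # clearly negative
```

No causal link between talent and attractiveness was built into the simulation; the negative correlation among stars is manufactured entirely by selecting on the collider.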

Concept Check

You are studying whether a job training program increases earnings. You have data on Training, Earnings, Motivation (a confounder), and Job Placement (caused by both Training and Ability). Which variables should you control for?

Common Mistakes with DAGs

DAGs in Practice: What to Do

Here is a practical workflow for using DAGs in your own research:

  1. List your variables. Treatment, outcome, and every variable you think might be relevant (confounders, mediators, instruments, etc.).
  2. Draw the arrows. For each pair of variables, ask: "Does A cause B? Does B cause A? Or are they unrelated?" Be honest and use theory, not data, to justify the arrows.
  3. Identify all paths from treatment to outcome.
  4. Classify each path as causal (directed) or non-causal (backdoor).
  5. Find a conditioning set that blocks all backdoor paths without opening collider paths and without blocking the causal path.
  6. Check feasibility. Can you actually measure and control for the variables in your conditioning set? If not, you may need a different identification strategy — which is exactly what F6 is about.

Key Takeaways

What Comes Next

We now have the language of causal inference (F4) and a visual tool for reasoning about causal structures (this page). But our training mystery is still unsolved. What strategies can we actually use to identify the causal effect of training on earnings? In the next page, we survey the full landscape of identification strategies — the complete toolkit of methods you will learn on this site.

Next Step: A Taxonomy of Identification Strategies — Survey the full landscape of causal inference methods and learn which tool to reach for.