Chapter 7 of 8
Working with Data
Loading, cleaning, reshaping, and constructing variables — the unglamorous but essential work.
From Theory to Practice
You now have the conceptual foundation: you understand causal inference (F1), selection bias (F3), identification language (F4), DAGs (F5), and the full taxonomy of methods (F6). But before you can implement any of these methods, you need to be comfortable doing something more mundane and equally important: working with data.
This page covers the practical skills that bridge the gap between understanding a method and actually running it. As Angrist and Pischke (2009) emphasize, even the most elegant identification strategy fails if the data are not properly prepared. Imbens and Rubin (2015) also devote substantial attention to the practical steps of data preparation that precede any causal analysis. These skills are rarely taught formally — most PhD students pick them up by trial, error, and occasional panic the night before a deadline. We are going to save you some of that pain.
We will use a teaching dataset that connects to our running training mystery: a simplified version of the kind of data you would encounter when studying a job training program.
Loading Data
The first step is getting data into your software. Datasets come in many formats: CSV, Stata .dta, Excel, parquet, and more. Here is how to load the most common formats in R:
# CSV (most common)
df <- read.csv("training_program.csv")
# Stata .dta files (common in economics)
library(haven)
df <- read_dta("training_program.dta")
# Excel
library(readxl)
df <- read_excel("training_program.xlsx", sheet = "Sheet1")
# RDS (R's native binary format — fast, preserves types)
df <- readRDS("training_program.rds")
# Quick look
head(df)
cat("Dataset:", nrow(df), "observations,", ncol(df), "variables\n")
Inspecting Data: Know What You Have
Before you run any analysis, take time to understand your data. Inspection is not optional: it is where you catch problems that would otherwise silently ruin your results.
Variable Types and Summary Statistics
# Structure: types, dimensions, first values
str(df)
# Summary statistics
summary(df)
# Check for missing values
colSums(is.na(df))
# Value counts for categorical variables
table(df$treatment)
# Cross-tabulation
table(df$treatment, df$female)
Cleaning Data
Real-world data is imperfect. Here are the most common issues and how to handle them.
Missing Values
Missing values can be random (not a problem for most analyses, though you lose power) or systematic (a serious threat to validity). First, understand the pattern of missingness:
# Percentage missing per variable
sort(colMeans(is.na(df)) * 100, decreasing = TRUE)
# Are missing earnings related to treatment?
tapply(is.na(df$earnings), df$treatment, mean)
# Drop rows with missing earnings
df_clean <- df[!is.na(df$earnings), ]
# Or fill missing values (use with caution)
# df$education[is.na(df$education)] <- median(df$education, na.rm = TRUE)
Outliers and Implausible Values
# Check for implausible values
range(df$age)
range(df$earnings, na.rm = TRUE)
# Flag potential outliers
quantile(df$earnings, c(0.01, 0.99), na.rm = TRUE)
# Winsorize at 1st and 99th percentiles
library(DescTools)
df$earnings_w <- Winsorize(df$earnings, probs = c(0.01, 0.99), na.rm = TRUE)
Merging Datasets
Most empirical projects require combining multiple datasets. For example, you might have one file with individual characteristics and another with earnings by year. Merges are where errors love to hide.
# One-to-one merge
merged <- merge(demographics, enrollment, by = "person_id", all = FALSE)
# Check how many matched
cat("Matched:", nrow(merged), "\n")
cat("Unmatched left:", nrow(demographics) - nrow(merged), "\n")
# Many-to-one merge: adding time-invariant characteristics to panel
panel <- merge(earnings_panel, demographics, by = "person_id", all.x = TRUE)
# With dplyr (tidyverse):
# library(dplyr)
# merged <- inner_join(demographics, enrollment, by = "person_id")
# panel <- left_join(earnings_panel, demographics, by = "person_id")
Panel Data Structure
Many causal inference methods (fixed effects, difference-in-differences, event studies) require panel data: repeated observations of the same units over time. Understanding panel structure is essential.
Long vs. Wide Format
Long format (one row per unit-period): preferred for most statistical software.
| person_id | year | earnings |
|---|---|---|
| 1 | 2018 | 25000 |
| 1 | 2019 | 28000 |
| 2 | 2018 | 32000 |
| 2 | 2019 | 31000 |
Wide format (one row per unit, periods in columns): sometimes useful for specific calculations.
| person_id | earnings_2018 | earnings_2019 |
|---|---|---|
| 1 | 25000 | 28000 |
| 2 | 32000 | 31000 |
library(tidyr)
# Wide to long
long <- pivot_longer(
wide,
cols = starts_with("earnings_"),
names_to = "year",
names_prefix = "earnings_",
values_to = "earnings"
)
long$year <- as.integer(long$year)
# Long to wide
wide <- pivot_wider(
long,
names_from = year,
values_from = earnings,
names_prefix = "earnings_"
)
Declaring Panel Structure
Before running panel methods, tell your software that the data is a panel:
library(plm)
# Declare panel structure
pdata <- pdata.frame(df, index = c("person_id", "year"))
# Check balance
pdim(pdata)
# Balanced: TRUE/FALSE
# n (units), T (periods), N (total obs)
Constructing Variables
Raw data rarely contains the exact variables you need. Constructing variables is where domain knowledge meets data skills.
Common Transformations
library(dplyr)
df <- df %>%
mutate(
# Log transformation
log_earnings = log(earnings + 1),
# Treatment indicator
post_training = as.integer(year >= 2019),
# Interaction term (for DiD)
treat_x_post = treatment * post_training
) %>%
# Lagged variable
arrange(person_id, year) %>%
group_by(person_id) %>%
mutate(
earnings_lag = lag(earnings, 1),
earnings_change = earnings - first(earnings)
) %>%
ungroup()
Aggregating and Summarizing by Group
Many causal inference applications require group-level summaries: average earnings by treatment status, outcome means by cohort, or pre-treatment trends by unit. Group-by operations are fundamental.
library(dplyr)
# Mean earnings by treatment group
df %>% group_by(treatment) %>% summarise(mean_earnings = mean(earnings, na.rm = TRUE))
# Multiple summary statistics by group
df %>%
group_by(treatment) %>%
summarise(
mean_earnings = mean(earnings, na.rm = TRUE),
sd_earnings = sd(earnings, na.rm = TRUE),
n = n()
)
# Collapse to state-year level
collapsed <- df %>%
group_by(state, year) %>%
summarise(mean_earnings = mean(earnings, na.rm = TRUE), n_obs = n(), .groups = "drop")
# Within-group demeaning (fixed effects intuition)
df <- df %>%
group_by(person_id) %>%
mutate(earnings_demeaned = earnings - mean(earnings, na.rm = TRUE)) %>%
ungroup()
Common Data Pitfalls
You merge two datasets on person_id and the number of rows in the merged dataset is three times larger than either original dataset. What most likely happened?
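The most likely answer: person_id is not unique in either dataset, so the merge became many-to-many and produced every pairwise combination of rows within each key. A minimal sketch with hypothetical toy data:

```r
# Hypothetical toy data: person_id duplicated in BOTH inputs
left  <- data.frame(person_id = c(1, 1, 2), x = c("a", "b", "c"))
right <- data.frame(person_id = c(1, 1, 2), y = c("p", "q", "r"))

m <- merge(left, right, by = "person_id")
nrow(m)  # 5: id 1 matches 2 x 2 = 4 ways, id 2 matches 1 x 1 = 1 way

# Guard before merging: a non-zero result means the key is not unique
anyDuplicated(left$person_id)
anyDuplicated(right$person_id)
```

Checking key uniqueness on both sides before every merge catches this pitfall immediately.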
A Data Preparation Checklist
Before running any causal inference method, work through this checklist:
- Load and inspect: How many observations? How many variables? What are the types?
- Check missingness: How much? Is it random or systematic? Does it differ by treatment status?
- Check for duplicates: Is each observation unique at the expected level (person, person-year, etc.)?
- Validate ranges: Are all values plausible? Age between 0 and 120? Earnings non-negative (or are negatives meaningful)?
- Verify merges: Did the merge produce the expected number of rows? Check the match rate.
- Construct variables: Create treatment indicators, logs, lags, differences, interactions.
- Examine balance: Compare treatment and control groups on pre-treatment characteristics. Are they similar?
- Document everything: Keep a record of every cleaning step, every decision about how to handle missing values, every variable construction. Your future self (and your referees) will thank you. Christensen and Miguel (2018) provide a comprehensive guide to transparent and reproducible research practices. Our guide on how to replicate shows what thorough documentation looks like in practice.
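Several of these checks take only a few lines to automate. A minimal sketch, using a small hypothetical person-year panel, of the duplicate check and a simple balance comparison:

```r
# Hypothetical person-year panel
panel_df <- data.frame(
  person_id = c(1, 1, 2, 2),
  year      = c(2018, 2019, 2018, 2019),
  treatment = c(1, 1, 0, 0),
  age       = c(30, 31, 45, 46)
)

# Duplicates: each person-year combination should appear exactly once
sum(duplicated(panel_df[, c("person_id", "year")]))  # 0 means unique at this level

# Balance: compare a pre-treatment characteristic across groups
aggregate(age ~ treatment, data = panel_df, FUN = mean)
```

Running checks like these at the top of every analysis script, rather than once by hand, is what makes the checklist stick.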
What Comes Next
You now have both the conceptual foundation (F1–F6) and the practical skills (this page) to begin learning specific methods. But before we dive in, there is one more story to tell: how did the field of empirical economics get to where it is today? The answer is the credibility revolution — a transformation in how researchers think about evidence that changed what counts as convincing. That story also resolves our training mystery.
Next Step: The Credibility Revolution — The story of how the field learned to demand credible identification, and the resolution of our training mystery.