Chapter 7 of 8
Working with Data
Loading, cleaning, reshaping, and constructing variables — the unglamorous but essential work.
From Theory to Practice
You now have the conceptual foundation: you understand causal inference (F1), selection bias (F3), identification language (F4), DAGs (F5), and the full taxonomy of methods (F6). But before you can implement any of these methods, you need to be comfortable doing something more mundane and equally important: working with data.
This page covers the practical skills that bridge the gap between understanding a method and actually running it. As Angrist and Pischke (2009) emphasize, even the most elegant identification strategy fails if the data are not properly prepared. Imbens and Rubin (2015) also devote substantial attention to the practical steps of data preparation that precede any causal analysis. These skills are rarely taught formally — most PhD students pick them up by trial, error, and occasional panic the night before a deadline. We are going to save you some of that pain.
We will use a teaching dataset that connects to our running training mystery: a simplified version of the kind of data you would encounter when studying a job training program.
Loading Data
The first step is getting data into your software. Datasets come in many formats: CSV, Stata .dta, Excel, parquet, and more. Here is how to load the most common formats in R:
# CSV (most common)
df <- read.csv("training_program.csv")
# Stata .dta files (common in economics)
library(haven)
df <- read_dta("training_program.dta")
# Excel
library(readxl)
df <- read_excel("training_program.xlsx", sheet = "Sheet1")
# RDS (R's native binary format — fast, preserves types)
df <- readRDS("training_program.rds")
# Quick look
head(df)
cat("Dataset:", nrow(df), "observations,", ncol(df), "variables\n")
Inspecting Data: Know What You Have
Before you run any analysis, take time to understand your data. Inspection is not optional: it is where you catch problems that would otherwise silently ruin your results.
Variable Types and Summary Statistics
# Structure: types, dimensions, first values
str(df)
# Summary statistics
summary(df)
# Check for missing values
colSums(is.na(df))
# Value counts for categorical variables
table(df$treatment)
# Cross-tabulation
table(df$treatment, df$female)
Cleaning Data
Real-world data is imperfect. Here are the most common issues and how to handle them.
Missing Values
Missing values can be random (not a problem for most analyses, though you lose power) or systematic (a serious threat to validity). First, understand the pattern of missingness:
# Percentage missing per variable
sort(colMeans(is.na(df)) * 100, decreasing = TRUE)
# Are missing earnings related to treatment?
tapply(is.na(df$earnings), df$treatment, mean)
# Drop rows with missing earnings
df_clean <- df[!is.na(df$earnings), ]
# Or fill missing values (use with caution)
# df$education[is.na(df$education)] <- median(df$education, na.rm = TRUE)
Outliers and Implausible Values
# Check for implausible values
range(df$age)
range(df$earnings, na.rm = TRUE)
# Flag potential outliers
quantile(df$earnings, c(0.01, 0.99), na.rm = TRUE)
# Winsorize at 1st and 99th percentiles
library(DescTools)
df$earnings_w <- Winsorize(df$earnings, probs = c(0.01, 0.99), na.rm = TRUE)
Merging Datasets
Most empirical projects require combining multiple datasets. For example, you might have one file with individual characteristics and another with earnings by year. Merges are where errors love to hide.
# One-to-one merge
merged <- merge(demographics, enrollment, by = "person_id", all = FALSE)
# Check how many matched
cat("Matched:", nrow(merged), "\n")
cat("Unmatched left:", nrow(demographics) - nrow(merged), "\n")
# Many-to-one merge: adding time-invariant characteristics to panel
panel <- merge(earnings_panel, demographics, by = "person_id", all.x = TRUE)
# With dplyr (tidyverse):
# library(dplyr)
# merged <- inner_join(demographics, enrollment, by = "person_id")
# panel <- left_join(earnings_panel, demographics, by = "person_id")
Panel Data Structure
Many causal inference methods (fixed effects, difference-in-differences, event studies) require panel data: repeated observations of the same units over time. Understanding panel structure is essential.
Long vs. Wide Format
Long format (one row per unit-period): preferred for most statistical software.
| person_id | year | earnings |
|---|---|---|
| 1 | 2018 | 25000 |
| 1 | 2019 | 28000 |
| 2 | 2018 | 32000 |
| 2 | 2019 | 31000 |
Wide format (one row per unit, periods in columns): sometimes useful for specific calculations.
| person_id | earnings_2018 | earnings_2019 |
|---|---|---|
| 1 | 25000 | 28000 |
| 2 | 32000 | 31000 |
library(tidyr)
# Wide to long
long <- pivot_longer(
wide,
cols = starts_with("earnings_"),
names_to = "year",
names_prefix = "earnings_",
values_to = "earnings"
)
long$year <- as.integer(long$year)
# Long to wide
wide <- pivot_wider(
long,
names_from = year,
values_from = earnings,
names_prefix = "earnings_"
)
Declaring Panel Structure
Before running panel methods, tell your software that the data is a panel:
library(plm)
# Declare panel structure
pdata <- pdata.frame(df, index = c("person_id", "year"))
# Check balance
pdim(pdata)
# Balanced: TRUE/FALSE
# n (units), T (periods), N (total obs)
Constructing Variables
Raw data rarely contains the exact variables you need. Constructing variables is where domain knowledge meets data skills.
Common Transformations
library(dplyr)
df <- df %>%
mutate(
# Log transformation
log_earnings = log(earnings + 1),
# Treatment indicator
post_training = as.integer(year >= 2019),
# Interaction term (for DiD)
treat_x_post = treatment * post_training
) %>%
# Lagged variable
arrange(person_id, year) %>%
group_by(person_id) %>%
mutate(
earnings_lag = lag(earnings, 1),
earnings_change = earnings - first(earnings)
) %>%
ungroup()
Aggregating and Summarizing by Group
Many causal inference applications require group-level summaries: average earnings by treatment status, outcome means by cohort, or pre-treatment trends by unit. Group-by operations are fundamental.
library(dplyr)
# Mean earnings by treatment group
df %>% group_by(treatment) %>% summarise(mean_earnings = mean(earnings, na.rm = TRUE))
# Multiple summary statistics by group
df %>%
group_by(treatment) %>%
summarise(
mean_earnings = mean(earnings, na.rm = TRUE),
sd_earnings = sd(earnings, na.rm = TRUE),
n = n()
)
# Collapse to state-year level
collapsed <- df %>%
group_by(state, year) %>%
summarise(mean_earnings = mean(earnings, na.rm = TRUE), n_obs = n(), .groups = "drop")
# Within-group demeaning (fixed effects intuition)
df <- df %>%
group_by(person_id) %>%
mutate(earnings_demeaned = earnings - mean(earnings, na.rm = TRUE)) %>%
ungroup()
Common Data Pitfalls
You merge two datasets on person_id and the number of rows in the merged dataset is three times larger than either original dataset. What most likely happened?
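The most likely answer: person_id is not unique in either dataset, so the merge became many-to-many and produced every pairwise combination of rows within each key. A minimal sketch with hypothetical toy data:

```r
# Hypothetical toy data: person_id duplicated in BOTH inputs
left  <- data.frame(person_id = c(1, 1, 2), x = c("a", "b", "c"))
right <- data.frame(person_id = c(1, 1, 2), y = c("p", "q", "r"))

m <- merge(left, right, by = "person_id")
nrow(m)  # 5: id 1 matches 2 x 2 = 4 ways, id 2 matches 1 x 1 = 1 way

# Guard before merging: a non-zero result means the key is not unique
anyDuplicated(left$person_id)
anyDuplicated(right$person_id)
```

Checking key uniqueness on both sides before every merge catches this pitfall immediately.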
A Data Preparation Checklist
Before running any causal inference method, work through this checklist:
- Load and inspect: How many observations? How many variables? What are the types?
- Check missingness: How much? Is it random or systematic? Does it differ by treatment status?
- Check for duplicates: Is each observation unique at the expected level (person, person-year, etc.)?
- Validate ranges: Are all values plausible? Age between 0 and 120? Earnings non-negative (or are negatives meaningful)?
- Verify merges: Did the merge produce the expected number of rows? Check the match rate.
- Construct variables: Create treatment indicators, logs, lags, differences, interactions.
- Examine balance: Compare treatment and control groups on pre-treatment characteristics. Are they similar?
- Document everything: Keep a record of every cleaning step, every decision about how to handle missing values, every variable construction. Your future self (and your referees) will thank you. Christensen and Miguel (2018) provide a comprehensive guide to transparent and reproducible research practices. Our guide on how to replicate shows what thorough documentation looks like in practice.
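Several of these checks take only a few lines to automate. A minimal sketch, using a small hypothetical person-year panel, of the duplicate check and a simple balance comparison:

```r
# Hypothetical person-year panel
panel_df <- data.frame(
  person_id = c(1, 1, 2, 2),
  year      = c(2018, 2019, 2018, 2019),
  treatment = c(1, 1, 0, 0),
  age       = c(30, 31, 45, 46)
)

# Duplicates: each person-year combination should appear exactly once
sum(duplicated(panel_df[, c("person_id", "year")]))  # 0 means unique at this level

# Balance: compare a pre-treatment characteristic across groups
aggregate(age ~ treatment, data = panel_df, FUN = mean)
```

Running checks like these at the top of every analysis script, rather than once by hand, is what makes the checklist stick.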
What Comes Next
You now have both the conceptual foundation (F1–F6) and the practical skills (this page) to begin learning specific methods. But before we dive in, there is one more story to tell: how did the field of empirical economics get to where it is today? The answer is the credibility revolution — a transformation in how researchers think about evidence that changed what counts as convincing. That story also resolves our training mystery.
Next Step: The Credibility Revolution — The story of how the field learned to demand credible identification, and the resolution of our training mystery.