Lab: OLS Regression from Scratch
Build intuition for OLS regression by estimating a Mincer earnings equation step by step. Learn to compute robust and clustered standard errors, diagnose problems, and interpret results like a careful applied researcher.
Overview
In this lab you will estimate the classic Mincer earnings equation using a simulated dataset that mimics the Current Population Survey (CPS). You will move from a simple bivariate regression to a full specification with robust and clustered standard errors, learning to interpret every piece of output along the way.
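Concretely, the full specification you will build up to (mirroring the data generating process in Step 1, with a gender dummy added to the classic education and experience terms) is:

```latex
\log(w_i) = \beta_0 + \beta_1\,\mathrm{educ}_i + \beta_2\,\mathrm{exper}_i + \beta_3\,\mathrm{exper}_i^2 + \beta_4\,\mathrm{female}_i + \varepsilon_i
```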
What you will learn:
- How to estimate OLS regressions and read the output
- The difference between conventional, robust, and clustered standard errors
- How to diagnose heteroscedasticity and multicollinearity
- How omitted variable bias changes your estimates
- How to write up results for a paper
Prerequisites: Familiarity with basic statistics (mean, variance, correlation). No prior regression experience required.
Step 1: Load and Explore the Data
We will work with a simulated dataset of 5,000 workers containing wages, education, experience, gender, and state of residence.
library(estimatr)
library(modelsummary)
# Simulate CPS-like data
set.seed(42)
n <- 5000
ability <- rnorm(n)
educ <- pmin(pmax(round(12 + 2 * ability + rnorm(n, sd = 1.5)), 8), 20)
exper <- round(runif(n, 0, 30))
female <- rbinom(n, 1, 0.5)
state <- sample(1:50, n, replace = TRUE)
log_wage <- 1.5 + 0.06 * educ + 0.03 * exper - 0.0005 * exper^2 -
  0.15 * female + 0.10 * ability + rnorm(n, sd = 0.4)
df <- data.frame(log_wage, educ, exper, exper_sq = exper^2,
                 female, state, ability)
summary(df[, c("log_wage", "educ", "exper", "female")])
Step 2: Simple Bivariate Regression
Start with the simplest possible regression: log wages on education only.
# Simple bivariate regression
m1 <- lm(log_wage ~ educ, data = df)
summary(m1)
cat("Coefficient on educ:", coef(m1)["educ"], "\n")
The true effect of education in our data generating process is 0.06. Your bivariate regression likely produced a coefficient larger than 0.06 (around 0.08-0.10). Why is the estimate biased upward?
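The direction and size of the bias follow from the omitted variable bias formula: the probability limit of the bivariate coefficient equals the true effect plus ability's effect on log wages (0.10 in our simulation) times the slope from an auxiliary regression of the omitted variable on education. A quick sketch of that calculation:

```r
# OVB formula: plim(beta_hat) = beta_true + gamma * delta, where
# gamma = 0.10 is ability's effect on log wages (set in Step 1) and
# delta is the slope from regressing the omitted variable on educ.
aux <- lm(ability ~ educ, data = df)
delta <- unname(coef(aux)["educ"])
cat("Auxiliary slope (delta):", delta, "\n")
cat("Predicted biased coefficient:", 0.06 + 0.10 * delta, "\n")
cat("Actual bivariate coefficient:", unname(coef(m1)["educ"]), "\n")
```

The predicted and actual numbers should be close here, because the other omitted terms (experience, gender) are independent of education by construction.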
Step 3: Add Controls and Watch OVB Shrink
Now add experience and gender to the regression. Then, because we simulated the data, we can also add the normally unobservable ability variable and watch the estimate move to the true effect.
# Model 2: Add experience and gender
m2 <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)
# Model 3: Add ability (normally unobservable!)
m3 <- lm(log_wage ~ educ + exper + exper_sq + female + ability, data = df)
# Compare
cat("Coefficient on education:\n")
cat(" Model 1 (bivariate):", coef(m1)["educ"], "\n")
cat(" Model 2 (+ controls):", coef(m2)["educ"], "\n")
cat(" Model 3 (+ ability):", coef(m3)["educ"], "\n")
cat("  True value: 0.06\n")
Step 4: Robust Standard Errors
Conventional standard errors assume constant error variance (homoscedasticity). Let us test whether that holds and switch to robust standard errors.
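One common formal check is the Breusch-Pagan test, available in the `lmtest` package (an extra dependency beyond the packages loaded above); it asks whether the covariates help predict the squared residuals. A sketch, assuming `m2` from Step 3 is still in memory:

```r
# Breusch-Pagan test for heteroscedasticity (requires the lmtest package)
library(lmtest)
bptest(m2)
# A small p-value rejects the null of constant error variance;
# in that case robust standard errors are the safer default
```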
# Robust standard errors with estimatr
m2_robust <- lm_robust(log_wage ~ educ + exper + exper_sq + female,
                       data = df, se_type = "HC2")
# Compare SEs
cat("SE on education (conventional):", summary(m2)$coefficients["educ", "Std. Error"], "\n")
cat("SE on education (robust HC2):", m2_robust$std.error["educ"], "\n")
Step 5: Clustered Standard Errors
When your treatment or key variable of interest varies at a group level (e.g., state policy), you must cluster standard errors at that level. Let us simulate a scenario where state-level factors affect wages.
# Add state-level shocks
state_effects <- rnorm(50, sd = 0.3)
df$log_wage_state <- df$log_wage + state_effects[df$state]
# Clustered SEs
m_cluster <- lm_robust(log_wage_state ~ educ + exper + exper_sq + female,
                       data = df,
                       clusters = state,
                       se_type = "CR2")
cat("SE on education (robust):", m2_robust$std.error["educ"], "\n")
cat("SE on education (clustered):", m_cluster$std.error["educ"], "\n")
When should you cluster standard errors?
Step 6: Diagnosing Problems
Check for Multicollinearity
library(car)
# VIF requires a standard lm object
m_vif <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)
vif(m_vif)
# Rule of thumb: VIF > 10 indicates problematic multicollinearity
# (exper and exper_sq will exceed this by construction; high VIF between
# a variable and its own square is expected and harmless)
Residual Plot
# Base R diagnostic plots
par(mfrow = c(2, 2))
plot(m2)
# The first plot (Residuals vs Fitted) is the key one
# Look for: random scatter (good) vs funnel/pattern (bad)
Step 7: Present Results for Publication
A well-formatted regression table is essential for any empirical paper. Here is how to produce one.
library(modelsummary)
# Publication-quality table
models <- list(
  "(1)" = lm_robust(log_wage ~ educ, data = df, se_type = "HC2"),
  "(2)" = lm_robust(log_wage ~ educ + exper + exper_sq + female,
                    data = df, se_type = "HC2")
)
modelsummary(models,
             stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
             gof_map = c("nobs", "r.squared"),
             output = "default")
Step 8: Exercises
Try these on your own:
- Add interaction terms. Does the return to education differ for men and women? Add educ * female to your regression and interpret the interaction coefficient.
- Try log-log. Replace educ with log(educ) and interpret the new coefficient as an elasticity.
- Subsample analysis. Split the data by gender and estimate separate regressions. Compare the coefficients. Are the differences statistically significant?
- Sensitivity check. Use the sensemakr package (R) or equivalent to assess how sensitive your estimates are to unobserved confounders (the Cinelli and Hazlett (2020) approach).
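For the last exercise, a hedged starting point (assuming the `sensemakr` package is installed; the argument names follow its documented interface, and `female` is used here only as an arbitrary benchmark covariate):

```r
# Sensitivity analysis a la Cinelli & Hazlett (2020) -- starting point only
library(sensemakr)
sens <- sensemakr(model = m2, treatment = "educ",
                  benchmark_covariates = "female", kd = 1:3)
summary(sens)
plot(sens)  # contours: how strong a confounder must be to overturn the result
```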
Summary
In this lab you learned:
- OLS estimates the conditional mean relationship — causal interpretation requires additional assumptions
- The coefficient changes when you add or remove controls — this instability is omitted variable bias in action
- In most settings, use robust standard errors; cluster when the treatment varies at a group level
- High VIF between a variable and its square is expected and not a problem
- A regression table should report coefficients, standard errors (not t-stats), sample size, and R-squared
- Careful interpretation and honest discussion of limitations are what distinguish good applied work