MethodAtlas
tutorial90 minutes

Lab: OLS Regression from Scratch

Build intuition for OLS regression by estimating a Mincer earnings equation step by step. Learn to compute robust and clustered standard errors, diagnose problems, and interpret results like a careful applied researcher.

Overview

In this lab you will estimate the classic Mincer earnings equation using a simulated dataset that mimics the Current Population Survey (CPS). You will move from a simple bivariate regression to a full specification with robust and clustered standard errors, learning to interpret every piece of output along the way.

What you will learn:

  • How to estimate OLS regressions and read the output
  • The difference between conventional, robust, and clustered standard errors
  • How to diagnose heteroscedasticity and multicollinearity
  • How omitted variable bias changes your estimates
  • How to write up results for a paper

Prerequisites: Familiarity with basic statistics (mean, variance, correlation). No prior regression experience required.


Step 1: Load and Explore the Data

We will work with a simulated dataset of 5,000 workers containing wages, education, experience, gender, and state of residence.

library(estimatr)
library(modelsummary)

# Simulate CPS-like data
set.seed(42)
n <- 5000
ability <- rnorm(n)
educ <- pmin(pmax(round(12 + 2 * ability + rnorm(n, sd = 1.5)), 8), 20)
exper <- round(runif(n, 0, 30))
female <- rbinom(n, 1, 0.5)
state <- sample(1:50, n, replace = TRUE)

log_wage <- 1.5 + 0.06 * educ + 0.03 * exper - 0.0005 * exper^2 -
          0.15 * female + 0.10 * ability + rnorm(n, sd = 0.4)

df <- data.frame(log_wage, educ, exper, exper_sq = exper^2,
               female, state, ability)

summary(df[, c("log_wage", "educ", "exper", "female")])

Step 2: Simple Bivariate Regression

Start with the simplest possible regression: log wages on education only.

# Simple bivariate regression
m1 <- lm(log_wage ~ educ, data = df)
summary(m1)
cat("Coefficient on educ:", coef(m1)["educ"], "\n")
Concept Check

The true effect of education in our data generating process is 0.06. Your bivariate regression likely produced a coefficient larger than 0.06 (around 0.08-0.10). Why is the estimate biased upward?


Step 3: Add Controls and Watch OVB Shrink

Now add experience and gender to the regression. Then, since we simulated the data, we can also add the normally-unobservable ability variable to see the true effect.

# Model 2: Add experience and gender
m2 <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)

# Model 3: Add ability (normally unobservable!)
m3 <- lm(log_wage ~ educ + exper + exper_sq + female + ability, data = df)

# Compare
cat("Coefficient on education:\n")
cat("  Model 1 (bivariate):", coef(m1)["educ"], "\n")
cat("  Model 2 (+ controls):", coef(m2)["educ"], "\n")
cat("  Model 3 (+ ability):", coef(m3)["educ"], "\n")
cat("  True value: 0.06\n")

Step 4: Robust Standard Errors

Conventional standard errors assume constant error variance (homoscedasticity). Let us test whether that holds and switch to robust standard errors.

# Robust standard errors with estimatr
m2_robust <- lm_robust(log_wage ~ educ + exper + exper_sq + female,
                      data = df, se_type = "HC2")

# Compare SEs
cat("SE on education (conventional):", summary(m2)$coefficients["educ", "Std. Error"], "\n")
cat("SE on education (robust HC2):", m2_robust$std.error["educ"], "\n")
Requiresestimatr

Step 5: Clustered Standard Errors

When your treatment or key variable of interest varies at a group level (e.g., state policy), you must cluster standard errors at that level. Let us simulate a scenario where state-level factors affect wages.

# Add state-level shocks
state_effects <- rnorm(50, sd = 0.3)
df$log_wage_state <- df$log_wage + state_effects[df$state]

# Clustered SEs
m_cluster <- lm_robust(log_wage_state ~ educ + exper + exper_sq + female,
                      data = df,
                      clusters = state,
                      se_type = "CR2")

cat("SE on education (robust):", m2_robust$std.error["educ"], "\n")
cat("SE on education (clustered):", m_cluster$std.error["educ"], "\n")
Concept Check

When should you cluster standard errors? Select the best answer.


Step 6: Diagnosing Problems

Check for Multicollinearity

library(car)

# VIF requires a standard lm object
m_vif <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)
vif(m_vif)
# Rule of thumb: VIF > 10 indicates problematic multicollinearity
Requirescar

Residual Plot

# Base R diagnostic plots
par(mfrow = c(2, 2))
plot(m2)

# The first plot (Residuals vs Fitted) is the key one
# Look for: random scatter (good) vs funnel/pattern (bad)

Step 7: Present Results for Publication

A well-formatted regression table is essential for any empirical paper. Here is how to produce one.

library(modelsummary)

# Publication-quality table
models <- list(
"(1)" = lm_robust(log_wage ~ educ, data = df, se_type = "HC2"),
"(2)" = lm_robust(log_wage ~ educ + exper + exper_sq + female,
                   data = df, se_type = "HC2")
)

modelsummary(models,
           stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
           gof_map = c("nobs", "r.squared"),
           output = "default")
Requiresmodelsummary

Step 8: Exercises

Try these on your own:

  1. Add interaction terms. Does the return to education differ for men and women? Add educ * female to your regression and interpret the interaction coefficient.

  2. Try log-log. Replace educ with log(educ) and interpret the new coefficient as an elasticity.

  3. Subsample analysis. Split the data by gender and estimate separate regressions. Compare the coefficients. Are the differences statistically significant?

  4. Sensitivity check. Use the sensemakr package (R) or equivalent to assess how sensitive your estimates are to unobserved confounders (the Cinelli and Hazlett (2020) approach).


Summary

In this lab you learned:

  • OLS estimates the conditional mean relationship — causal interpretation requires additional assumptions
  • The coefficient changes when you add or remove controls — this instability is omitted variable bias in action
  • In most settings, use robust standard errors; cluster when the treatment varies at a group level
  • High VIF between a variable and its square is expected and not a problem
  • A regression table should report coefficients, standard errors (not t-stats), sample size, and R-squared
  • Careful interpretation and honest discussion of limitations are what distinguish good applied work