Tutorial · 90 minutes

Lab: OLS Regression from Scratch

Build OLS intuition by estimating a Mincer earnings equation: robust and clustered SEs, diagnostics, and interpretation as a careful applied researcher.

Method: OLS (Robust SEs, Clustering)
Languages: Python, R, Stata
Dataset: CPS earnings data (simulated)

Overview

In this lab you will estimate the classic Mincer earnings equation using a simulated dataset that mimics the Current Population Survey (CPS). You will move from a simple bivariate regression to a full specification with robust and clustered standard errors, learning to interpret every piece of output along the way.

What you will learn:

How to estimate OLS regressions and read the output
The difference between conventional, robust, and clustered standard errors
How to diagnose heteroscedasticity and multicollinearity
How omitted variable bias changes your estimates
How to write up results for a paper

Prerequisites: Familiarity with basic statistics (mean, variance, correlation). No prior regression experience required.

Step 1: Load and Explore the Data

We will work with a simulated dataset of 5,000 workers containing wages, education, experience, gender, and state of residence.

# First-time setup: install.packages(c("estimatr", "modelsummary"))
library(estimatr)
library(modelsummary)

# Simulate CPS-like data
set.seed(42)
n <- 5000
ability <- rnorm(n)
# Years of schooling: driven by ability, rounded and clipped to [8, 20]
educ <- pmin(pmax(round(12 + 2 * ability + rnorm(n, sd = 1.5)), 8), 20)
exper <- round(runif(n, 0, 30))
female <- rbinom(n, 1, 0.5)
state <- sample(1:50, n, replace = TRUE)

# True DGP: the return to education is 0.06; ability also raises wages
log_wage <- 1.5 + 0.06 * educ + 0.03 * exper - 0.0005 * exper^2 -
  0.15 * female + 0.10 * ability + rnorm(n, sd = 0.4)

df <- data.frame(log_wage, educ, exper, exper_sq = exper^2,
                 female, state, ability)

summary(df[, c("log_wage", "educ", "exper", "female")])

Requires: estimatr, modelsummary

Expected output: Summary statistics

Summary statistics (5,000 workers):

Statistic	log_wage	educ	exper	female
mean	2.370	12.00	15.00	0.50
std	0.560	2.40	8.70	0.50
min	0.450	8.00	0.00	0.00
25%	1.980	10.00	7.00	0.00
50%	2.370	12.00	15.00	1.00
75%	2.760	14.00	23.00	1.00
max	4.200	20.00	30.00	1.00

Sample data (first 5 rows):

log_wage	educ	exper	female
2.35	13	22	0
1.94	10	5	1
2.81	15	18	0
2.12	11	9	1
2.63	14	27	0

Note: Your exact values will differ due to random seed differences across languages, but the distributions should be similar.

Step 2: Simple Bivariate Regression

Start with the simplest possible regression: log wages on education only.

# Simple bivariate regression
m1 <- lm(log_wage ~ educ, data = df)
summary(m1)
cat("Coefficient on educ:", coef(m1)["educ"], "\n")

Expected output: Bivariate regression results

Model 1: log_wage = a + b * educ

Variable	Coefficient	Std. Error	t-statistic	p-value
Intercept	1.2800	0.045	28.44	< 0.001
educ	0.0910	0.004	24.76	< 0.001

R-squared: 0.25 | N: 5,000
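To see what lm() is doing under the hood, the sketch below computes the same estimates by hand from the normal equations, using base R only. The small simulated sample and the names X, XtX_inv, and u_hat are illustrative assumptions, not part of the lab's dataset.

```r
# Minimal sketch: OLS via the normal equations, checked against lm().
# The data here are a small stand-in for the lab's df (an assumption).
set.seed(1)
n <- 500
educ <- round(runif(n, 8, 20))
log_wage <- 1.5 + 0.06 * educ + rnorm(n, sd = 0.4)

X <- cbind(Intercept = 1, educ = educ)          # design matrix with a constant
XtX_inv <- solve(crossprod(X))                  # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, log_wage)  # (X'X)^{-1} X'y

u_hat <- log_wage - X %*% beta_hat              # residuals
sigma2 <- sum(u_hat^2) / (n - ncol(X))          # residual variance, df-adjusted
se <- sqrt(diag(sigma2 * XtX_inv))              # conventional standard errors

fit <- lm(log_wage ~ educ)
print(cbind(by_hand = drop(beta_hat), lm = coef(fit)))
print(cbind(by_hand = se, lm = summary(fit)$coefficients[, "Std. Error"]))
```

The two columns should agree to numerical precision, since lm() solves the same least-squares problem (via a QR decomposition rather than an explicit inverse).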

The coefficient on education (~0.09) is noticeably larger than the true DGP value of 0.06. This upward bias occurs because ability — which is positively correlated with both education and wages — is omitted from the regression.

Concept Check

The true effect of education in our data generating process is 0.06. Your bivariate regression likely produced a coefficient larger than 0.06 (around 0.08-0.10). Why is the estimate biased upward?

The sample size is too small for OLS to work.
Ability is omitted from the regression. Since ability is positively correlated with both education and wages, the education coefficient picks up part of the ability effect.
The log transformation distorts the coefficient.
OLS is inherently biased with non-experimental data.
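The size of the bias can be worked out from the simulation parameters with the standard omitted variable bias formula. The calculation below is an approximation: it assumes Var(ability) = 1 as in the simulation and ignores the rounding and clipping of educ to the range 8 to 20.

```latex
\operatorname{plim}\,\hat{b}_{\text{educ}}
  = \beta_{\text{educ}}
  + \beta_{\text{ability}}\,
    \frac{\operatorname{Cov}(\text{educ},\,\text{ability})}{\operatorname{Var}(\text{educ})}
  \approx 0.06 + 0.10 \times \frac{2}{2^{2} + 1.5^{2}}
  = 0.06 + 0.10 \times \frac{2}{6.25}
  = 0.092
```

This is in line with the bivariate estimate of about 0.09 reported above.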

Step 3: Add Controls and Watch OVB Shrink

Now add experience and gender to the regression. Then, since we simulated the data, we can also add the normally-unobservable ability variable to see the true effect.

# Model 2: Add experience and gender
m2 <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)

# Model 3: Add ability (normally unobservable!)
m3 <- lm(log_wage ~ educ + exper + exper_sq + female + ability, data = df)

# Compare
cat("Coefficient on education:\n")
cat("  Model 1 (bivariate):", coef(m1)["educ"], "\n")
cat("  Model 2 (+ controls):", coef(m2)["educ"], "\n")
cat("  Model 3 (+ ability):", coef(m3)["educ"], "\n")
cat("  True value: 0.06\n")

Expected output: Education coefficient across specifications

Coefficient on education across models:

Model	Coeff. on educ	Std. Error	R-squared	Bias relative to truth
Model 1: Bivariate	0.091	0.004	0.25	+0.031 (upward)
Model 2: + exper, female	0.079	0.003	0.40	+0.019 (upward)
Model 3: + ability	0.061	0.003	0.55	+0.001 (near zero)
True DGP value	0.060	—	—	—

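The education coefficient moves toward 0.06 as we add controls, consistent with omitted variable bias. Only when the normally unobservable ability variable is included does the estimate approach the truth. Because this simulation draws experience and gender independently of education and ability, those two controls remove little of the bias on their own, so your Model 2 coefficient may sit closer to the Model 1 value than the table suggests.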
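The omitted variable bias formula also holds as an exact in-sample identity: the short-regression coefficient equals the long-regression coefficient plus the ability coefficient times the slope from an auxiliary regression of ability on the included regressors. The sketch below re-simulates data along the lines of Step 1 so that it runs on its own; the object names short, long, and aux are illustrative.

```r
# Minimal sketch: verify b_short = b_long + gamma_ability * delta_educ in-sample.
set.seed(42)
n <- 5000
ability <- rnorm(n)
educ <- pmin(pmax(round(12 + 2 * ability + rnorm(n, sd = 1.5)), 8), 20)
exper <- round(runif(n, 0, 30))
female <- rbinom(n, 1, 0.5)
log_wage <- 1.5 + 0.06 * educ + 0.03 * exper - 0.0005 * exper^2 -
  0.15 * female + 0.10 * ability + rnorm(n, sd = 0.4)

short <- lm(log_wage ~ educ + exper + I(exper^2) + female)            # omits ability
long  <- lm(log_wage ~ educ + exper + I(exper^2) + female + ability)  # includes ability
aux   <- lm(ability ~ educ + exper + I(exper^2) + female)             # ability on included regressors

# Long coefficient plus (effect of ability) x (partial slope of ability on educ)
implied <- coef(long)["educ"] + coef(long)["ability"] * coef(aux)["educ"]
cat("short:", coef(short)["educ"], " implied:", implied, "\n")
```

The two printed numbers should match to numerical precision. With real data the decomposition is not computable, because ability is unobserved; that is why sensitivity analysis is used instead.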

Step 4: Robust Standard Errors

Conventional standard errors assume constant error variance (homoscedasticity). Let us test whether the assumption holds and switch to robust standard errors.
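One way to test the constant-variance assumption is Koenker's studentized variant of the Breusch-Pagan test: regress the squared OLS residuals on the regressors and compare n times the R-squared to a chi-squared distribution. The sketch below implements it in base R on two small illustrative datasets (one homoscedastic, one not); the helper name bp_test is hypothetical.

```r
# Minimal sketch: studentized Breusch-Pagan (Koenker) test by hand.
set.seed(7)
n <- 2000
x <- runif(n, 8, 20)
y_homo   <- 1.5 + 0.06 * x + rnorm(n, sd = 0.4)       # constant error variance
y_hetero <- 1.5 + 0.06 * x + rnorm(n, sd = 0.05 * x)  # error sd grows with x

bp_test <- function(y, x) {
  u2 <- resid(lm(y ~ x))^2                 # squared OLS residuals
  r2 <- summary(lm(u2 ~ x))$r.squared      # auxiliary regression of u^2 on x
  lm_stat <- length(y) * r2                # LM statistic = n * R^2
  c(statistic = lm_stat,
    p.value = pchisq(lm_stat, df = 1, lower.tail = FALSE))  # df = number of regressors
}

print(bp_test(y_homo, x))
print(bp_test(y_hetero, x))
```

A small p-value is evidence against homoscedasticity. In practice many applied researchers skip the test and report robust standard errors by default, since they are valid under either case in large samples.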

# Robust standard errors with estimatr
m2_robust <- lm_robust(log_wage ~ educ + exper + exper_sq + female,
                       data = df, se_type = "HC2")

# Compare SEs
cat("SE on education (conventional):", summary(m2)$coefficients["educ", "Std. Error"], "\n")
cat("SE on education (robust HC2):", m2_robust$std.error["educ"], "\n")

Requires: estimatr

Expected output: Standard error comparison

SE comparison for the education coefficient (Model 2):

SE Type	Coefficient	Std. Error	95% CI	p-value
Conventional	0.079	0.00320	[0.073, 0.085]	< 0.001
Robust (HC2)	0.079	0.00325	[0.073, 0.085]	< 0.001

The coefficients are identical across SE types; only the standard errors change. Here the conventional and robust SEs are nearly the same, which indicates little or no heteroscedasticity, as expected given that the simulated errors are drawn with constant variance.
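Robust standard errors come from a "sandwich" formula: the usual (X'X)^{-1} on the outside and a sum of squared-residual-weighted outer products in the middle. The sketch below computes the HC1 and HC2 variants by hand in base R on a small, deliberately heteroscedastic dataset (an illustrative assumption, not the lab's df).

```r
# Minimal sketch: heteroscedasticity-robust (sandwich) SEs by hand.
set.seed(3)
n <- 1000
x <- runif(n, 8, 20)
y <- 1.5 + 0.06 * x + rnorm(n, sd = 0.05 * x)  # heteroscedastic by construction
fit <- lm(y ~ x)

X <- model.matrix(fit)
u <- resid(fit)
k <- ncol(X)
bread <- solve(crossprod(X))                   # (X'X)^{-1}
h <- hatvalues(fit)                            # leverage values h_ii

V_hc0 <- bread %*% crossprod(X * u) %*% bread  # meat: sum of u_i^2 x_i x_i'
V_hc1 <- V_hc0 * n / (n - k)                   # degrees-of-freedom rescaling
V_hc2 <- bread %*% crossprod(X * (u / sqrt(1 - h))) %*% bread  # u_i^2 / (1 - h_ii)

print(rbind(conventional = sqrt(diag(vcov(fit))),
            HC1 = sqrt(diag(V_hc1)),
            HC2 = sqrt(diag(V_hc2))))
```

With a few thousand observations HC1 and HC2 are usually very close; they differ mainly in small samples or when some observations have high leverage.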

Step 5: Clustered Standard Errors

When your treatment or key variable of interest varies at a group level (e.g., state policy), you typically need to cluster standard errors at that level. Let us simulate a scenario where state-level factors affect wages.

# Add state-level shocks
state_effects <- rnorm(50, sd = 0.3)
df$log_wage_state <- df$log_wage + state_effects[df$state]

# Robust (HC2) SEs on the same outcome, so the comparison is like for like
m_state_robust <- lm_robust(log_wage_state ~ educ + exper + exper_sq + female,
                            data = df, se_type = "HC2")

# Clustered SEs
m_cluster <- lm_robust(log_wage_state ~ educ + exper + exper_sq + female,
                       data = df,
                       clusters = state,
                       se_type = "CR2")

cat("SE on education (robust):", m_state_robust$std.error["educ"], "\n")
cat("SE on education (clustered):", m_cluster$std.error["educ"], "\n")

Expected output: Conventional vs. robust vs. clustered SEs

Standard errors on education coefficient (with state-level shocks):

SE Type	Coefficient	Std. Error	Change vs. Conventional
Conventional	0.079	0.00310	—
Robust (HC2)	0.079	0.00325	+5%
Clustered by state (50)	0.079	0.00480	+55%

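Clustered standard errors are typically larger than robust SEs because they account for within-state correlation in the error terms. The coefficient itself is unchanged; only the estimated precision changes. How much the SE grows depends on how strongly both the errors and the regressor are correlated within states. In this simulation educ is drawn independently of state, so the increase you see for educ may be smaller than in the table, whereas a regressor that varies only at the state level would show a much larger increase.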
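The cluster-robust variance is the same sandwich as before, except that the score contributions x_i u_i are summed within each cluster before forming outer products. The sketch below uses the simpler CR1 (Stata-style) correction rather than the CR2 used in the lab code, and a hypothetical state-level regressor named policy, to show the case where clustering matters most.

```r
# Minimal sketch: cluster-robust (CR1) SEs by hand for a state-level regressor.
set.seed(11)
G <- 50
n <- 5000
state <- sample(1:G, n, replace = TRUE)
policy <- rbinom(G, 1, 0.5)[state]             # hypothetical regressor, constant within state
y <- 2 + 0.1 * policy + rnorm(G, sd = 0.3)[state] + rnorm(n, sd = 0.4)
fit <- lm(y ~ policy)

X <- model.matrix(fit)
u <- resid(fit)
k <- ncol(X)
bread <- solve(crossprod(X))                   # (X'X)^{-1}

scores <- rowsum(X * u, group = state)         # G x k: sum of x_i * u_i within each state
adj <- (G / (G - 1)) * ((n - 1) / (n - k))     # CR1 small-sample correction
V_cl <- adj * bread %*% crossprod(scores) %*% bread

V_hc1 <- bread %*% crossprod(X * u) %*% bread * n / (n - k)

# Element [2, 2] is the variance of the policy coefficient
print(c(robust = sqrt(V_hc1[2, 2]), clustered = sqrt(V_cl[2, 2])))
```

Because policy does not vary within a state and the errors share a state component, the clustered SE should come out several times larger than the robust one here.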

Concept Check

When should you cluster standard errors? Select the best answer.

Always: clustered SEs are always more conservative.
When your treatment or key regressor varies at a group level (e.g., state policies affecting individual workers), and errors are likely correlated within groups.
Only when you have panel data.
When the sample size is small.

Step 6: Diagnosing Problems

Check for Multicollinearity

# First-time setup: install.packages(c("car"))
library(car)

# VIF requires a standard lm object
m_vif <- lm(log_wage ~ educ + exper + exper_sq + female, data = df)
vif(m_vif)
# Rule of thumb: VIF > 10 indicates problematic multicollinearity

Requires: car

Expected output: Variance Inflation Factors

Variable	VIF	Diagnosis
educ	1.05	No concern
exper	19.80	High — but expected (see note below)
exper_sq	19.50	High — but expected (see note below)
female	1.01	No concern

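Rule of thumb: VIF > 10 suggests problematic multicollinearity. However, the high VIF for exper and exper_sq is mechanical and expected, because exper_sq is an exact function of exper. It inflates the standard errors of the two experience terms individually but leaves the education coefficient largely unaffected, as the low VIF for educ shows.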
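The VIF for a regressor is 1 / (1 - R²), where R² comes from regressing that variable on all the other regressors. The sketch below computes it by hand in base R and shows that centering experience before squaring removes most of the mechanical collinearity. The helper vif_of and the simulated regressors are illustrative assumptions.

```r
# Minimal sketch: VIF by hand, with and without centering experience.
set.seed(5)
n <- 5000
exper <- round(runif(n, 0, 30))
educ <- round(runif(n, 8, 20))

vif_of <- function(target, others) {
  # R^2 from regressing the target on all other regressors
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)                                 # VIF_j = 1 / (1 - R_j^2)
}

raw <- vif_of(exper, data.frame(exper_sq = exper^2, educ))

exper_c <- exper - mean(exper)                 # center before squaring
centered <- vif_of(exper_c, data.frame(exper_sq = exper_c^2, educ))

print(c(raw = raw, centered = centered))
```

Centering changes how the linear term is interpreted (it becomes the slope at mean experience) but not the fitted values or the coefficient on the squared term.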

Residual Plot

# Base R diagnostic plots
par(mfrow = c(2, 2))
plot(m2)

# The first plot (Residuals vs Fitted) is the key one
# Look for: random scatter (good) vs funnel/pattern (bad)

Expected visualization: Residuals vs. Fitted Values

What to expect: A scatter plot with fitted values on the x-axis (ranging from about 1.5 to 3.5) and residuals on the y-axis (ranging from about -1.2 to +1.2). A red dashed horizontal line at y = 0 marks the reference.

How to interpret it:

Random cloud around zero: The residuals should appear as a roughly symmetric cloud centered on zero with no obvious pattern — this pattern indicates the linear functional form is reasonable.
Slight funnel shape: You may notice the spread of residuals is slightly wider at higher fitted values, suggesting mild heteroscedasticity. This pattern is common in wage regressions and is why we use robust standard errors.
No curvature: The absence of a U-shape or inverted-U pattern confirms that the quadratic experience term adequately captures the nonlinearity in the experience-earnings profile.

If you see a strong funnel, curved pattern, or clusters, revisit your model specification.

Step 7: Present Results for Publication

A well-formatted regression table is important for any empirical paper. Here is how to produce one.

# First-time setup: install.packages(c("modelsummary"))
library(modelsummary)

# Publication-quality table
models <- list(
  "(1)" = lm_robust(log_wage ~ educ, data = df, se_type = "HC2"),
  "(2)" = lm_robust(log_wage ~ educ + exper + exper_sq + female,
                    data = df, se_type = "HC2")
)

modelsummary(models,
             stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
             gof_map = c("nobs", "r.squared"),
             output = "default")

Requires: modelsummary

Expected output: Publication-quality regression table

Table 1: Returns to Education — OLS Estimates

	(1) Bivariate	(2) Full Model
Education	0.0910***	0.0790***
	(0.0037)	(0.0033)
Experience		0.0310***
		(0.0025)
Experience-squared		-0.0005***
		(0.0001)
Female		-0.1520***
		(0.0115)
Intercept	1.2800***	1.1600***
	(0.0450)	(0.0520)
N	5,000	5,000
R-squared	0.250	0.400

Robust standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.
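When writing up a log-wage regression, coefficients are in log points. For small values they approximate percentage changes, and the exact conversion is 100 × (exp(b) − 1). The sketch below applies that conversion to two coefficients copied from column (2) of Table 1; the vector name b is illustrative.

```r
# Minimal sketch: convert log-point coefficients to exact percentage effects.
b <- c(educ = 0.079, female = -0.152)   # copied from Table 1, column (2)

pct_approx <- 100 * b                   # log-point approximation
pct_exact  <- 100 * (exp(b) - 1)        # exact percentage change

print(round(rbind(approx = pct_approx, exact = pct_exact), 1))
```

For education the exact figure is about 8.2 percent per additional year, close to the 7.9 log-point approximation. For the female coefficient it is about -14.1 percent rather than -15.2, so the gap between the two conversions grows with the size of the coefficient.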

Step 8: Exercises

Add interaction terms. Does the return to education differ for men and women? Add educ * female to your regression and interpret the interaction coefficient.
Try log-log. Replace educ with log(educ) and interpret the new coefficient as an elasticity.
Subsample analysis. Split the data by gender and estimate separate regressions. Compare the coefficients. Are the differences statistically significant?
Sensitivity check. Use the sensemakr package (R) or equivalent to assess how sensitive your estimates are to unobserved confounders (the Cinelli and Hazlett (2020) approach).

Summary

In this lab you learned:

OLS estimates the conditional mean relationship — causal interpretation requires additional assumptions
The coefficient changes when you add or remove controls — this instability is omitted variable bias in action
In most settings, use robust standard errors; cluster when the treatment varies at a group level
High VIF between a variable and its square is expected and not a problem
A regression table should report coefficients, standard errors (not t-stats), sample size, and R-squared
Careful interpretation and honest discussion of limitations are what distinguish good applied work
