Lab: Fixed Effects Regression
Master fixed effects regression step by step. Learn to estimate within transformations by hand, compare pooled OLS with one-way and two-way fixed effects, conduct the Hausman test, and cluster standard errors correctly.
Overview
In this lab you will estimate the effect of worker training on firm productivity using simulated panel data. Fixed effects regression controls for unobserved, time-invariant characteristics by exploiting within-unit variation over time. You will build intuition by computing the within transformation manually, then use standard packages, add two-way fixed effects, and learn when and how to cluster standard errors.
What you will learn:
- Why pooled OLS is biased when unobserved heterogeneity is correlated with regressors
- How the within transformation eliminates time-invariant confounders
- How to estimate one-way and two-way fixed effects models
- How to perform the Hausman test (FE vs. RE)
- How and why to cluster standard errors in panel data
Prerequisites: Familiarity with OLS regression and basic panel data concepts.
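Throughout the lab we work with the standard one-way fixed effects model (notation is ours, matching the usual panel-data conventions):

```latex
y_{it} = \alpha_i + \beta\, x_{it} + \varepsilon_{it}, \qquad i = 1,\dots,N, \quad t = 1,\dots,T,
```

where $\alpha_i$ is unobserved, time-invariant heterogeneity (here, firm ability). Pooled OLS treats $\alpha_i$ as part of the error term and is biased whenever $\operatorname{Cov}(\alpha_i, x_{it}) \neq 0$.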
Step 1: Simulate Firm Panel Data
We create a balanced panel of 200 firms observed over 10 years. Each firm has an unobserved fixed ability that is correlated with the training variable, creating omitted variable bias in pooled OLS.
library(fixest)
library(modelsummary)
# Set seed for reproducibility
set.seed(42)
N <- 200 # Number of firms
T_periods <- 10 # Number of years
# Firm fixed effects: unobserved ability (correlated with training)
firm_fe <- rnorm(N, sd = 2)
# Year fixed effects: aggregate time trend
year_fe <- seq(0, 1, length.out = T_periods)
# Build panel indices: each firm observed in every year
firm_id <- rep(1:N, each = T_periods)
year <- rep(2010:(2010 + T_periods - 1), N)
fe_i <- rep(firm_fe, each = T_periods)
fe_t <- rep(year_fe, N)
# Training intensity: positively correlated with firm ability (creates OVB)
training <- pmin(pmax(2 + 0.5 * fe_i + rnorm(N * T_periods), 0), 10)
# Firm size control (time-varying, also correlated with ability)
log_emp <- 4 + 0.3 * fe_i + rnorm(N * T_periods, sd = 0.5)
# True DGP: productivity = firm_fe + year_fe + 0.30*training + 0.5*log_emp + noise
productivity <- fe_i + fe_t + 0.30 * training + 0.5 * log_emp +
rnorm(N * T_periods, sd = 1.5)
df <- data.frame(firm_id = factor(firm_id), year = factor(year),
productivity, training, log_emp)
cat("Panel:", N, "firms x", T_periods, "years =", nrow(df), "obs\n")Expected output:
Panel: 200 firms x 10 years = 2000 obs
| Statistic | productivity | training |
|---|---|---|
| count | 2000.00 | 2000.00 |
| mean | 4.52 | 2.15 |
| std | 2.74 | 1.38 |
| min | -2.31 | 0.12 |
| 25% | 2.58 | 1.19 |
| 50% | 4.48 | 2.08 |
| 75% | 6.41 | 3.04 |
| max | 11.85 | 5.67 |
Step 2: Pooled OLS (Biased Baseline)
# Pooled OLS with heteroskedasticity-robust SEs
m_pooled <- feols(productivity ~ training + log_emp, data = df, vcov = "hetero")
cat("=== Pooled OLS ===\n")
cat("Coefficient on training:", coef(m_pooled)["training"], "\n")
cat("Standard error:", se(m_pooled)["training"], "\n")
cat("True effect: 0.30\n")
cat("Bias:", coef(m_pooled)["training"] - 0.30, "\n")

Expected output:
=== Pooled OLS ===
Coefficient on training: 0.5238
Standard error: 0.0282
True effect: 0.3000
Bias: 0.2238
The coefficient is biased upward because training is
correlated with unobserved firm ability.
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| Intercept | -0.1425 | 0.203 | -0.70 | 0.483 |
| training | 0.5238 | 0.028 | 18.57 | 0.000 |
| log_emp | 0.8519 | 0.047 | 18.13 | 0.000 |
The training coefficient (~0.52) is far above the true value of 0.30, confirming upward bias from omitted firm ability.
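The size of this bias follows the standard omitted variable bias formula. Ignoring the other control for simplicity, omitting the firm effect $\alpha_i$ gives:

```latex
\operatorname{plim}\, \hat{\beta}_{\text{OLS}} = \beta + \frac{\operatorname{Cov}(x_{it}, \alpha_i)}{\operatorname{Var}(x_{it})}.
```

Because training loads positively on firm ability in the simulation (the `0.5 * fe_i` term), the covariance is positive and the bias is upward, consistent with the ~0.22 gap observed here.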
Step 3: Within Transformation by Hand
To build intuition, let us compute the fixed effects estimator manually using the within (demeaning) transformation.
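Formally, averaging the model over time within each firm and subtracting the firm mean yields the demeaned equation:

```latex
y_{it} - \bar{y}_i = \beta\,(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i).
```

The fixed effect $\alpha_i$ cancels because it is constant within each firm. OLS on the demeaned data is exactly the within (fixed effects) estimator.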
# Within transformation by hand
demean <- function(x, id) x - ave(x, id)
df$prod_dm <- demean(df$productivity, df$firm_id)
df$train_dm <- demean(df$training, df$firm_id)
df$lemp_dm <- demean(df$log_emp, df$firm_id)
# No intercept: demeaned variables have mean zero within each firm.
# Note: lm's default SEs are slightly too small here, because the
# degrees of freedom are not adjusted for the N estimated firm means.
m_within <- lm(prod_dm ~ train_dm + lemp_dm - 1, data = df)
cat("=== Within Transformation (Manual FE) ===\n")
cat("Coefficient on training:", coef(m_within)["train_dm"], "\n")
cat("True effect: 0.30\n")
cat("Bias:", coef(m_within)["train_dm"] - 0.30, "\n")Expected output:
=== Within Transformation (Manual FE) ===
Coefficient on training: 0.3052
True effect: 0.3000
Bias: 0.0052
The within transformation eliminates firm fixed effects.
The estimate is now much closer to the true value.
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| train_dm | 0.3052 | 0.035 | 8.72 | 0.000 |
| lemp_dm | 0.4875 | 0.069 | 7.07 | 0.000 |
After demeaning, the training coefficient drops from ~0.52 to ~0.31, very close to the true value of 0.30.
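An equivalent way to see why demeaning works: the within estimator is numerically identical to least squares dummy variables (LSDV), i.e., OLS with one dummy per firm. A minimal sketch on toy data (the data and variable names here are made up for illustration):

```r
# LSDV equivalence: within estimator == OLS with firm dummies
set.seed(1)
toy <- data.frame(id = factor(rep(1:5, each = 4)),
                  x  = rnorm(20))
toy$y <- 2 * as.numeric(toy$id) + 0.3 * toy$x + rnorm(20)

# Within: demean y and x by firm, regress without intercept
dm <- function(v, g) v - ave(v, g)
b_within <- coef(lm(dm(y, id) ~ dm(x, id) - 1, data = toy))[1]

# LSDV: include the firm dummies explicitly
b_lsdv <- coef(lm(y ~ x + id, data = toy))["x"]

all.equal(unname(b_within), unname(b_lsdv))  # coefficients agree
```

The two coefficients match to machine precision; only the default standard errors differ (LSDV uses the correct degrees of freedom automatically).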
Why does the within transformation (demeaning by firm) eliminate the omitted variable bias from firm ability?
Step 4: One-Way and Two-Way Fixed Effects
Now use standard packages to estimate FE models efficiently.
# One-way FE using fixest
m_fe1 <- feols(productivity ~ training + log_emp | firm_id,
data = df)
# Two-way FE
m_fe2 <- feols(productivity ~ training + log_emp | firm_id + year,
data = df)
# Compare
modelsummary(list("Pooled" = m_pooled, "FE (firm)" = m_fe1,
"FE (firm+year)" = m_fe2),
stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
coef_map = c("training" = "Training", "log_emp" = "Log Employees"),
gof_map = c("nobs", "r.squared.within"))Expected output:
| Model | Training Coeff | SE | R-sq (within) |
|---|---|---|---|
| Pooled OLS | 0.5238 | 0.0282 | 0.310 |
| One-way FE (firm) | 0.3052 | 0.0354 | 0.210 |
| Two-way FE (firm+year) | 0.3031 | 0.0355 | 0.215 |
| True effect | 0.3000 | — | — |
Adding firm fixed effects eliminates the bias. Adding year fixed effects produces only a minor change here because the year effects in the DGP are small.
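The two-way model simply adds a common time effect $\lambda_t$ alongside the firm effect:

```latex
y_{it} = \alpha_i + \lambda_t + \beta\, x_{it} + \varepsilon_{it}.
```

The $\lambda_t$ terms absorb any shock common to all firms in year $t$ (recessions, industry-wide technology changes), so identification now comes only from within-firm deviations from both the firm mean and the year mean.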
Step 5: The Hausman Test
The Hausman test compares fixed effects (consistent but less efficient) with random effects (more efficient if the assumptions hold). A rejection means the RE assumptions are violated and FE should be used.
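The test statistic contrasts the two coefficient vectors, weighted by the difference in their estimated variances:

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
\left[\widehat{\operatorname{Var}}(\hat{\beta}_{FE}) - \widehat{\operatorname{Var}}(\hat{\beta}_{RE})\right]^{-1}
(\hat{\beta}_{FE} - \hat{\beta}_{RE}) \;\xrightarrow{d}\; \chi^2_k
```

under the null, where $k$ is the number of time-varying regressors (here $k = 2$). Large values indicate that FE and RE estimates diverge by more than sampling noise can explain.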
library(plm)
# Convert to pdata.frame
pdf <- pdata.frame(df, index = c("firm_id", "year"))
# FE and RE
fe_plm <- plm(productivity ~ training + log_emp, data = pdf, model = "within")
re_plm <- plm(productivity ~ training + log_emp, data = pdf, model = "random")
# Hausman test
ht <- phtest(fe_plm, re_plm)
print(ht)
if (ht$p.value < 0.05) {
cat("=> Reject H0: use Fixed Effects\n")
} else {
cat("=> Fail to reject: Random Effects may be appropriate\n")
}

Expected output:
Hausman test statistic: 28.45
p-value: 0.000001
=> Reject H0: use Fixed Effects
| Test | Statistic | df | p-value | Decision |
|---|---|---|---|---|
| Hausman | 28.45 | 2 | < 0.001 | Reject RE; use FE |
The Hausman test strongly rejects the null that RE is consistent, confirming that firm effects are correlated with the regressors in this DGP.
Step 6: Cluster Standard Errors
In panel data, errors are typically correlated within firms over time. Clustering at the firm level accounts for this serial correlation.
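The cluster-robust (sandwich) variance estimator sums the score contributions within each cluster $g$ (here, each firm) before squaring, which allows arbitrary correlation of errors inside a firm:

```latex
\widehat{\operatorname{Var}}_{\text{cl}}(\hat{\beta})
= (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1},
```

where $X_g$ and $\hat{u}_g$ stack the regressors and residuals of cluster $g$. Heteroskedasticity-robust SEs are the special case where every observation is its own cluster.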
# Compare robust vs clustered SEs on the FE model
m_fe_robust <- feols(productivity ~ training + log_emp | firm_id,
data = df, vcov = "hetero")
m_fe_cluster <- feols(productivity ~ training + log_emp | firm_id,
data = df, vcov = ~firm_id)
cat("=== Standard Errors on Training ===\n")
cat("Pooled OLS (robust):", se(m_pooled)["training"], "\n")
cat("FE (robust):", se(m_fe_robust)["training"], "\n")
cat("FE (clustered by firm):", se(m_fe_cluster)["training"], "\n")

Expected output:
=== Standard Errors on Training ===
Pooled OLS (robust): 0.0282
FE (robust): 0.0354
FE (clustered by firm): 0.0410
Clustered SEs are typically larger because they account
for serial correlation within firms.
| SE Type | SE on Training | Ratio to Robust |
|---|---|---|
| Pooled OLS (robust) | 0.0282 | — |
| FE (robust) | 0.0354 | 1.00x |
| FE (clustered by firm) | 0.0410 | 1.16x |
Clustering inflates the standard errors by about 16%, reflecting within-firm serial correlation that robust SEs ignore.
You have a panel of 50 firms over 20 years and want to cluster standard errors at the firm level. Should you be concerned?
Step 7: Exercises
Try these on your own:
- First differences. Estimate the model using first differences instead of the within transformation. Compare the coefficients. Under what conditions are they identical? (Hint: T = 2.)
- Correlated random effects (Mundlak). Add the firm-level means of training and log_emp as additional regressors in a random effects model. Show that the coefficients on the time-varying variables match the FE estimates. This equivalence is the Mundlak (1978) approach.
- Time-varying confounders. Add a time-varying confounder (e.g., investment) that is correlated with both training and productivity. Show that FE is also biased in this case.
- Unbalanced panel. Randomly drop 20% of observations to create an unbalanced panel. Re-estimate the FE model and compare results. FE handles unbalanced panels naturally.
Summary
In this lab you learned:
- Pooled OLS is biased when unobserved unit-specific heterogeneity is correlated with the regressors
- The within transformation (demeaning) eliminates time-invariant confounders by comparing each unit to itself over time
- Two-way fixed effects (unit + time) also control for common shocks affecting all units in a given period
- The Hausman test compares FE and RE; rejection indicates the RE assumptions are violated
- In most panel settings, clustering standard errors at least at the panel unit level is recommended; with few clusters, use the wild bootstrap
- FE cannot estimate the effect of time-invariant variables or address time-varying confounders