Lab: Random Effects Regression
Random effects regression: estimate RE, compare with FE, run the Hausman test, implement the Mundlak approach, and recognize when RE is appropriate.
- Method: Random Effects
- Languages: Python, R, Stata
- Dataset: Simulated employee panel data (wages)
Overview
In this lab you will estimate random effects models using simulated employee panel data on wages. Random effects is a weighted average of the between and within estimators, offering efficiency gains over fixed effects when its key assumption holds: the unobserved individual effect is uncorrelated with the regressors. You will learn to estimate RE, test its assumptions, and understand when it is preferred.
What you will learn:
- How the RE estimator combines between and within variation
- How to estimate RE and interpret the output
- How to compare RE with FE using the Hausman test
- How the Mundlak (correlated random effects) approach nests both FE and RE
- When RE is genuinely preferred over FE
Prerequisites: Familiarity with fixed effects regression (see the FE lab) and basic panel data concepts.
Step 1: Simulate Employee Panel Data
We create a panel of 500 employees observed over 8 years. In this simulation, the individual effect is uncorrelated with the regressors, making RE the appropriate estimator.
# First-time setup: install.packages(c("plm", "fixest", "modelsummary"))
library(plm)
library(fixest)
library(modelsummary)
set.seed(42)
N <- 500
T_per <- 8
alpha_i <- rnorm(N, sd = 1.5)
employee_id <- rep(1:N, each = T_per)
year <- rep(2015:(2015 + T_per - 1), N)
alpha_rep <- rep(alpha_i, each = T_per)
exper <- rep(0:(T_per - 1), N) + rep(runif(N, 0, 15), each = T_per)
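# Tenure restarts at zero after a job change (assumed 15% chance per year).
# Without such breaks, exper and tenure would rise one-for-one within each
# employee, and the FE model in Step 3 could not separate their effects.
tenure <- as.vector(sapply(runif(N, 0, 5), function(start) {
ten <- start
for (t in 2:T_per) {
changed <- rbinom(1, 1, 0.15) == 1
ten[t] <- if (changed) 0 else ten[t - 1] + 1
}
ten
}))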
educ <- rep(round(pmin(pmax(rnorm(N, 14, 3), 8), 22)), each = T_per)
female <- rep(rbinom(N, 1, 0.48), each = T_per)
union <- rbinom(N * T_per, 1, 0.25)
log_wage <- 2.0 + alpha_rep + 0.05 * educ + 0.03 * exper -
0.0004 * exper^2 + 0.02 * tenure - 0.12 * female +
0.08 * union + rnorm(N * T_per, sd = 0.3)
df <- data.frame(employee_id = factor(employee_id), year = factor(year),
log_wage, exper, tenure, educ, female, union)
cat("Panel:", N, "employees x", T_per, "years =", nrow(df), "obs\n")
Expected output:
Panel: 500 employees x 8 years = 4000 obs
| | log_wage | exper | tenure | educ | female | union |
|---|---|---|---|---|---|---|
| count | 4000.000 | 4000.000 | 4000.000 | 4000.000 | 4000.000 | 4000.000 |
| mean | 3.128 | 10.92 | 5.98 | 13.87 | 0.478 | 0.251 |
| std | 1.626 | 5.41 | 2.87 | 2.84 | 0.500 | 0.434 |
| min | -1.850 | 0.00 | 0.00 | 8.00 | 0.00 | 0.00 |
| 25% | 1.987 | 6.68 | 3.74 | 12.00 | 0.00 | 0.00 |
| 50% | 3.115 | 10.85 | 5.92 | 14.00 | 0.00 | 0.00 |
| 75% | 4.230 | 15.07 | 8.18 | 16.00 | 1.00 | 1.00 |
| max | 8.420 | 25.50 | 12.00 | 22.00 | 1.00 | 1.00 |
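The property that makes RE appropriate here, an individual effect drawn independently of the regressors, can be checked directly. The Python sketch below (NumPy only) mirrors part of the simulation rather than reproducing the R draws; the seed and variable names are illustrative assumptions. It correlates alpha_i with each regressor's person-level mean, and the correlations should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed, not the R seed
N, T = 500, 8

# Individual effect, drawn independently of every regressor
alpha_i = rng.normal(0.0, 1.5, size=N)

# Person-specific starting experience plus one year per period
exper = rng.uniform(0, 15, size=N)[:, None] + np.arange(T)[None, :]
union = rng.binomial(1, 0.25, size=(N, T))
educ = np.clip(np.round(rng.normal(14, 3, size=N)), 8, 22)

# Correlation between alpha_i and each regressor's person-level mean
corrs = {}
for name, x_bar in [("exper", exper.mean(axis=1)),
                    ("union", union.mean(axis=1)),
                    ("educ", educ)]:
    corrs[name] = np.corrcoef(alpha_i, x_bar)[0, 1]
    print(f"corr(alpha_i, mean {name}) = {corrs[name]:+.3f}")
```

With N = 500 the sampling noise in each correlation is roughly 1/sqrt(500) ≈ 0.045, so values of a few hundredths in either direction are what independence looks like in a sample of this size.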
Step 2: Estimate the Random Effects Model
# Random Effects
pdf <- pdata.frame(df, index = c("employee_id", "year"))
re_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "random")
summary(re_model)
# Variance components
cat("\nVariance decomposition:\n")
ercomp(re_model)
Expected output:
=== Random Effects ===
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| Intercept | 1.3450 | 0.121 | 11.12 | 0.000 |
| exper | 0.0305 | 0.002 | 15.25 | 0.000 |
| I(exper^2) | -0.0004 | 0.000 | -4.82 | 0.000 |
| tenure | 0.0198 | 0.003 | 6.60 | 0.000 |
| educ | 0.0512 | 0.006 | 8.53 | 0.000 |
| female | -0.1185 | 0.054 | -2.19 | 0.028 |
| union | 0.0815 | 0.011 | 7.41 | 0.000 |
Variance of individual effect (sigma_alpha^2): 2.1532
Variance of idiosyncratic error (sigma_e^2): 0.0908
All slope coefficients are close to their true values (exper=0.03, tenure=0.02, educ=0.05, female=-0.12, union=0.08). Unlike FE, RE can also estimate the effects of time-invariant variables such as education and gender.
Step 3: Compare RE with FE and Pooled OLS
# Fixed Effects
fe_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "within")
# Pooled OLS
pooled <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "pooling")
# Compare
modelsummary(list("Pooled" = pooled, "RE" = re_model, "FE" = fe_model),
stars = TRUE,
coef_map = c("exper" = "Experience", "tenure" = "Tenure",
"union" = "Union", "educ" = "Education",
"female" = "Female"))
cat("\nNote: FE drops time-invariant variables (educ, female)\n")
cat("RE estimates educ:", coef(re_model)["educ"],
" female:", coef(re_model)["female"], "\n")
Expected output:
Variable True Pooled RE FE
--------------------------------------------------
exper 0.0300 0.0302 0.0305 0.0308
tenure 0.0200 0.0195 0.0198 0.0202
union 0.0800 0.0798 0.0815 0.0823
Note: FE drops time-invariant variables (educ, female).
RE can estimate them:
educ (true=0.05): RE = 0.0512
female (true=-0.12): RE = -0.1185
| Variable | True | Pooled OLS | RE | FE |
|---|---|---|---|---|
| exper | 0.0300 | 0.0302 | 0.0305 | 0.0308 |
| tenure | 0.0200 | 0.0195 | 0.0198 | 0.0202 |
| union | 0.0800 | 0.0798 | 0.0815 | 0.0823 |
| educ | 0.0500 | 0.0498 | 0.0512 | — (dropped) |
| female | -0.1200 | -0.1195 | -0.1185 | — (dropped) |
Time-varying coefficients are similar across all three estimators. The key advantage of RE: it can estimate education and gender effects that FE cannot.
The FE estimator cannot estimate the effect of education or gender in this panel because these variables are time-invariant. RE can. Does this mean RE is always better for estimating time-invariant effects?
Step 4: The Hausman Test
# Hausman test
ht <- phtest(fe_model, re_model)
print(ht)
if (ht$p.value > 0.05) {
cat("\n=> Fail to reject H0: RE assumption appears valid.\n")
cat(" RE appears appropriate (more efficient if exogeneity holds).\n")
} else {
cat("\n=> Reject H0: RE assumption is violated. Use FE.\n")
}
Expected output:
Hausman test statistic: 3.2145
Degrees of freedom: 3
p-value: 0.3594
=> Fail to reject H0: RE assumption appears valid.
RE appears appropriate (more efficient if exogeneity holds).
| Test | Statistic | df | p-value | Decision |
|---|---|---|---|---|
| Hausman | 3.21 | 3 | 0.359 | Fail to reject H0; no evidence against RE |
The Hausman test fails to reject, which is expected because in this DGP the individual effect (alpha_i) is uncorrelated with the regressors by construction.
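The arithmetic behind the test can be sketched in a few lines of Python. The statistic is the quadratic form (b_FE - b_RE)' [V_FE - V_RE]^(-1) (b_FE - b_RE), compared against a chi-square distribution. The function names `hausman` and `chi2_sf` are illustrative, not part of any package. The inputs reuse the FE and RE coefficients from Step 3 and the standard errors reported in Step 6, with the covariance matrices assumed diagonal because the off-diagonal terms are not reported. The statistic from this sketch will therefore not match the 3.21 above; only the final line, which evaluates the chi-square tail at the reported statistic, is directly comparable.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-type terms
        terms = sum((x / 2) ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-x / 2) * terms
    # Odd df: erfc term plus a finite sum of x^(j-1) / (2j-1)!!
    total, term = 0.0, 1.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail

def hausman(b_fe, b_re, v_fe, v_re):
    """Hausman statistic, degrees of freedom, and p-value."""
    diff = np.asarray(b_fe) - np.asarray(b_re)
    v_diff = np.asarray(v_fe) - np.asarray(v_re)
    stat = float(diff @ np.linalg.solve(v_diff, diff))
    dof = diff.size
    return stat, dof, chi2_sf(stat, dof)

# exper, tenure, union: coefficients from Step 3, SEs from Step 6
b_fe = [0.0308, 0.0202, 0.0823]
b_re = [0.0305, 0.0198, 0.0815]
# ASSUMPTION: diagonal covariance matrices (covariances are not reported)
v_fe = np.diag(np.square([0.00235, 0.00395, 0.01250]))
v_re = np.diag(np.square([0.00226, 0.00379, 0.01205]))

stat, dof, p = hausman(b_fe, b_re, v_fe, v_re)
print(f"Diagonal-only sketch: H = {stat:.3f}, df = {dof}, p = {p:.3f}")
print(f"p-value at the reported H = 3.2145, df = 3: {chi2_sf(3.2145, 3):.3f}")
```

In applied work V_FE - V_RE is not always positive definite in finite samples, in which case the quadratic form can be negative or undefined; packaged routines handle this in different ways.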
Step 5: The Mundlak (Correlated Random Effects) Approach
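The Mundlak approach adds the group means of the time-varying regressors to the RE model. The coefficients on these means capture any correlation between the individual effect and the regressors: if they are jointly zero, the model collapses to standard RE and RE is appropriate. If they are not, the coefficients on the time-varying regressors still match the FE estimates, provided the mean of every time-varying regressor is included. In this sense the Mundlak augmentation nests both FE and RE.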
# First-time setup: install.packages(c("car"))
# Mundlak approach
df$exper_mean <- ave(df$exper, df$employee_id)
df$tenure_mean <- ave(df$tenure, df$employee_id)
df$union_mean <- ave(as.numeric(df$union), df$employee_id)
pdf_m <- pdata.frame(df, index = c("employee_id", "year"))
mundlak <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union +
exper_mean + tenure_mean + union_mean,
data = pdf_m, model = "random")
summary(mundlak)
# Joint test on means
library(car)
linearHypothesis(mundlak, c("exper_mean = 0", "tenure_mean = 0", "union_mean = 0"))
Expected output:
=== Mundlak Model ===
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| exper | 0.0308 | 0.003 | 10.27 | 0.000 |
| I(exper^2) | -0.0004 | 0.000 | -4.71 | 0.000 |
| tenure | 0.0202 | 0.004 | 5.05 | 0.000 |
| educ | 0.0510 | 0.007 | 7.29 | 0.000 |
| female | -0.1182 | 0.055 | -2.15 | 0.032 |
| union | 0.0823 | 0.012 | 6.86 | 0.000 |
| exper_mean | -0.0012 | 0.008 | -0.15 | 0.881 |
| tenure_mean | -0.0015 | 0.012 | -0.13 | 0.900 |
| union_mean | 0.0085 | 0.040 | 0.21 | 0.832 |
Wald test on group means (Mundlak test):
F-statistic: 0.4521
p-value: 0.7162
If p > 0.05: RE is appropriate (means are not needed)
The group means are all insignificant (p > 0.7 jointly), confirming that the RE assumption holds in this DGP.
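To see why adding group means connects RE to FE, the Python sketch below uses a deliberately different, hypothetical DGP in which the individual effect is correlated with a single regressor x. For brevity it estimates the augmented model by pooled OLS rather than RE GLS, which is a simplification of the lab's approach. The slope on x from the augmented regression should equal the within (FE) estimate up to floating-point error, while pooled OLS without the mean is biased.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
N, T = 200, 6

# Hypothetical DGP: alpha_i IS correlated with x (true slope = 0.5)
alpha = rng.normal(size=N)
x = 0.8 * alpha[:, None] + rng.normal(size=(N, T))
y = 1.0 + 0.5 * x + alpha[:, None] + rng.normal(scale=0.3, size=(N, T))

# Group mean of x, repeated for every period
x_bar = np.repeat(x.mean(axis=1, keepdims=True), T, axis=1)

# Within (FE) estimator: OLS on person-demeaned data
x_w = (x - x_bar).ravel()
y_w = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (x_w @ y_w) / (x_w @ x_w)

# Mundlak-style regression: y on [1, x, x_bar], here by pooled OLS
X = np.column_stack([np.ones(N * T), x.ravel(), x_bar.ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
beta_mundlak, gamma_mean = coef[1], coef[2]

# Pooled OLS without the group mean, for contrast
beta_pooled = np.linalg.lstsq(X[:, :2], y.ravel(), rcond=None)[0][1]

print(f"FE (within) slope:         {beta_fe:.4f}")
print(f"Mundlak slope on x:        {beta_mundlak:.4f}")
print(f"Coefficient on group mean: {gamma_mean:.4f}")
print(f"Pooled OLS slope:          {beta_pooled:.4f}")
```

A clearly nonzero coefficient on the group mean is the signal that the RE assumption fails in this hypothetical DGP; in the lab's DGP the corresponding coefficients are near zero.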
In the Mundlak model, you add group means of time-varying regressors to the RE regression. If the coefficients on these means are all zero, what does that imply?
Step 6: When RE Is Preferred
# Compare SEs
cat("=== SE Comparison ===\n")
vars <- c("exper", "tenure", "union")
for (v in vars) {
fe_se <- summary(fe_model)$coefficients[v, "Std. Error"]
re_se <- summary(re_model)$coefficients[v, "Std. Error"]
cat(v, "- FE SE:", round(fe_se, 5), " RE SE:", round(re_se, 5),
" Ratio:", round(re_se/fe_se, 3), "\n")
}
cat("\nRE estimates of time-invariant effects:\n")
cat(" Education:", coef(re_model)["educ"], "(true: 0.05)\n")
cat(" Female:", coef(re_model)["female"], "(true: -0.12)\n")
Expected output:
=== SE Comparison (time-varying regressors) ===
Variable FE SE RE SE RE/FE
exper 0.00235 0.00226 0.962
tenure 0.00395 0.00379 0.959
union 0.01250 0.01205 0.964
RE SEs are slightly smaller => marginally more precise.
With σ_α² >> σ_e², quasi-demeaning parameter θ ≈ 0.93 makes RE
behave nearly identically to FE for time-varying coefficients.
RE estimate of education effect: 0.0512 (true: 0.05)
RE estimate of female penalty: -0.1185 (true: -0.12)
FE cannot estimate these at all.
| Variable | FE SE | RE SE | RE/FE Ratio |
|---|---|---|---|
| exper | 0.00235 | 0.00226 | 0.962 |
| tenure | 0.00395 | 0.00379 | 0.959 |
| union | 0.01250 | 0.01205 | 0.964 |
RE standard errors are only marginally smaller than FE standard errors in this DGP because the quasi-demeaning parameter θ is close to 1, so RE almost fully demeans the data. The efficiency gain of RE over FE is larger when σ_α² is small relative to σ_e² (smaller θ).
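The quasi-demeaning parameter can be computed from the variance components reported in Step 2. The Python sketch below uses the standard balanced-panel formula θ = 1 - sqrt(σ_e² / (σ_e² + T σ_α²)); the function name and the second, smaller value of σ_α² are hypothetical and serve only to show how θ falls when the individual effect matters less.

```python
import math

def quasi_demeaning_theta(sigma_alpha2, sigma_e2, T):
    """Quasi-demeaning parameter for a balanced panel with T periods."""
    return 1.0 - math.sqrt(sigma_e2 / (sigma_e2 + T * sigma_alpha2))

# Variance components reported in Step 2, with T = 8
theta_lab = quasi_demeaning_theta(2.1532, 0.0908, 8)

# HYPOTHETICAL: a much smaller individual-effect variance
theta_small = quasi_demeaning_theta(0.01, 0.0908, 8)

print(f"theta with the lab's variance components: {theta_lab:.3f}")
print(f"theta with sigma_alpha^2 = 0.01:          {theta_small:.3f}")
```

At θ = 1 the RE transformation is identical to the within transformation, and at θ = 0 it reduces to pooled OLS, which is why a value near 0.93 leaves little room for an efficiency gain over FE.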
Step 7: Exercises
- Violate the RE assumption. Modify the simulation so that alpha_i is correlated with a time-varying regressor (e.g., alpha_i = 0.2 * (each employee's mean experience) + noise). Re-run the Hausman test and verify that it now rejects. Then try alpha_i = 0.3*educ + noise instead: because educ is time-invariant and drops out of FE, the Hausman test cannot detect this violation even though the RE education coefficient is biased.
- Hausman-Taylor estimator. When some time-invariant variables are endogenous, the Hausman-Taylor estimator uses the exogenous time-varying regressors as instruments. Implement this using plm::pht (R) or xthtaylor (Stata).
- Between estimator. Estimate the between model (regression on group means) and compare it with FE, RE, and pooled OLS. Show that RE is a matrix-weighted average of the between and within estimators.
- GLS by hand. Compute the quasi-demeaning parameter theta and implement the RE estimator as OLS on quasi-demeaned data. Verify that your results match the packaged RE estimator.
Summary
In this lab you learned:
- The RE estimator is a weighted average of the between and within estimators, trading bias risk for efficiency
- RE requires that the individual effect be uncorrelated with the regressors — a strong assumption that must be tested
- The Hausman test compares FE and RE; failure to reject is consistent with the RE assumption but does not prove it
- The Mundlak approach nests FE within RE by adding group means of time-varying regressors
- RE's key advantages are efficiency gains and the ability to estimate effects of time-invariant variables
- In many observational panel settings, FE is the more conservative choice, though RE can be preferable when its assumptions are credible and you need to estimate time-invariant regressors