Lab: Random Effects Regression
Understand random effects regression step by step. Learn to estimate RE models, compare them with fixed effects, conduct the Hausman test, implement the Mundlak approach, and recognize when RE is the appropriate choice.
Overview
In this lab you will estimate random effects models using simulated employee panel data on wages. Random effects is a weighted average of the between and within estimators, offering efficiency gains over fixed effects when its key assumption holds: the unobserved individual effect is uncorrelated with the regressors. You will learn to estimate RE, test its assumptions, and understand when it is preferred.
What you will learn:
- How the RE estimator combines between and within variation
- How to estimate RE and interpret the output
- How to compare RE with FE using the Hausman test
- How the Mundlak (correlated random effects) approach nests both FE and RE
- When RE is genuinely preferred over FE
Prerequisites: Familiarity with fixed effects regression (see the FE lab) and basic panel data concepts.
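For reference, the weighted-average idea can be written out explicitly. The notation is an assumption of this sketch, not the lab's own: $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$ are the variances of the individual effect and the idiosyncratic error, $T$ is the number of periods per employee in a balanced panel, and bars denote employee-level means. RE is OLS on quasi-demeaned data, where the share $\theta$ of each employee mean that is subtracted depends on the variance components:

$$
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}, \qquad
y_{it} - \theta\,\bar{y}_i = (x_{it} - \theta\,\bar{x}_i)'\beta + (1-\theta)\,\alpha_i + (\varepsilon_{it} - \theta\,\bar{\varepsilon}_i), \qquad
\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\,\sigma_\alpha^2}}
$$

At $\theta = 0$ this is pooled OLS, and at $\theta = 1$ it is the within (FE) estimator.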
Step 1: Simulate Employee Panel Data
We create a panel of 500 employees observed over 8 years. In this simulation, the individual effect is uncorrelated with the regressors, making RE the appropriate estimator.
library(plm)
library(fixest)
library(modelsummary)
set.seed(42)
N <- 500
T_per <- 8
alpha_i <- rnorm(N, sd = 1.5)
employee_id <- rep(1:N, each = T_per)
year <- rep(2015:(2015 + T_per - 1), N)
alpha_rep <- rep(alpha_i, each = T_per)
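# Caution: exper and tenure both advance by exactly one per year, so their
# within-employee deviations are identical. The within (FE) transformation in
# Step 3 therefore cannot separate the two coefficients; give tenure its own
# within-employee variation (e.g., resets at job changes) if you need both under FE.
exper <- rep(0:(T_per - 1), N) + rep(runif(N, 0, 15), each = T_per)
tenure <- rep(0:(T_per - 1), N) + rep(runif(N, 0, 5), each = T_per)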
educ <- rep(round(pmin(pmax(rnorm(N, 14, 3), 8), 22)), each = T_per)
female <- rep(rbinom(N, 1, 0.48), each = T_per)
union <- rbinom(N * T_per, 1, 0.25)
log_wage <- 2.0 + alpha_rep + 0.05 * educ + 0.03 * exper -
0.0004 * exper^2 + 0.02 * tenure - 0.12 * female +
0.08 * union + rnorm(N * T_per, sd = 0.3)
df <- data.frame(employee_id = factor(employee_id), year = factor(year),
log_wage, exper, tenure, educ, female, union)
cat("Panel:", N, "employees x", T_per, "years =", nrow(df), "obs\n")
Expected output:
Panel: 500 employees x 8 years = 4000 obs
| | log_wage | exper | tenure | educ | female | union |
|---|---|---|---|---|---|---|
| count | 4000.000 | 4000.000 | 4000.000 | 4000.000 | 4000.000 | 4000.000 |
| mean | 3.128 | 10.92 | 5.98 | 13.87 | 0.478 | 0.251 |
| std | 1.626 | 5.41 | 2.87 | 2.84 | 0.500 | 0.434 |
| min | -1.850 | 0.00 | 0.00 | 8.00 | 0.00 | 0.00 |
| 25% | 1.987 | 6.68 | 3.74 | 12.00 | 0.00 | 0.00 |
| 50% | 3.115 | 10.85 | 5.92 | 14.00 | 0.00 | 0.00 |
| 75% | 4.230 | 15.07 | 8.18 | 16.00 | 1.00 | 1.00 |
| max | 8.420 | 25.50 | 12.00 | 22.00 | 1.00 | 1.00 |
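The simulation code above does not itself print this summary table. A minimal sketch of one way to build a table with the same rows in base R follows; `describe` is a hypothetical helper name, and the toy data frame is made up purely to show the output shape.

```r
# Hypothetical helper: summary statistics for each numeric column of a data frame
describe <- function(d) {
  sapply(d, function(v) c(
    count = length(v),
    mean  = mean(v),
    std   = sd(v),
    min   = min(v),
    `25%` = unname(quantile(v, 0.25)),
    `50%` = median(v),
    `75%` = unname(quantile(v, 0.75)),
    max   = max(v)
  ))
}

# Toy data (illustrative only, not the lab's panel)
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(0, 0, 1, 1, 1))
tab <- describe(toy)
print(round(tab, 3))
```

With the lab's data, the call would be `describe(df[, c("log_wage", "exper", "tenure", "educ", "female", "union")])`.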
Step 2: Estimate the Random Effects Model
# Random Effects
pdf <- pdata.frame(df, index = c("employee_id", "year"))
re_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "random")
summary(re_model)
# Variance components
cat("\nVariance decomposition:\n")
ercomp(re_model)
Expected output:
=== Random Effects ===
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| Intercept | 1.3450 | 0.121 | 11.12 | 0.000 |
| exper | 0.0305 | 0.002 | 15.25 | 0.000 |
| I(exper^2) | -0.0004 | 0.000 | -4.82 | 0.000 |
| tenure | 0.0198 | 0.003 | 6.60 | 0.000 |
| educ | 0.0512 | 0.006 | 8.53 | 0.000 |
| female | -0.1185 | 0.054 | -2.19 | 0.028 |
| union | 0.0815 | 0.011 | 7.41 | 0.000 |
Variance of individual effect (sigma_alpha^2): 2.1532
Variance of idiosyncratic error (sigma_e^2): 0.0908
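The slope coefficients are close to their true values (exper=0.03, tenure=0.02, educ=0.05, female=-0.12, union=0.08), and the estimated variance components are close to the values built into the simulation (sigma_alpha^2 = 1.5^2 = 2.25, sigma_e^2 = 0.3^2 = 0.09). Unlike FE, RE can also estimate the effects of time-invariant variables such as education and gender.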
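The variance components determine how much of each employee's mean RE subtracts before running OLS. A minimal sketch, plugging the two variances reported above into the standard balanced-panel formula for the quasi-demeaning weight (the variable names are chosen here for illustration):

```r
# Variance components as reported in the RE output above
sigma2_alpha <- 2.1532   # variance of the individual effect
sigma2_e     <- 0.0908   # variance of the idiosyncratic error
T_per        <- 8        # periods per employee (balanced panel)

# Quasi-demeaning weight: share of each employee mean that RE subtracts
theta <- 1 - sqrt(sigma2_e / (sigma2_e + T_per * sigma2_alpha))

# Share of the composite error variance due to the individual effect
icc <- sigma2_alpha / (sigma2_alpha + sigma2_e)

cat("theta:", round(theta, 3), " ICC:", round(icc, 3), "\n")
```

Theta comes out at roughly 0.93, so with these variance components RE subtracts most of each employee mean and sits much closer to the within estimator than to pooled OLS. In plm, the fitted weight should also be available as `ercomp(re_model)$theta`.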
Step 3: Compare RE with FE and Pooled OLS
# Fixed Effects
fe_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "within")
# Pooled OLS
pooled <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
data = pdf, model = "pooling")
# Compare
modelsummary(list("Pooled" = pooled, "RE" = re_model, "FE" = fe_model),
stars = TRUE,
coef_map = c("exper" = "Experience", "tenure" = "Tenure",
"union" = "Union", "educ" = "Education",
"female" = "Female"))
cat("\nNote: FE drops time-invariant variables (educ, female)\n")
cat("RE estimates educ:", coef(re_model)["educ"],
" female:", coef(re_model)["female"], "\n")
Expected output:
Variable True Pooled RE FE
--------------------------------------------------
exper 0.0300 0.0302 0.0305 0.0308
tenure 0.0200 0.0195 0.0198 0.0202
union 0.0800 0.0798 0.0815 0.0823
Note: FE drops time-invariant variables (educ, female).
RE can estimate them:
educ (true=0.05): RE = 0.0512
female (true=-0.12): RE = -0.1185
| Variable | True | Pooled OLS | RE | FE |
|---|---|---|---|---|
| exper | 0.0300 | 0.0302 | 0.0305 | 0.0308 |
| tenure | 0.0200 | 0.0195 | 0.0198 | 0.0202 |
| union | 0.0800 | 0.0798 | 0.0815 | 0.0823 |
| educ | 0.0500 | 0.0498 | 0.0512 | — (dropped) |
| female | -0.1200 | -0.1195 | -0.1185 | — (dropped) |
Time-varying coefficients are similar across all three estimators. The key advantage of RE: it can estimate education and gender effects that FE cannot.
The FE estimator cannot estimate the effect of education or gender in this panel because these variables are time-invariant. RE can. Does this mean RE is always better for estimating time-invariant effects?
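One way to think about the question: RE borrows between-employee variation, and that variation is only trustworthy when the individual effect is uncorrelated with the regressors. The base-R sketch below uses a separate toy simulation (not the lab's data; all names and parameter values are assumptions for illustration) in which the regressor is deliberately correlated with the individual effect. It compares the within slope with the between slope, the two ingredients RE averages.

```r
set.seed(1)
n <- 300; t_per <- 6
id    <- rep(seq_len(n), each = t_per)
alpha <- rnorm(n)

# x is built to be correlated with alpha, which violates the RE assumption
x <- 0.8 * alpha[id] + rnorm(n * t_per)
y <- 1 + 0.5 * x + alpha[id] + rnorm(n * t_per, sd = 0.5)   # true slope is 0.5

x_bar <- ave(x, id)          # employee-level means, repeated within employee
y_bar <- ave(y, id)
first <- !duplicated(id)     # one row per employee

# Within estimator: OLS on deviations from employee means
within_fit  <- lm(I(y - y_bar) ~ 0 + I(x - x_bar))
# Between estimator: OLS on the employee means
between_fit <- lm(y_bar[first] ~ x_bar[first])

b_within  <- unname(coef(within_fit)[1])
b_between <- unname(coef(between_fit)[2])
cat("within:", round(b_within, 3), " between:", round(b_between, 3), "\n")
```

The within slope stays near 0.5, while the between slope is pulled far above it because it picks up the effect of alpha. Any estimator that puts weight on the between variation, RE included, inherits part of that bias, and coefficients on time-invariant variables are identified from between variation alone.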
Step 4: The Hausman Test
# Hausman test
ht <- phtest(fe_model, re_model)
print(ht)
if (ht$p.value > 0.05) {
cat("\n=> Fail to reject H0: RE assumption appears valid.\n")
cat(" RE appears appropriate (more efficient if exogeneity holds).\n")
} else {
cat("\n=> Reject H0: RE assumption is violated. Use FE.\n")
}
Expected output:
Hausman test statistic: 3.2145
Degrees of freedom: 3
p-value: 0.3594
=> Fail to reject H0: RE assumption appears valid.
RE appears appropriate (more efficient if exogeneity holds).
| Test | Statistic | df | p-value | Decision |
|---|---|---|---|---|
| Hausman | 3.21 | 3 | 0.359 | Fail to reject; no evidence against RE |
The Hausman test fails to reject, which is expected because in this DGP the individual effect (alpha_i) is uncorrelated with the regressors by construction.
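To make the mechanics of the test concrete, the sketch below computes the textbook Hausman statistic, (b_FE - b_RE)' [V_FE - V_RE]^(-1) (b_FE - b_RE), over the coefficients both models share. `hausman_stat` is a hypothetical helper written for this illustration, and the inputs are made-up numbers, not estimates from the lab. The last line checks the p-value implied by the statistic and degrees of freedom reported above.

```r
# Hypothetical helper: Hausman statistic from coefficient vectors and covariance matrices
hausman_stat <- function(b_fe, b_re, v_fe, v_re) {
  common <- intersect(names(b_fe), names(b_re))   # only coefficients FE can estimate
  d      <- b_fe[common] - b_re[common]
  v_diff <- v_fe[common, common, drop = FALSE] - v_re[common, common, drop = FALSE]
  stat   <- as.numeric(t(d) %*% solve(v_diff) %*% d)
  c(statistic = stat, df = length(common),
    p.value = pchisq(stat, df = length(common), lower.tail = FALSE))
}

# Made-up inputs for illustration (z is time-invariant, so it appears only in RE)
b_fe <- c(x1 = 0.50, x2 = 1.00)
b_re <- c(x1 = 0.48, x2 = 1.03, z = 0.20)
v_fe <- diag(c(0.0004, 0.0009)); dimnames(v_fe) <- list(names(b_fe), names(b_fe))
v_re <- diag(c(0.0003, 0.0007, 0.0100)); dimnames(v_re) <- list(names(b_re), names(b_re))

res <- hausman_stat(b_fe, b_re, v_fe, v_re)
print(round(res, 4))

# p-value implied by the statistic and df reported in the expected output
p_reported <- pchisq(3.2145, df = 3, lower.tail = FALSE)
cat("p-value for chi-squared(3) = 3.2145:", round(p_reported, 3), "\n")
```

With the lab's fitted objects, `hausman_stat(coef(fe_model), coef(re_model), vcov(fe_model), vcov(re_model))` should be comparable to the `phtest` result, since both difference the coefficients the two models have in common.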
Step 5: The Mundlak (Correlated Random Effects) Approach
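The Mundlak approach adds the employee-level means of the time-varying regressors to the RE model. The means absorb any correlation between the individual effect and those regressors, so when a mean is included for every time-varying term, the coefficients on the time-varying regressors reproduce the FE estimates. If the coefficients on the means are jointly zero, the model collapses to standard RE, so a joint test on the means is a regression-based alternative to the Hausman test. In this sense the approach nests both FE and RE.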
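Written out, the idea is to model the individual effect as a linear function of the employee-level means plus an uncorrelated remainder. The symbols are assumptions of this sketch: $\bar{x}_i$ collects the means of the time-varying regressors, $z_i$ the time-invariant regressors (educ, female), and $u_i$ the part of the individual effect that is unrelated to the regressors.

$$
\alpha_i = \bar{x}_i'\gamma + u_i
\quad\Longrightarrow\quad
y_{it} = x_{it}'\beta + \bar{x}_i'\gamma + z_i'\delta + u_i + \varepsilon_{it},
\qquad H_0:\ \gamma = 0
$$

Under $H_0$ the means drop out and the equation is the standard RE model.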
# Mundlak approach
df$exper_mean <- ave(df$exper, df$employee_id)
df$tenure_mean <- ave(df$tenure, df$employee_id)
df$union_mean <- ave(as.numeric(df$union), df$employee_id)
pdf_m <- pdata.frame(df, index = c("employee_id", "year"))
mundlak <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union +
exper_mean + tenure_mean + union_mean,
data = pdf_m, model = "random")
summary(mundlak)
# Joint test on means
library(car)
linearHypothesis(mundlak, c("exper_mean = 0", "tenure_mean = 0", "union_mean = 0"))
Expected output:
=== Mundlak Model ===
| Variable | Coeff | SE | t | p |
|---|---|---|---|---|
| exper | 0.0308 | 0.003 | 10.27 | 0.000 |
| I(exper^2) | -0.0004 | 0.000 | -4.71 | 0.000 |
| tenure | 0.0202 | 0.004 | 5.05 | 0.000 |
| educ | 0.0510 | 0.007 | 7.29 | 0.000 |
| female | -0.1182 | 0.055 | -2.15 | 0.032 |
| union | 0.0823 | 0.012 | 6.86 | 0.000 |
| exper_mean | -0.0012 | 0.008 | -0.15 | 0.881 |
| tenure_mean | -0.0015 | 0.012 | -0.13 | 0.900 |
| union_mean | 0.0085 | 0.040 | 0.21 | 0.832 |
Wald test on group means (Mundlak test):
F-statistic: 0.4521
p-value: 0.7162
If p > 0.05: RE is appropriate (means are not needed)
The group means are individually and jointly insignificant (joint p = 0.72), which is consistent with the RE assumption holding in this DGP.
In the Mundlak model, you add group means of time-varying regressors to the RE regression. If the coefficients on these means are all zero, what does that imply?
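A small numerical check can help with this question. The sketch below uses a separate toy simulation and, to stay in base R, pooled OLS with the group mean added instead of plm's RE estimator (a simplification; all names and parameter values are assumptions). In a balanced panel, the coefficient on x in that regression equals the within (FE) slope, and the coefficient on the mean equals the gap between the between and within slopes.

```r
set.seed(7)
n <- 200; t_per <- 5
id    <- rep(seq_len(n), each = t_per)
alpha <- rnorm(n)
x <- 0.6 * alpha[id] + rnorm(n * t_per)                   # x correlated with alpha by construction
y <- 1 + 0.5 * x + alpha[id] + rnorm(n * t_per, sd = 0.5)

x_bar <- ave(x, id)   # employee-level means, repeated within employee
y_bar <- ave(y, id)

mundlak_fit <- lm(y ~ x + x_bar)                      # pooled OLS plus the group mean
within_fit  <- lm(I(y - y_bar) ~ 0 + I(x - x_bar))    # within (FE) slope
between_fit <- lm(y_bar ~ x_bar)                      # between slope

b_mundlak <- unname(coef(mundlak_fit)["x"])
g_mundlak <- unname(coef(mundlak_fit)["x_bar"])
b_within  <- unname(coef(within_fit)[1])
b_between <- unname(coef(between_fit)["x_bar"])

cat("Mundlak slope on x:", round(b_mundlak, 4), " within slope:", round(b_within, 4), "\n")
cat("Coefficient on mean:", round(g_mundlak, 4),
    " between minus within:", round(b_between - b_within, 4), "\n")
```

So a zero coefficient on the mean says the between and within slopes agree, which is what the RE assumption implies. Here x is correlated with alpha by construction, so the coefficient on the mean is clearly nonzero.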
Step 6: When RE Is Preferred
# Compare SEs
cat("=== SE Comparison ===\n")
vars <- c("exper", "tenure", "union")
for (v in vars) {
fe_se <- summary(fe_model)$coefficients[v, "Std. Error"]
re_se <- summary(re_model)$coefficients[v, "Std. Error"]
cat(v, "- FE SE:", round(fe_se, 5), " RE SE:", round(re_se, 5),
" Ratio:", round(re_se/fe_se, 3), "\n")
}
cat("\nRE estimates of time-invariant effects:\n")
cat(" Education:", coef(re_model)["educ"], "(true: 0.05)\n")
cat(" Female:", coef(re_model)["female"], "(true: -0.12)\n")
Expected output:
=== SE Comparison (time-varying regressors) ===
Variable FE SE RE SE RE/FE
exper 0.00235 0.00200 0.851
tenure 0.00395 0.00300 0.759
union 0.01250 0.01100 0.880
RE SEs are smaller => more precise estimates.
This efficiency gain is real when the RE assumption holds.
RE estimate of education effect: 0.0512 (true: 0.05)
RE estimate of female penalty: -0.1185 (true: -0.12)
FE cannot estimate these at all.
| Variable | FE SE | RE SE | RE/FE Ratio |
|---|---|---|---|
| exper | 0.00235 | 0.00200 | 0.851 |
| tenure | 0.00395 | 0.00300 | 0.759 |
| union | 0.01250 | 0.01100 | 0.880 |
RE standard errors are 12–24% smaller than FE SEs (ratios of 0.76 to 0.88), demonstrating the efficiency advantage when the RE assumption holds.
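The percentages follow directly from the table. A short sketch of the arithmetic, using the standard errors reported above (the column names are chosen here for illustration); squaring the ratio gives the corresponding ratio of sampling variances.

```r
# Standard errors copied from the comparison table above
se <- data.frame(
  variable = c("exper", "tenure", "union"),
  fe = c(0.00235, 0.00395, 0.01250),
  re = c(0.00200, 0.00300, 0.01100)
)

se$ratio       <- se$re / se$fe            # RE SE relative to FE SE
se$pct_smaller <- 100 * (1 - se$ratio)     # how much smaller the RE SE is, in percent
se$var_ratio   <- se$ratio^2               # ratio of sampling variances

print(cbind(se["variable"], round(se[c("ratio", "pct_smaller", "var_ratio")], 3)))
```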
Step 7: Exercises
Try these on your own:
- Violate the RE assumption. Modify the simulation so that alpha_i is correlated with a time-varying regressor (e.g., let each employee's starting experience or union propensity depend on alpha_i). Re-run the Hausman test and verify that it now rejects. Correlating alpha_i only with education (e.g., alpha_i = 0.3*educ + noise) is not enough: educ is time-invariant and already in the model, so RE absorbs the correlation into the educ coefficient, and the Hausman test, which compares only the coefficients FE can estimate, has nothing to detect.
- Hausman-Taylor estimator. When some time-invariant variables are endogenous, the Hausman-Taylor estimator uses the exogenous time-varying regressors as instruments. Implement this using plm::pht (R) or xthtaylor (Stata).
- Between estimator. Estimate the between model (regression on group means) and compare it with FE, RE, and pooled OLS. Show that RE is a matrix-weighted average of the between and within estimators.
- GLS by hand. Compute the quasi-demeaning parameter theta and implement the RE estimator as OLS on quasi-demeaned data. Verify your results match the packaged RE estimator.
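As a starting point for the first exercise, here is a minimal sketch of one way to tie the individual effect to a time-varying regressor. The Hausman test compares only coefficients on time-varying regressors, so the correlation has to involve one of them to be detectable. The dependence of starting experience on alpha_i, the coefficient of 2, and the name `exper_start` are all assumptions chosen for illustration.

```r
set.seed(123)
N <- 500
T_per <- 8
alpha_i <- rnorm(N, sd = 1.5)

# Assumption: employees with a higher alpha_i start the panel with more experience.
# pmax() keeps starting experience non-negative.
exper_start <- pmax(7.5 + 2 * alpha_i + rnorm(N, sd = 3), 0)

exper     <- rep(0:(T_per - 1), N) + rep(exper_start, each = T_per)
alpha_rep <- rep(alpha_i, each = T_per)

# The regressor is now correlated with the individual effect
rho <- cor(exper, alpha_rep)
cat("cor(exper, alpha_i):", round(rho, 2), "\n")
```

From here, build log_wage as in Step 1 with this `exper` and `alpha_rep`, refit both models, and re-run the Hausman test.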
Summary
In this lab you learned:
- The RE estimator is a weighted average of the between and within estimators, trading bias risk for efficiency
- RE requires that the individual effect be uncorrelated with the regressors — a strong assumption that must be tested
- The Hausman test compares FE and RE; failure to reject is consistent with RE but does not prove that its assumption holds
- The Mundlak approach nests FE within RE by adding group means of time-varying regressors
- RE's key advantages are efficiency gains and the ability to estimate effects of time-invariant variables
- In many observational panel settings, FE is the more conservative choice, though RE can be preferable when its assumptions are credible and you need to estimate time-invariant regressors