MethodAtlas
Tutorial · 75 minutes

Lab: Random Effects Regression

Understand random effects regression step by step. Learn to estimate RE models, compare them with fixed effects, conduct the Hausman test, implement the Mundlak approach, and recognize when RE is the appropriate choice.

Overview

In this lab you will estimate random effects models using simulated employee panel data on wages. Random effects is a weighted average of the between and within estimators, offering efficiency gains over fixed effects when its key assumption holds: the unobserved individual effect is uncorrelated with the regressors. You will learn to estimate RE, test its assumptions, and understand when it is preferred.

What you will learn:

  • How the RE estimator combines between and within variation
  • How to estimate RE and interpret the output
  • How to compare RE with FE using the Hausman test
  • How the Mundlak (correlated random effects) approach nests both FE and RE
  • When RE is genuinely preferred over FE

Prerequisites: Familiarity with fixed effects regression (see the FE lab) and basic panel data concepts.
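The description of RE as a weighted average of the between and within estimators can be checked numerically in the simplest case. The sketch below is illustrative only: it assumes a balanced panel with a single regressor, a hypothetical DGP, and variance components that are known by construction (so theta is not estimated). Under those assumptions, OLS on quasi-demeaned data reproduces a weighted average of the within and between slopes exactly.

```r
set.seed(1)
N <- 200; T_per <- 5
id <- rep(1:N, each = T_per)

# Hypothetical DGP: individual effect independent of x (the RE assumption holds)
alpha <- rep(rnorm(N, sd = 1), each = T_per)
x <- rnorm(N * T_per) + rep(rnorm(N), each = T_per)
y <- 1 + 0.5 * x + alpha + rnorm(N * T_per, sd = 0.5)

# Quasi-demeaning weight from the true variance components (assumed known here)
sigma_a2 <- 1; sigma_e2 <- 0.25
theta <- 1 - sqrt(sigma_e2 / (sigma_e2 + T_per * sigma_a2))

x_bar <- ave(x, id); y_bar <- ave(y, id)

# Within, between, and RE (quasi-demeaned OLS) slopes
b_within  <- coef(lm(I(y - y_bar) ~ 0 + I(x - x_bar)))[[1]]
b_between <- coef(lm(y_bar ~ x_bar))[[2]]
b_re      <- coef(lm(I(y - theta * y_bar) ~ I(x - theta * x_bar)))[[2]]

# Weight on the within estimator: within variation relative to total weighted variation
S_w <- sum((x - x_bar)^2)
S_b <- sum((x_bar - mean(x))^2)   # summed over all N * T_per rows
lambda <- S_w / (S_w + (1 - theta)^2 * S_b)
b_avg <- lambda * b_within + (1 - lambda) * b_between

c(within = b_within, between = b_between, re = b_re, weighted_avg = b_avg)
```

As theta approaches 1 the weight lambda approaches 1 and RE collapses to the within (FE) estimator; as theta approaches 0 it collapses to pooled OLS.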


Step 1: Simulate Employee Panel Data

We create a panel of 500 employees observed over 8 years. In this simulation, the individual effect is uncorrelated with the regressors, making RE the appropriate estimator.

library(plm)
library(fixest)
library(modelsummary)

set.seed(42)
N <- 500
T_per <- 8

alpha_i <- rnorm(N, sd = 1.5)

employee_id <- rep(1:N, each = T_per)
year <- rep(2015:(2015 + T_per - 1), N)
alpha_rep <- rep(alpha_i, each = T_per)

exper <- rep(0:(T_per - 1), N) + rep(runif(N, 0, 15), each = T_per)
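# Tenure rises by one per year but resets to zero in a year with a job change
# (10% chance per year after the first). Without resets, tenure and exper would
# move in lockstep within each employee, and FE could not separate their effects.
job_change <- rbinom(N * T_per, 1, 0.10)
job_change[year == 2015] <- 0
spell <- ave(job_change, employee_id, FUN = cumsum)
years_in_spell <- ave(rep(1, N * T_per), employee_id, spell, FUN = cumsum) - 1
tenure <- years_in_spell + ifelse(spell == 0, rep(runif(N, 0, 5), each = T_per), 0)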
educ <- rep(round(pmin(pmax(rnorm(N, 14, 3), 8), 22)), each = T_per)
female <- rep(rbinom(N, 1, 0.48), each = T_per)
union <- rbinom(N * T_per, 1, 0.25)

log_wage <- 2.0 + alpha_rep + 0.05 * educ + 0.03 * exper -
          0.0004 * exper^2 + 0.02 * tenure - 0.12 * female +
          0.08 * union + rnorm(N * T_per, sd = 0.3)

df <- data.frame(employee_id = factor(employee_id), year = factor(year),
               log_wage, exper, tenure, educ, female, union)

cat("Panel:", N, "employees x", T_per, "years =", nrow(df), "obs\n")

Expected output:

Panel: 500 employees x 8 years = 4000 obs
         log_wage     exper    tenure      educ    female     union
-------------------------------------------------------------------
count    4000.000  4000.000  4000.000  4000.000  4000.000  4000.000
mean        3.128     10.92      5.98     13.87     0.478     0.251
std         1.626      5.41      2.87      2.84     0.500     0.434
min        -1.850      0.00      0.00      8.00      0.00      0.00
25%         1.987      6.68      3.74     12.00      0.00      0.00
50%         3.115     10.85      5.92     14.00      0.00      0.00
75%         4.230     15.07      8.18     16.00      1.00      1.00
max         8.420     25.50     12.00     22.00      1.00      1.00

Step 2: Estimate the Random Effects Model

# Random Effects
pdf <- pdata.frame(df, index = c("employee_id", "year"))
re_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
              data = pdf, model = "random")
summary(re_model)

# Variance components
cat("\nVariance decomposition:\n")
ercomp(re_model)
Requires: plm

Expected output:

=== Random Effects ===
Variable        Coeff      SE        t       p
----------------------------------------------
Intercept      1.3450   0.121    11.12   0.000
exper          0.0305   0.002    15.25   0.000
I(exper^2)    -0.0004   0.000    -4.82   0.000
tenure         0.0198   0.003     6.60   0.000
educ           0.0512   0.006     8.53   0.000
female        -0.1185   0.054    -2.19   0.028
union          0.0815   0.011     7.41   0.000
Variance of individual effect (sigma_alpha^2): 2.1532
Variance of idiosyncratic error (sigma_e^2): 0.0908

The slope estimates are close to their true values (exper=0.03, tenure=0.02, educ=0.05, female=-0.12, union=0.08). Unlike FE, RE can also estimate the effects of time-invariant variables such as education and gender.
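The two variance components determine how far RE moves from pooled OLS toward FE. As a rough illustration, the snippet below plugs the values reported in the output above into the standard balanced-panel formula for the quasi-demeaning parameter theta; the variable names are chosen for this example only.

```r
# Variance components as reported in the expected output above
sigma_alpha2 <- 2.1532   # variance of the individual effect
sigma_e2     <- 0.0908   # variance of the idiosyncratic error
T_per        <- 8        # periods per employee (balanced panel)

# Quasi-demeaning parameter: RE subtracts theta times the group mean from each variable
theta <- 1 - sqrt(sigma_e2 / (sigma_e2 + T_per * sigma_alpha2))
round(theta, 3)   # 0.928
```

With theta this close to 1, the RE transformation is nearly the full within transformation, so the RE and FE estimates of the time-varying coefficients should be similar in this panel.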


Step 3: Compare RE with FE and Pooled OLS

# Fixed Effects
fe_model <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
              data = pdf, model = "within")

# Pooled OLS
pooled <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union,
            data = pdf, model = "pooling")

# Compare
modelsummary(list("Pooled" = pooled, "RE" = re_model, "FE" = fe_model),
           stars = TRUE,
           coef_map = c("exper" = "Experience", "tenure" = "Tenure",
                       "union" = "Union", "educ" = "Education",
                       "female" = "Female"))

cat("\nNote: FE drops time-invariant variables (educ, female)\n")
cat("RE estimates educ:", coef(re_model)["educ"],
  " female:", coef(re_model)["female"], "\n")

Expected output:

Variable      True   Pooled       RE       FE
--------------------------------------------------
exper       0.0300   0.0302   0.0305   0.0308
tenure      0.0200   0.0195   0.0198   0.0202
union       0.0800   0.0798   0.0815   0.0823

Note: FE drops time-invariant variables (educ, female).
RE can estimate them:
  educ (true=0.05):   RE = 0.0512
  female (true=-0.12): RE = -0.1185
Variable      True   Pooled OLS       RE          FE
----------------------------------------------------
exper       0.0300       0.0302   0.0305      0.0308
tenure      0.0200       0.0195   0.0198      0.0202
union       0.0800       0.0798   0.0815      0.0823
educ        0.0500       0.0498   0.0512   (dropped)
female     -0.1200      -0.1195  -0.1185   (dropped)

The coefficients on the time-varying regressors are similar across all three estimators. The key advantage of RE is that it can also estimate the education and gender effects, which FE cannot.
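The reason FE drops education and gender is mechanical: the within transformation subtracts each employee's own mean, which leaves nothing for a variable that never changes. A tiny illustration with made-up numbers (the helper name demean is hypothetical):

```r
# Three employees observed for four years each (toy data)
id    <- rep(1:3, each = 4)
educ  <- rep(c(12, 16, 14), each = 4)              # constant within employee
exper <- rep(0:3, 3) + rep(c(2, 5, 9), each = 4)   # rises by one each year

# Within transformation: subtract the employee-specific mean
demean <- function(v, g) v - ave(v, g)

demean(educ, id)    # all zeros: no within variation left to identify a coefficient
demean(exper, id)   # -1.5 -0.5 0.5 1.5 for every employee
```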

Concept Check

The FE estimator cannot estimate the effect of education or gender in this panel because these variables are time-invariant. RE can. Does this mean RE is always better for estimating time-invariant effects?
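One way to explore this question is to simulate a case where the individual effect is correlated with education itself. The sketch below uses a hypothetical, simplified DGP (one time-varying regressor x, education effect 0.05, and an individual effect that loads on education with slope 0.3); it is not the lab's simulation. Under these assumptions the RE coefficient on educ absorbs the correlation and lands far from 0.05, while the coefficient on x stays close to its true value under both estimators.

```r
library(plm)

set.seed(123)
N <- 300; T_per <- 6
id   <- rep(1:N, each = T_per)
year <- rep(1:T_per, N)

educ_i  <- round(rnorm(N, 14, 2))
# Individual effect correlated with education: the RE assumption fails for educ
alpha_i <- 0.3 * (educ_i - 14) + rnorm(N, sd = 0.5)

educ <- rep(educ_i, each = T_per)
x    <- rnorm(N * T_per)   # time-varying regressor, unrelated to alpha_i
y    <- 1 + 0.05 * educ + 0.5 * x + rep(alpha_i, each = T_per) +
        rnorm(N * T_per, sd = 0.3)

d  <- pdata.frame(data.frame(id, year, y, x, educ), index = c("id", "year"))
re <- plm(y ~ x + educ, data = d, model = "random")
fe <- plm(y ~ x, data = d, model = "within")

coef(re)["educ"]                              # compare with the true value 0.05
c(fe_x = coef(fe)[["x"]], re_x = coef(re)[["x"]])   # both near the true value 0.5
```

The point of the design is that the ability to estimate a time-invariant coefficient does not make that estimate credible: it is only as good as the assumption that the individual effect is uncorrelated with that variable.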


Step 4: The Hausman Test

# Hausman test
ht <- phtest(fe_model, re_model)
print(ht)

if (ht$p.value > 0.05) {
  cat("\n=> Fail to reject H0: RE assumption appears valid.\n")
  cat("   RE appears appropriate (more efficient if exogeneity holds).\n")
} else {
  cat("\n=> Reject H0: RE assumption is violated. Use FE.\n")
}

Expected output:

Hausman test statistic: 3.2145
Degrees of freedom: 3
p-value: 0.3594

=> Fail to reject H0: RE assumption appears valid.
   RE appears appropriate (more efficient if exogeneity holds).
Test       Statistic   df   p-value   Decision
---------------------------------------------------------------------
Hausman         3.21    3     0.359   Fail to reject; RE is appropriate

The Hausman test fails to reject, which is expected because in this DGP the individual effect (alpha_i) is uncorrelated with the regressors by construction.
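To see what the test computes, the statistic can be rebuilt from the two fitted models: it is a quadratic form in the difference between the FE and RE coefficients on the regressors they share, weighted by the inverse of the difference in their covariance matrices. The sketch below assumes a small hypothetical panel with two time-varying regressors (x1, x2) in which the RE assumption holds; it is meant as an illustration of the formula, not a replacement for phtest.

```r
library(plm)

set.seed(2024)
N <- 200; T_per <- 5
d <- data.frame(id = rep(1:N, each = T_per), year = rep(1:T_per, N))
d$x1 <- rnorm(N * T_per) + rep(rnorm(N), each = T_per)
d$x2 <- rbinom(N * T_per, 1, 0.3)
# Individual effect drawn independently of x1 and x2
d$y  <- 1 + 0.5 * d$x1 + 0.2 * d$x2 + rep(rnorm(N), each = T_per) +
        rnorm(N * T_per, sd = 0.5)

pd <- pdata.frame(d, index = c("id", "year"))
fe <- plm(y ~ x1 + x2, data = pd, model = "within")
re <- plm(y ~ x1 + x2, data = pd, model = "random")

# Compare only the coefficients both models estimate (no intercept in FE)
common <- names(coef(fe))
diff_b <- coef(fe) - coef(re)[common]
diff_V <- vcov(fe) - vcov(re)[common, common]

H <- as.numeric(t(diff_b) %*% solve(diff_V) %*% diff_b)
p_value <- pchisq(H, df = length(common), lower.tail = FALSE)

c(statistic = H, df = length(common), p.value = p_value)
phtest(fe, re)   # packaged version, for comparison
```

Because the comparison involves only the shared coefficients, the degrees of freedom equal the number of time-varying regressors that FE can estimate.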


Step 5: The Mundlak (Correlated Random Effects) Approach

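The Mundlak approach adds the group means of the time-varying regressors to the RE model. If the coefficients on the means are jointly zero, the model reduces to standard RE, so RE is appropriate. If they are not, the coefficients on the time-varying regressors are the ones to rely on; in a balanced panel with the means of all time-varying regressors included, they coincide with the FE estimates. In this sense the approach nests both FE and RE.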

# Mundlak approach
df$exper_mean <- ave(df$exper, df$employee_id)
df$tenure_mean <- ave(df$tenure, df$employee_id)
df$union_mean <- ave(as.numeric(df$union), df$employee_id)

pdf_m <- pdata.frame(df, index = c("employee_id", "year"))
mundlak <- plm(log_wage ~ exper + I(exper^2) + tenure + educ + female + union +
             exper_mean + tenure_mean + union_mean,
             data = pdf_m, model = "random")
summary(mundlak)

# Joint test on means
library(car)
linearHypothesis(mundlak, c("exper_mean = 0", "tenure_mean = 0", "union_mean = 0"))
Requires: plm, car

Expected output:

=== Mundlak Model ===
Variable         Coeff      SE        t       p
-----------------------------------------------
exper           0.0308   0.003    10.27   0.000
I(exper^2)     -0.0004   0.000    -4.71   0.000
tenure          0.0202   0.004     5.05   0.000
educ            0.0510   0.007     7.29   0.000
female         -0.1182   0.055    -2.15   0.032
union           0.0823   0.012     6.86   0.000
exper_mean     -0.0012   0.008    -0.15   0.881
tenure_mean    -0.0015   0.012    -0.13   0.900
union_mean      0.0085   0.040     0.21   0.832
Wald test on group means (Mundlak test):
  F-statistic: 0.4521
  p-value: 0.7162
  If p > 0.05: RE is appropriate (means are not needed)

The group means are individually and jointly insignificant (joint p > 0.7), which is consistent with the RE assumption holding in this DGP.

Concept Check

In the Mundlak model, you add group means of time-varying regressors to the RE regression. If the coefficients on these means are all zero, what does that imply?
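A small experiment can help with this question by looking at the opposite case, where the individual effect is correlated with a regressor. The sketch below is a simplified stand-in for the lab's model: it uses a hypothetical one-regressor DGP and pooled OLS with the group mean added, rather than the RE GLS step, so that it runs in base R. Under these assumptions the coefficient on x in the augmented regression equals the within (FE) estimate, and the coefficient on the group mean picks up the correlation that would otherwise bias RE.

```r
set.seed(7)
N <- 150; T_per <- 6
id <- rep(1:N, each = T_per)

c_i   <- rnorm(N)                                   # persistent part of x
x     <- rep(c_i, each = T_per) + rnorm(N * T_per)
# Individual effect correlated with the persistent part of x: RE assumption fails
alpha <- rep(0.8 * c_i + rnorm(N, sd = 0.5), each = T_per)
y     <- 1 + 0.5 * x + alpha + rnorm(N * T_per, sd = 0.3)

x_bar <- ave(x, id)

b_pooled <- coef(lm(y ~ x))[["x"]]                            # biased upward
b_within <- coef(lm(I(y - ave(y, id)) ~ 0 + I(x - x_bar)))[[1]]
mundlak  <- lm(y ~ x + x_bar)                                 # pooled OLS with group mean

c(pooled = b_pooled, within = b_within, mundlak_x = coef(mundlak)[["x"]])
coef(mundlak)[["x_bar"]]   # clearly nonzero here, signalling the violation
```

When the coefficient on the group mean is zero, adding it changes nothing and the model is just RE; when it is nonzero, as here, the within estimate is the one that survives.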


Step 6: When RE Is Preferred

# Compare SEs
cat("=== SE Comparison ===\n")
vars <- c("exper", "tenure", "union")
for (v in vars) {
  fe_se <- summary(fe_model)$coefficients[v, "Std. Error"]
  re_se <- summary(re_model)$coefficients[v, "Std. Error"]
  cat(v, "- FE SE:", round(fe_se, 5), " RE SE:", round(re_se, 5),
      " Ratio:", round(re_se/fe_se, 3), "\n")
}

cat("\nRE estimates of time-invariant effects:\n")
cat("  Education:", coef(re_model)["educ"], "(true: 0.05)\n")
cat("  Female:", coef(re_model)["female"], "(true: -0.12)\n")

Expected output:

=== SE Comparison (time-varying regressors) ===
Variable        FE SE      RE SE      RE/FE
exper        0.00235    0.00200      0.851
tenure       0.00395    0.00300      0.759
union        0.01250    0.01100      0.880

RE SEs are smaller => more precise estimates.
This efficiency gain is real when the RE assumption holds.

RE estimate of education effect: 0.0512 (true: 0.05)
RE estimate of female penalty: -0.1185 (true: -0.12)
FE cannot estimate these at all.

RE standard errors are roughly 12–24% smaller than the FE standard errors (RE/FE ratios between 0.76 and 0.88), illustrating the efficiency advantage when the RE assumption holds.
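The efficiency claim is about sampling variability, so it can also be illustrated with a small Monte Carlo. The sketch below is an assumption-laden toy: a hypothetical one-regressor DGP in which the RE assumption holds, and an infeasible RE estimator that uses the true variance components instead of estimating them. It compares the spread of the FE and RE slope estimates across repeated samples.

```r
set.seed(99)
N <- 100; T_per <- 4; reps <- 500
sigma_a2 <- 0.25; sigma_e2 <- 1
# True quasi-demeaning parameter (known by construction in this toy setup)
theta <- 1 - sqrt(sigma_e2 / (sigma_e2 + T_per * sigma_a2))
id <- rep(1:N, each = T_per)

# Simple-regression slope with an intercept
slope <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

one_draw <- function() {
  x     <- rnorm(N * T_per) + rep(rnorm(N), each = T_per)
  alpha <- rep(rnorm(N, sd = sqrt(sigma_a2)), each = T_per)  # independent of x
  y     <- 1 + 0.5 * x + alpha + rnorm(N * T_per, sd = sqrt(sigma_e2))
  x_bar <- ave(x, id); y_bar <- ave(y, id)
  c(fe = slope(x - x_bar, y - y_bar),                        # within estimator
    re = slope(x - theta * x_bar, y - theta * y_bar))        # quasi-demeaned OLS
}

est <- replicate(reps, one_draw())   # 2 x reps matrix with rows "fe" and "re"
rowMeans(est)                        # both centred near the true slope 0.5
apply(est, 1, sd)                    # RE shows the smaller spread
```

The gain comes from the between variation in x, which FE discards; it shrinks as theta approaches 1 and disappears as a valid gain once the individual effect is correlated with x.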


Step 7: Exercises

Try these on your own:

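  1. Violate the RE assumption. Modify the simulation so that alpha_i is correlated with a time-varying regressor (e.g., alpha_i = 0.1 * starting experience + noise). Re-run the Hausman test and verify that it now rejects. Then make alpha_i depend only on education (e.g., alpha_i = 0.3*educ + noise) and note that the RE education coefficient is biased even though the Hausman test, which compares only the time-varying coefficients, has little power to detect it.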

  2. Hausman-Taylor estimator. When some time-invariant variables are endogenous, the Hausman-Taylor estimator uses the exogenous time-varying regressors as instruments for them. Implement this using plm::pht (R) or xthtaylor (Stata).

  3. Between estimator. Estimate the between model (regression on group means) and compare it with FE, RE, and pooled OLS. Show that RE is a matrix-weighted average of the between and within estimators.

  4. GLS by hand. Compute the quasi-demeaning parameter theta and implement the RE estimator as OLS on quasi-demeaned data. Verify your results match the packaged RE estimator.


Summary

In this lab you learned:

  • The RE estimator is a weighted average of the between and within estimators, trading bias risk for efficiency
  • RE requires that the individual effect be uncorrelated with the regressors — a strong assumption that must be tested
  • The Hausman test compares FE and RE; failure to reject supports using RE
  • The Mundlak approach nests both FE and RE by adding group means of time-varying regressors to the RE model
  • RE's key advantages are efficiency gains and the ability to estimate effects of time-invariant variables
  • In many observational panel settings, FE is the more conservative choice, though RE can be preferable when its assumptions are credible and you need to estimate time-invariant regressors