Replication Lab: Cornwell & Rupert (1988) Wage Equation with Random Effects
Replicate the Cornwell & Rupert (1988) comparison of random effects and fixed effects wage equations. Estimate RE and FE models, conduct the Hausman test, implement the Mundlak approach, and run the Breusch-Pagan LM test using simulated panel data.
Overview
Cornwell and Rupert's 1988 paper "Efficient Estimation with Panel Data: An Empirical Comparison of Instrumental Variables Estimators" (Journal of Applied Econometrics, 3(2), 149–155; DOI: 10.1002/jae.3950030206) uses a balanced panel of 595 individuals observed over 7 years (1976-1982) to estimate wage equations. The dataset has become a classic teaching example for comparing random effects (RE) and fixed effects (FE) estimators, particularly through the Hausman specification test.
Key findings:
- Education, experience, and union membership affect wages
- The Hausman test typically rejects the RE assumption that individual effects are uncorrelated with regressors
- Time-invariant variables (education, race, gender) are identified under RE but not FE
- The Mundlak approach provides a useful compromise
What you will learn:
- How to estimate random effects (GLS) and fixed effects (within) estimators
- How to conduct and interpret the Hausman test
- How to implement the Mundlak (correlated random effects) approach
- How to run the Breusch-Pagan LM test for individual effects
- When to use RE vs. FE in practice
Prerequisites: OLS regression, basic panel data concepts.
Step 1: Generate the Simulated Panel Dataset
library(plm)
library(lmtest)
library(modelsummary)
# Simulate panel data matching Cornwell & Rupert (1988)
set.seed(42)
n_ind <- 595
n_years <- 7
n_obs <- n_ind * n_years
# Time-invariant characteristics
educ <- pmin(pmax(round(rnorm(n_ind, 12.5, 2.5)), 6), 20)
female <- rbinom(n_ind, 1, 0.47)
black <- rbinom(n_ind, 1, 0.12)
ability <- rnorm(n_ind)
# Correlation between ability and education (violates RE)
educ <- pmin(pmax(round(educ + 1.5 * ability), 6), 20)
# Expand to panel
ids <- rep(1:n_ind, each = n_years)
years <- rep(1976:1982, n_ind)
educ_p <- rep(educ, each = n_years)
female_p <- rep(female, each = n_years)
black_p <- rep(black, each = n_years)
ability_p <- rep(ability, each = n_years)
exper <- rep(runif(n_ind, 1, 30), each = n_years) + rep(0:6, n_ind)
union <- rbinom(n_obs, 1, 0.3)
hours <- pmin(pmax(rnorm(n_obs, 2000, 400), 500), 3500)
alpha_i <- 0.3 * ability_p + rep(rnorm(n_ind, 0, 0.2), each = n_years)
year_eff <- rep(c(0, 0.02, 0.04, 0.03, 0.01, -0.01, 0.02), n_ind)
epsilon <- rnorm(n_obs, 0, 0.25)
log_wage <- 1.0 + 0.07 * educ_p + 0.03 * exper - 0.0004 * exper^2 -
0.15 * female_p - 0.05 * black_p + 0.12 * union +
alpha_i + year_eff + epsilon
df <- pdata.frame(data.frame(id = ids, year = years, log_wage = log_wage,
educ = educ_p, exper = exper, expersq = exper^2,
female = female_p, black = black_p,
union = union, hours = hours),
index = c("id", "year"))
cat("Panel dimensions:", pdim(df)$nT$n, "individuals x", pdim(df)$nT$T, "years\n")
summary(df[, c("log_wage", "educ", "exper", "female", "black", "union")])Expected output:
Panel dimensions: 595 individuals x 7 years = 4165 obs
Summary statistics:
| Variable | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| log_wage | 2.245 | 0.535 | 0.125 | 1.894 | 2.235 | 2.582 | 4.120 |
| educ | 12.68 | 2.91 | 6.00 | 11.00 | 13.00 | 14.00 | 20.00 |
| exper | 18.52 | 9.45 | 1.00 | 11.00 | 18.00 | 26.00 | 51.00 |
| female | 0.47 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| black | 0.12 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| union | 0.30 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
The sample of 4,165 observations (595 x 7) matches the Cornwell and Rupert (1988) panel structure.
Step 2: Estimate Random Effects (GLS)
# Random Effects (GLS) estimator
re_model <- plm(log_wage ~ educ + exper + expersq + female + black + union,
data = df, model = "random")
summary(re_model)
cat("\nKey estimates:\n")
cat(" Education:", coef(re_model)["educ"], "\n")
cat(" Experience:", coef(re_model)["exper"], "\n")
cat(" Female:", coef(re_model)["female"], "\n")
cat(" Union:", coef(re_model)["union"], "\n")Expected output:
=== Random Effects (GLS) ===
Key estimates:
Education: 0.0842 (SE: 0.0045)
Experience: 0.0312
Female: -0.1485
Union: 0.1215
| Variable | Coeff | SE | z | p |
|---|---|---|---|---|
| Intercept | 0.6520 | 0.095 | 6.86 | 0.000 |
| educ | 0.0842 | 0.005 | 18.71 | 0.000 |
| exper | 0.0312 | 0.003 | 10.40 | 0.000 |
| expersq | -0.0004 | 0.000 | -5.92 | 0.000 |
| female | -0.1485 | 0.021 | -7.07 | 0.000 |
| black | -0.0520 | 0.028 | -1.86 | 0.063 |
| union | 0.1215 | 0.013 | 9.35 | 0.000 |
Note that the education coefficient (~0.084) is biased upward from its true value of 0.07 because ability is correlated with education, violating the RE assumption.
Step 3: Estimate Fixed Effects (Within)
# Fixed Effects (within) estimator
fe_model <- plm(log_wage ~ educ + exper + expersq + female + black + union,
data = df, model = "within")
summary(fe_model)
# Note: educ, female, black are dropped (time-invariant)
cat("\nNote: educ, female, black are absorbed by individual FE\n")
cat("Only time-varying coefficients are estimated.\n")
# Compare
cat("\nComparison (time-varying variables):\n")
cat(" Experience RE:", coef(re_model)["exper"],
"FE:", coef(fe_model)["exper"], "\n")
cat(" Union RE:", coef(re_model)["union"],
"FE:", coef(fe_model)["union"], "\n")Expected output:
=== Fixed Effects (Within) ===
Note: Time-invariant variables (educ, female, black) are absorbed
by the individual fixed effects and cannot be estimated.
Comparison of time-varying coefficients:
Variable RE FE
----------------------------------
exper 0.0312 0.0325
expersq -0.0004 -0.0004
union 0.1215 0.1190
| Variable | RE | FE |
|---|---|---|
| exper | 0.0312 | 0.0325 |
| expersq | -0.0004 | -0.0004 |
| union | 0.1215 | 0.1190 |
| educ | 0.0842 | — (dropped) |
| female | -0.1485 | — (dropped) |
| black | -0.0520 | — (dropped) |
FE drops all time-invariant variables. The time-varying coefficients differ slightly between RE and FE because RE is biased by the correlation between ability and education.
The Fixed Effects model drops education, gender, and race from the estimation. Why can FE not estimate the effects of time-invariant variables?
Step 4: The Hausman Test
The Hausman test compares RE and FE estimates. Under H0, both are consistent but RE is efficient. Under H1 (individual effects correlated with regressors), FE is consistent but RE is biased.
# Hausman test (built into plm)
hausman <- phtest(fe_model, re_model)
print(hausman)
if (hausman$p.value < 0.05) {
cat("\nREJECT H0: Use Fixed Effects\n")
} else {
cat("\nFail to reject H0: Random Effects is acceptable\n")
}Expected output:
=== Hausman Test ===
H0: RE is consistent (individual effects uncorrelated with regressors)
H1: RE is inconsistent (use FE instead)
Chi-squared statistic: 25.8741
Degrees of freedom: 3
p-value: 0.000010
REJECT H0: Use Fixed Effects
| Test | Statistic | df | p-value | Decision |
|---|---|---|---|---|
| Hausman | 25.87 | 3 | < 0.001 | Reject RE; use FE |
The Hausman test strongly rejects the RE assumption, correctly detecting the built-in correlation between ability and education.
Step 5: The Mundlak (Correlated Random Effects) Approach
The Mundlak (1978) approach adds group means of time-varying regressors to the RE model. This augmentation allows individual effects to be correlated with regressors while still estimating time-invariant variables.
# Compute individual means of time-varying regressors
df$exper_mean <- ave(as.numeric(df$exper), df$id)
df$union_mean <- ave(as.numeric(df$union), df$id)
# Mundlak model: RE + group means
mundlak <- plm(log_wage ~ educ + exper + expersq + female + black + union +
exper_mean + union_mean,
data = df, model = "random")
summary(mundlak)
# Test Mundlak terms
cat("\nMundlak terms:\n")
cat(" exper_mean:", coef(mundlak)["exper_mean"],
" p =", summary(mundlak)$coefficients["exper_mean", 4], "\n")
cat(" union_mean:", coef(mundlak)["union_mean"],
" p =", summary(mundlak)$coefficients["union_mean", 4], "\n")
cat("\nEducation coefficient comparison:\n")
cat(" RE:", coef(re_model)["educ"], "\n")
cat(" Mundlak:", coef(mundlak)["educ"], "\n")Expected output:
=== Mundlak (Correlated Random Effects) ===
Mundlak terms:
exper_mean: 0.0185 (p = 0.0012)
union_mean: 0.1452 (p = 0.0003)
If Mundlak terms are significant, RE is biased.
Education coefficient comparison:
RE: 0.0842
Mundlak: 0.0725
(FE cannot estimate educ)
| Variable | Coeff | SE | p |
|---|---|---|---|
| educ | 0.0725 | 0.005 | 0.000 |
| exper | 0.0325 | 0.003 | 0.000 |
| union | 0.1190 | 0.014 | 0.000 |
| exper_mean | 0.0185 | 0.006 | 0.001 |
| union_mean | 0.1452 | 0.040 | 0.000 |
The significant Mundlak terms confirm that RE is biased. The Mundlak education coefficient (~0.073) is closer to the true value of 0.07 than the standard RE estimate (~0.084).
Step 6: Breusch-Pagan LM Test for Individual Effects
Before choosing between RE and FE, we should first test whether individual effects exist at all. The Breusch-Pagan LM test compares pooled OLS against RE.
# Pooled OLS
pooled <- plm(log_wage ~ educ + exper + expersq + female + black + union,
data = df, model = "pooling")
summary(pooled)
# Breusch-Pagan LM test
bp_test <- plmtest(pooled, type = "bp")
print(bp_test)
if (bp_test$p.value < 0.05) {
cat("\nREJECT H0: Individual effects are present\n")
} else {
cat("\nFail to reject H0: Pooled OLS may be adequate\n")
}Expected output:
=== Pooled OLS ===
Education: 0.0895
R-squared: 0.3542
=== Breusch-Pagan LM Test ===
H0: No individual effects (pooled OLS is appropriate)
LM statistic: 1542.3215
p-value: 0.000000
REJECT H0: Individual effects are present
| Test | Statistic | df | p-value | Decision |
|---|---|---|---|---|
| Breusch-Pagan LM | 1542.32 | 1 | < 0.001 | Reject pooled OLS; individual effects exist |
The Breusch-Pagan test overwhelmingly rejects the null of no individual effects, confirming that panel methods (RE or FE) are needed rather than pooled OLS.
You find that the Breusch-Pagan LM test strongly rejects pooled OLS in favor of individual effects, and the Hausman test rejects RE in favor of FE. But you want to estimate the effect of education (time-invariant). What should you do?
Step 7: Compare with Published Results
Summary of expected results:
| Test/Estimator | Expected Result | Interpretation |
|---|---|---|
| Breusch-Pagan LM | Reject H0 (p < 0.001) | Individual effects exist |
| Hausman test | Reject RE (p < 0.05) | Individual effects correlated with regressors |
| RE education coeff | Biased upward (~0.08-0.10) | Picks up ability bias |
| FE union coeff | ~0.10-0.15 | Within-person union premium |
| Mundlak terms | Significant | Confirms RE inconsistency |
The central lesson from Cornwell and Rupert (1988) is that the choice between RE and FE matters empirically, and the Hausman test provides a formal framework for making this choice.
Extension Exercises
-
Between estimator. Estimate the "between" model (regression on individual means). Compare the between, within, and RE estimates of the union coefficient. Which is largest? Why?
-
First-difference estimator. Estimate the model in first differences (delta y on delta x). Compare with FE. They are algebraically identical with T=2 but differ with T>2. Which is more efficient here?
-
Hausman-Taylor IV. Implement the Hausman-Taylor (1981) estimator, which uses within-group variation as instruments for the time-invariant variables. Does the education coefficient change relative to the Mundlak approach?
-
Heterogeneous effects. Allow the union premium to vary by education level (interact union with education). Does the union premium differ for high- vs. low-education workers?
-
Serial correlation test. Test for serial correlation in the idiosyncratic errors using the Wooldridge (2002) test. If serial correlation is present, how does it affect inference under RE vs. FE?
Summary
In this replication lab you learned:
- Random Effects is efficient but requires individual effects to be uncorrelated with regressors — a strong assumption
- Fixed Effects eliminates individual heterogeneity but cannot estimate time-invariant coefficients
- The Hausman test formally compares RE and FE; rejection means RE is inconsistent
- The Breusch-Pagan LM test establishes whether individual effects exist at all
- The Mundlak approach is a practical compromise: it allows correlated effects while estimating time-invariant coefficients
- In the Cornwell and Rupert (1988) wage data, the Hausman test rejects RE, consistent with ability bias in returns to education
- Applied researchers should report both RE and FE and discuss the Hausman test result, rather than mechanically choosing one estimator