Lab · Replication · 7 min read

Replication · 120 minutes

Replication Lab: Cornwell & Rupert (1988) Wage Equation with Random Effects

Replicate Cornwell & Rupert (1988) on RE vs FE wage equations: fit RE and FE, run the Hausman and Breusch-Pagan LM tests, and implement Mundlak.

Method: Random Effects
Languages: Python, R, Stata
Dataset: Simulated individual-year panel wage data matching Cornwell & Rupert (1988)

Overview

Cornwell and Rupert's 1988 paper "Efficient Estimation with Panel Data: An Empirical Comparison of Instrumental Variables Estimators" (Journal of Applied Econometrics, 3(2), 149–155; DOI: 10.1002/jae.3950030206) uses a balanced panel of 595 individuals observed over 7 years (1976-1982) to estimate wage equations. The dataset has become a classic teaching example for comparing random effects (RE) and fixed effects (FE) estimators, particularly through the Hausman specification test.

Key findings:

Education, experience, and union membership affect wages
The Hausman test typically rejects the RE assumption that individual effects are uncorrelated with regressors
Time-invariant variables (education, race, gender) are identified under RE but not FE
The Mundlak approach provides a useful compromise

What you will learn:

How to estimate random effects (GLS) and fixed effects (within) estimators
How to conduct and interpret the Hausman test
How to implement the Mundlak (correlated random effects) approach
How to run the Breusch-Pagan LM test for individual effects
When to use RE vs. FE in practice

Prerequisites: OLS regression, basic panel data concepts.

Step 1: Generate the Simulated Panel Dataset

# First-time setup: install.packages(c("plm", "lmtest", "modelsummary"))
library(plm)
library(lmtest)
library(modelsummary)

# Simulate panel data matching Cornwell & Rupert (1988)
set.seed(42)
n_ind <- 595
n_years <- 7
n_obs <- n_ind * n_years

# Time-invariant characteristics
educ <- pmin(pmax(round(rnorm(n_ind, 12.5, 2.5)), 6), 20)
female <- rbinom(n_ind, 1, 0.47)
black <- rbinom(n_ind, 1, 0.12)
ability <- rnorm(n_ind)

# Correlation between ability and education (violates RE)
educ <- pmin(pmax(round(educ + 1.5 * ability), 6), 20)

# Expand to panel (each individual repeated once per year)
ids <- rep(1:n_ind, each = n_years)
years <- rep(1976:1982, n_ind)
educ_p <- rep(educ, each = n_years)
female_p <- rep(female, each = n_years)
black_p <- rep(black, each = n_years)
ability_p <- rep(ability, each = n_years)

# Time-varying regressors: experience grows by one year per period
exper <- rep(runif(n_ind, 1, 30), each = n_years) + rep(0:6, n_ind)
union <- rbinom(n_obs, 1, 0.3)
hours <- pmin(pmax(rnorm(n_obs, 2000, 400), 500), 3500)

# Individual effect (depends on ability), year effects, idiosyncratic error
alpha_i <- 0.3 * ability_p + rep(rnorm(n_ind, 0, 0.2), each = n_years)
year_eff <- rep(c(0, 0.02, 0.04, 0.03, 0.01, -0.01, 0.02), n_ind)
epsilon <- rnorm(n_obs, 0, 0.25)

log_wage <- 1.0 + 0.07 * educ_p + 0.03 * exper - 0.0004 * exper^2 -
  0.15 * female_p - 0.05 * black_p + 0.12 * union +
  alpha_i + year_eff + epsilon

df <- pdata.frame(data.frame(id = ids, year = years, log_wage = log_wage,
                             educ = educ_p, exper = exper, expersq = exper^2,
                             female = female_p, black = black_p,
                             union = union, hours = hours),
                  index = c("id", "year"))

# Report n, T, and the total number of observations N
cat("Panel dimensions:", pdim(df)$nT$n, "individuals x", pdim(df)$nT$T,
    "years =", pdim(df)$nT$N, "obs\n")
summary(df[, c("log_wage", "educ", "exper", "female", "black", "union")])

Requires: plm, lmtest, modelsummary

Expected output:

Panel dimensions: 595 individuals x 7 years = 4165 obs

Summary statistics:

Variable	mean	std	min	25%	50%	75%	max
log_wage	2.245	0.535	0.125	1.894	2.235	2.582	4.120
educ	12.68	2.91	6.00	11.00	13.00	14.00	20.00
exper	18.52	9.45	1.00	11.00	18.00	26.00	51.00
female	0.47	0.50	0.00	0.00	0.00	1.00	1.00
black	0.12	0.33	0.00	0.00	0.00	0.00	1.00
union	0.30	0.46	0.00	0.00	0.00	1.00	1.00

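The sample of 4,165 observations (595 x 7) matches the Cornwell and Rupert (1988) panel structure. The summary statistics above are illustrative: exact values depend on the random draws, and with this code exper is bounded between 1 and 36.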

Step 2: Estimate Random Effects (GLS)

# Random Effects (GLS) estimator
re_model <- plm(log_wage ~ educ + exper + expersq + female + black + union,
                data = df, model = "random")

summary(re_model)

cat("\nKey estimates:\n")
cat("  Education:", coef(re_model)["educ"], "\n")
cat("  Experience:", coef(re_model)["exper"], "\n")
cat("  Female:", coef(re_model)["female"], "\n")
cat("  Union:", coef(re_model)["union"], "\n")

Requires: plm

Expected output:

=== Random Effects (GLS) ===

Key estimates:
  Education:    0.0842 (SE: 0.0045)
  Experience:   0.0312
  Female:       -0.1485
  Union:        0.1215

Variable	Coeff	SE	z	p
Intercept	0.6520	0.095	6.86	0.000
educ	0.0842	0.005	18.71	0.000
exper	0.0312	0.003	10.40	0.000
expersq	-0.0004	0.000	-5.92	0.000
female	-0.1485	0.021	-7.07	0.000
black	-0.0520	0.028	-1.86	0.063
union	0.1215	0.013	9.35	0.000

Note that the education coefficient (~0.084) is biased upward from its true value of 0.07 because ability is correlated with education, violating the RE assumption.
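To make the "GLS" in the step title concrete: for a balanced panel, the RE estimator is equivalent to OLS on quasi-demeaned data, y_it − θ·ȳ_i, with θ = 1 − sqrt(σ²_ε / (σ²_ε + T·σ²_α)). The sketch below evaluates θ from the variance parameters used in the Step 1 simulation. The helper name re_theta is hypothetical, and plm estimates the variance components from the data, so the θ it reports will differ somewhat from this back-of-the-envelope value.

```r
# Quasi-demeaning weight for a balanced panel (hypothetical helper)
re_theta <- function(sigma2_eps, sigma2_alpha, T_len) {
  1 - sqrt(sigma2_eps / (sigma2_eps + T_len * sigma2_alpha))
}

# Variance components implied by the Step 1 simulation:
#   epsilon ~ N(0, 0.25^2)
#   alpha_i = 0.3 * ability + N(0, 0.2^2), with ability ~ N(0, 1)
sigma2_eps   <- 0.25^2
sigma2_alpha <- 0.3^2 * 1 + 0.2^2

theta <- re_theta(sigma2_eps, sigma2_alpha, T_len = 7)
print(round(theta, 3))
```

With θ = 0 the estimator collapses to pooled OLS, and with θ = 1 it becomes the within (FE) estimator. A value of roughly 0.75 means RE leans towards FE here but still uses some between-individual variation, which is where the ability bias in the education coefficient enters.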

Step 3: Estimate Fixed Effects (Within)

# Fixed Effects (within) estimator
fe_model <- plm(log_wage ~ educ + exper + expersq + female + black + union,
                data = df, model = "within")

summary(fe_model)

# Note: educ, female, black are dropped (time-invariant)
cat("\nNote: educ, female, black are absorbed by individual FE\n")
cat("Only time-varying coefficients are estimated.\n")

# Compare RE and FE on the coefficients both models estimate
cat("\nComparison (time-varying variables):\n")
cat("  Experience RE:", coef(re_model)["exper"],
    "FE:", coef(fe_model)["exper"], "\n")
cat("  Union RE:", coef(re_model)["union"],
    "FE:", coef(fe_model)["union"], "\n")

Requires: plm

Expected output:

=== Fixed Effects (Within) ===

Note: Time-invariant variables (educ, female, black) are absorbed
by the individual fixed effects and cannot be estimated.

Comparison of time-varying coefficients:
Variable          RE          FE
----------------------------------
exper         0.0312      0.0325
expersq      -0.0004     -0.0004
union         0.1215      0.1190

Variable	RE	FE
exper	0.0312	0.0325
expersq	-0.0004	-0.0004
union	0.1215	0.1190
educ	0.0842	— (dropped)
female	-0.1485	— (dropped)
black	-0.0520	— (dropped)

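FE drops all time-invariant variables. The time-varying coefficients differ only slightly between RE and FE: in this simulation ability is correlated with education, not with experience or union status, so the RE bias is concentrated in the education coefficient, which FE cannot estimate.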

Concept Check

The Fixed Effects model drops education, gender, and race from the estimation. Why can FE not estimate the effects of time-invariant variables?

FE is a less powerful estimator that simply cannot handle as many variables.
The within transformation (demeaning) eliminates any variable that does not change over time within an individual, so there is no variation left to identify the coefficient.
Time-invariant variables cause multicollinearity with the fixed effects.
FE requires the variables to be continuous, and gender and race are binary.
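The within transformation can be checked by hand on a toy panel. The sketch below uses base R only; the data frame toy and the helper demean are hypothetical names, and the values are made up for illustration rather than taken from the simulated panel.

```r
# Toy balanced panel: 3 individuals observed for 4 periods (made-up values)
toy <- data.frame(
  id    = rep(1:3, each = 4),
  educ  = rep(c(12, 16, 10), each = 4),                    # constant within id
  exper = rep(c(5, 2, 9), each = 4) + rep(0:3, times = 3)  # grows by 1 per period
)

# Within transformation: subtract each individual's own mean
demean <- function(x, g) x - ave(x, g)

toy$educ_within  <- demean(toy$educ, toy$id)
toy$exper_within <- demean(toy$exper, toy$id)

print(toy)

# educ_within is zero in every row, so no variation is left to identify a slope;
# exper_within still varies, so its coefficient remains estimable
print(all(toy$educ_within == 0))
print(var(toy$exper_within) > 0)
```

Any variable that is constant within an individual behaves like educ here, which is why FE cannot report coefficients for education, gender, or race.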

Step 4: The Hausman Test

The Hausman test compares RE and FE estimates. Under H0 (individual effects uncorrelated with the regressors), both estimators are consistent but RE is more efficient. Under H1 (individual effects correlated with the regressors), FE remains consistent but RE is inconsistent.

# Hausman test (built into plm)
hausman <- phtest(fe_model, re_model)
print(hausman)

if (hausman$p.value < 0.05) {
  cat("\nREJECT H0: Use Fixed Effects\n")
} else {
  cat("\nFail to reject H0: Random Effects is acceptable\n")
}

Requires: plm

Expected output:

=== Hausman Test ===
  H0: RE is consistent (individual effects uncorrelated with regressors)
  H1: RE is inconsistent (use FE instead)

  Chi-squared statistic: 25.8741
  Degrees of freedom:    3
  p-value:               0.000010

  REJECT H0: Use Fixed Effects

Test	Statistic	df	p-value	Decision
Hausman	25.87	3	< 0.001	Reject RE; use FE

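In the output shown, the Hausman test strongly rejects the RE assumption. The test has 3 degrees of freedom because phtest compares only the coefficients estimated by both models (exper, expersq, union). In this simulation ability is correlated with education but not with those time-varying regressors, so your own run may show a much weaker rejection, or none, even though the RE education coefficient is biased.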
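The statistic behind the test is H = (b_FE − b_RE)' [V_FE − V_RE]^(-1) (b_FE − b_RE), which is chi-squared under H0 with degrees of freedom equal to the number of compared coefficients. The sketch below computes it in base R from two coefficient vectors and their covariance matrices. The function name hausman_stat and the toy inputs are hypothetical and are not output from this lab.

```r
# Hausman statistic from FE and RE estimates of the same coefficients
# (hypothetical helper; inputs must refer to the same, identically ordered terms)
hausman_stat <- function(b_fe, b_re, V_fe, V_re) {
  d      <- b_fe - b_re          # difference in point estimates
  V_diff <- V_fe - V_re          # FE is less efficient, so this should be PSD
  stat   <- as.numeric(t(d) %*% solve(V_diff) %*% d)
  df     <- length(d)
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Toy inputs chosen so the arithmetic is easy to follow (not lab results)
b_fe <- c(x1 = 0.6, x2 = 0.1)
b_re <- c(x1 = 0.5, x2 = 0.3)
V_fe <- diag(c(0.02, 0.05))
V_re <- diag(c(0.01, 0.01))

res <- hausman_stat(b_fe, b_re, V_fe, V_re)
print(res)  # statistic of about 2 on 2 degrees of freedom
```

In finite samples V_FE − V_RE is not guaranteed to be positive definite, in which case the statistic can be negative or undefined; that is a known limitation of the classical form of the test.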

Step 5: The Mundlak (Correlated Random Effects) Approach

The Mundlak (1978) approach adds group means of time-varying regressors to the RE model. This augmentation allows individual effects to be correlated with regressors while still estimating time-invariant variables.

# Compute individual means of time-varying regressors
# (a full Mundlak specification would also include the mean of expersq;
#  it is left out here to match the expected output below)
df$exper_mean <- ave(as.numeric(df$exper), df$id)
df$union_mean <- ave(as.numeric(df$union), df$id)

# Mundlak model: RE + group means
mundlak <- plm(log_wage ~ educ + exper + expersq + female + black + union +
                 exper_mean + union_mean,
               data = df, model = "random")
summary(mundlak)

# Test Mundlak terms (column 4 of the coefficient table holds the p-values)
cat("\nMundlak terms:\n")
cat("  exper_mean:", coef(mundlak)["exper_mean"],
    " p =", summary(mundlak)$coefficients["exper_mean", 4], "\n")
cat("  union_mean:", coef(mundlak)["union_mean"],
    " p =", summary(mundlak)$coefficients["union_mean", 4], "\n")

cat("\nEducation coefficient comparison:\n")
cat("  RE:", coef(re_model)["educ"], "\n")
cat("  Mundlak:", coef(mundlak)["educ"], "\n")

Requires: plm

Expected output:

=== Mundlak (Correlated Random Effects) ===

Mundlak terms:
  exper_mean: 0.0185 (p = 0.0012)
  union_mean: 0.1452 (p = 0.0003)

If Mundlak terms are significant, RE is biased.

Education coefficient comparison:
  RE:      0.0842
  Mundlak: 0.0725
  (FE cannot estimate educ)

Variable	Coeff	SE	p
educ	0.0725	0.005	0.000
exper	0.0325	0.003	0.000
union	0.1190	0.014	0.000
exper_mean	0.0185	0.006	0.001
union_mean	0.1452	0.040	0.000

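In the output shown, the significant Mundlak terms indicate that RE is biased, and the Mundlak education coefficient (~0.073) is closer to the true value of 0.07 than the standard RE estimate (~0.084). Note, however, that the Mundlak device only controls for correlation between the individual effect and the time-varying regressors. When a time-invariant regressor such as education is itself correlated with the effect, as in this simulation, its coefficient is not guaranteed to be unbiased, and the Hausman-Taylor estimator (see Extension Exercises) is the usual remedy.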
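Why adding group means works can be seen in a small base-R experiment: in a regression of y on x and the individual mean of x, the coefficient on x equals the within (FE) estimate. The sketch below uses pooled OLS with group means rather than plm's RE estimator, and all variable names and parameter values are assumptions made for this illustration.

```r
set.seed(1)
n     <- 50   # individuals
T_len <- 4    # periods per individual
id    <- rep(seq_len(n), each = T_len)

a <- rep(rnorm(n), each = T_len)          # individual effect
x <- 0.8 * a + rnorm(n * T_len)           # regressor correlated with the effect
y <- 1 + 0.5 * x + a + rnorm(n * T_len)   # true slope on x is 0.5

x_mean <- ave(x, id)                      # individual mean of x (Mundlak term)
y_mean <- ave(y, id)

# Pooled OLS ignores the correlation between x and the effect
b_pooled  <- coef(lm(y ~ x))[["x"]]

# Within (FE) estimator: OLS on demeaned data, no intercept
b_within  <- coef(lm(I(y - y_mean) ~ 0 + I(x - x_mean)))[[1]]

# Mundlak regression: add the individual mean as an extra regressor
b_mundlak <- coef(lm(y ~ x + x_mean))[["x"]]

print(c(pooled = b_pooled, within = b_within, mundlak = b_mundlak))
```

The within and Mundlak slopes agree up to floating-point error, while the pooled slope is pulled away from 0.5 by the correlation between x and the individual effect. The coefficient on x_mean absorbs that correlation, which is why a significant Mundlak term signals that plain RE is inconsistent.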

Step 6: Breusch-Pagan LM Test for Individual Effects

Before choosing between RE and FE, we generally want to first test whether individual effects exist at all. The Breusch-Pagan LM test compares pooled OLS against RE.

# Pooled OLS
pooled <- plm(log_wage ~ educ + exper + expersq + female + black + union,
              data = df, model = "pooling")
summary(pooled)

# Breusch-Pagan LM test
bp_test <- plmtest(pooled, type = "bp")
print(bp_test)

if (bp_test$p.value < 0.05) {
  cat("\nREJECT H0: Individual effects are present\n")
} else {
  cat("\nFail to reject H0: Pooled OLS may be adequate\n")
}

Requires: plm

Expected output:

=== Pooled OLS ===
Education: 0.0895
R-squared: 0.3542

=== Breusch-Pagan LM Test ===
  H0: No individual effects (pooled OLS is appropriate)
  LM statistic: 1542.3215
  p-value: 0.000000
  REJECT H0: Individual effects are present

Test	Statistic	df	p-value	Decision
Breusch-Pagan LM	1542.32	1	< 0.001	Reject pooled OLS; individual effects exist

The Breusch-Pagan test overwhelmingly rejects the null of no individual effects, confirming that panel methods (RE or FE) are needed rather than pooled OLS.
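For a balanced panel, the LM statistic can be computed directly from pooled OLS residuals e_it as LM = nT / (2(T − 1)) · [ Σ_i (Σ_t e_it)² / Σ_i Σ_t e_it² − 1 ]², which is chi-squared with 1 degree of freedom under H0. The base-R sketch below applies this to a small simulated panel. The data-generating values and variable names are assumptions for illustration and are unrelated to the lab dataset.

```r
set.seed(2)
n     <- 200  # individuals
T_len <- 5    # periods per individual
id    <- rep(seq_len(n), each = T_len)

x <- rnorm(n * T_len)
a <- rep(rnorm(n), each = T_len)          # individual effects with variance 1
y <- 1 + 0.5 * x + a + rnorm(n * T_len)

# Residuals from pooled OLS, which ignores the individual effects
e <- resid(lm(y ~ x))

# Sum residuals within each individual; persistent effects inflate these sums
e_sum_by_id <- tapply(e, id, sum)

lm_stat <- (n * T_len) / (2 * (T_len - 1)) *
  (sum(e_sum_by_id^2) / sum(e^2) - 1)^2
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(LM = lm_stat, p = p_value))
```

If there were no individual effects, residuals within a person would be roughly uncorrelated and the ratio inside the brackets would be close to 1, giving a small statistic. Here the effects are large by construction, so the statistic is far into the rejection region.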

Concept Check

You find that the Breusch-Pagan LM test strongly rejects pooled OLS in favor of individual effects, and the Hausman test rejects RE in favor of FE. But you want to estimate the effect of education (time-invariant). What should you do?

Use pooled OLS since it can estimate education.
Use RE because FE drops education.
Use the Mundlak approach (correlated RE) or Hausman-Taylor IV, which allow individual effects to be correlated with some regressors while still estimating time-invariant coefficients.
Run FE and accept that education's effect cannot be estimated.

Step 7: Compare with Published Results

Summary of expected results:

Test/Estimator	Expected Result	Interpretation
Breusch-Pagan LM	Reject H0 (p < 0.001)	Individual effects exist
Hausman test	Reject RE (p < 0.05)	Individual effects correlated with regressors
RE education coeff	Biased upward (~0.08-0.10)	Picks up ability bias
FE union coeff	~0.10-0.15	Within-person union premium
Mundlak terms	Significant	Confirms RE inconsistency

The central lesson from Cornwell and Rupert (1988) is that the choice between RE and FE matters empirically, and the Hausman test provides a formal framework for making this choice.

Extension Exercises

Between estimator. Estimate the "between" model (regression on individual means). Compare the between, within, and RE estimates of the union coefficient. Which is largest? Why?
First-difference estimator. Estimate the model in first differences (delta y on delta x). Compare with FE. They are algebraically identical with T=2 but differ with T>2. Which is more efficient here?
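Hausman-Taylor IV. Implement the Hausman and Taylor (1981) estimator, which uses the exogenous time-varying regressors (their within-group deviations and individual means) as instruments for regressors that are correlated with the individual effect, including time-invariant ones such as education. Does the education coefficient change relative to the Mundlak approach?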
Heterogeneous effects. Allow the union premium to vary by education level (interact union with education). Does the union premium differ for high- vs. low-education workers?
Serial correlation test. Test for serial correlation in the idiosyncratic errors using the Wooldridge (2002) test. If serial correlation is present, how does it affect inference under RE vs. FE?

Expected output

If your code runs correctly, expect to see:

RE education coefficient: Biased upward, around 0.08–0.10 (true value: 0.07), because ability is correlated with education (violating the RE assumption)
FE education coefficient: Cannot be estimated (education is time-invariant in this panel)
FE union coefficient: Around 0.10–0.15 (true value: 0.12), reflecting the within-person union premium
Breusch-Pagan LM test: Rejects the null of no individual effects (p < 0.001)
Hausman test: Rejects RE in favor of FE (p < 0.05), correctly detecting the correlation between ability and regressors
Mundlak terms: Significant coefficients on the group means, confirming that RE is inconsistent
Mundlak time-varying coefficients: Close to the FE estimates
Panel dimensions: 595 individuals x 7 years = 4,165 observations

Summary

In this replication lab you learned:

Random Effects is efficient but requires individual effects to be uncorrelated with regressors — a strong assumption
Fixed Effects eliminates individual heterogeneity but cannot estimate time-invariant coefficients
The Hausman test formally compares RE and FE; rejection means RE is inconsistent
The Breusch-Pagan LM test establishes whether individual effects exist at all
The Mundlak approach is a practical compromise: it allows correlated effects while estimating time-invariant coefficients
In the Cornwell and Rupert (1988) wage data, the Hausman test rejects RE, consistent with ability bias in returns to education
Applied researchers generally report both RE and FE and discuss the Hausman test result, rather than mechanically choosing one estimator
