Lab·replication·8 min read

replication120 minutes

Replication Lab: Hausman, Hall & Griliches (1984) Patents and R&D

Replicate Hausman, Hall & Griliches (1984) on R&D and patents: fit Poisson and negative binomial, test overdispersion, compare IRRs, try zero-inflated models.

Method: Poisson / Negative Binomial
Languages: Python, R, Stata
Dataset: Simulated firm-level patent count data matching HHG (1984) summary statistics

Overview

Hausman et al. (1984)'s paper "Econometric Models for Count Data with an Application to the Patents-R&D Relationship" is one of the foundational papers in count data econometrics. The paper develops econometric methods for modeling patent counts as a function of R&D expenditure and other firm characteristics, demonstrating that the Poisson model is often too restrictive and that Negative Binomial models better accommodate the overdispersion observed in patent data.

Key findings from HHG (1984):

The R&D elasticity of patenting is approximately 0.5-0.7 (a 10% increase in R&D is associated with a 5-7% increase in patents)
Patent counts exhibit significant overdispersion (variance far exceeds the mean)
The Negative Binomial model fits the data substantially better than Poisson
Firm size and sector matter for patent productivity

What you will learn:

How to estimate Poisson regression and interpret coefficients as log incidence rates
How to test for overdispersion and when to switch to Negative Binomial
How to compute and compare Incidence Rate Ratios (IRRs)
How to estimate zero-inflated models for excess zeros
How to assess model fit for count data

Prerequisites: OLS regression, maximum likelihood basics.

Step 1: Generate the Simulated Dataset

We simulate firm-level data with patent counts, R&D spending, firm size, and sector indicators.

1# First-time setup: install.packages(c("MASS", "pscl", "modelsummary", "AER"))
2library(MASS)
3library(pscl)
4library(modelsummary)
5library(AER)
6
7# Simulate firm-level patent data matching HHG (1984) patterns
8set.seed(42)
9n <- 346
10
11log_rd <- rnorm(n, 3.5, 1.5)
12log_sales <- log_rd + rnorm(n, 2.0, 0.8)
13sector <- sample(c("pharma", "tech", "manufacturing", "other"),
14               n, replace = TRUE, prob = c(0.25, 0.30, 0.30, 0.15))
15age <- pmin(pmax(round(rnorm(n, 30, 15)), 3), 100)
16
17sector_pharma <- as.integer(sector == "pharma")
18sector_tech <- as.integer(sector == "tech")
19sector_mfg <- as.integer(sector == "manufacturing")
20
21log_mu <- -1.0 + 0.6 * log_rd + 0.05 * log_sales +
22        0.3 * sector_pharma + 0.2 * sector_tech + 0.005 * age
23mu <- exp(log_mu)
24
25alpha <- 0.8
26patents <- rnbinom(n, size = 1/alpha, mu = mu)
27
28df <- data.frame(patents, log_rd, log_sales, sector, sector_pharma,
29               sector_tech, sector_mfg, age, rd = exp(log_rd))
30
31cat("Mean patents:", mean(df$patents), "\n")
32cat("Variance:", var(df$patents), "\n")
33cat("Var/Mean ratio:", var(df$patents) / mean(df$patents), "\n")
34cat("Zeros:", sum(df$patents == 0), "\n")
35summary(df)

RequiresMASS pscl modelsummary AER

Expected output:

Statistic	Value
Sample size	346 firms
Mean patents	~10–20
Variance of patents	~200–600
Var/Mean ratio	~15–40 (severe overdispersion)
Zeros	~30–60 (9–17%)
Max patents	~100–250

Sample data preview (first 5 rows):

firm	patents	log_rd	log_sales	sector	age
0	8	3.24	5.85	tech	32
1	0	1.12	3.01	other	8
2	45	5.18	7.42	pharma	41
3	3	2.89	4.76	mfg	25
4	22	4.51	6.23	tech	38

Summary statistics by sector:

Sector	N	Mean Patents	Mean log_rd	Mean age
pharma	~87	~18	~3.5	~30
tech	~104	~15	~3.5	~30
manufacturing	~104	~12	~3.5	~30
other	~52	~10	~3.5	~30

Step 2: Estimate the Poisson Model

1# Poisson regression
2poisson <- glm(patents ~ log_rd + log_sales + sector_pharma +
3             sector_tech + sector_mfg + age,
4             data = df, family = poisson)
5
6summary(poisson)
7
8# Incidence Rate Ratios
9cat("\nIncidence Rate Ratios:\n")
10print(exp(cbind(IRR = coef(poisson), confint(poisson))))
11
12cat("\nR&D elasticity:", coef(poisson)["log_rd"], "\n")

Expected output:

Variable	Coefficient	SE	z	p-value	IRR
Intercept	~-1.0	~0.15	~-6.7	<0.001	—
log_rd	~0.60	~0.02	~30.0	<0.001	~1.82
log_sales	~0.05	~0.02	~2.5	0.012	~1.05
sector_pharma	~0.30	~0.04	~7.5	<0.001	~1.35
sector_tech	~0.20	~0.04	~5.0	<0.001	~1.22
sector_mfg	~0.00	~0.04	~0.0	~0.9	~1.00
age	~0.005	~0.001	~5.0	<0.001	~1.005

The R&D elasticity (coefficient on log_rd) of approximately 0.60 is within the 0.5–0.7 range reported in Hausman et al. (1984). The Poisson standard errors are too small due to overdispersion.

Step 3: Test for Overdispersion

1# Cameron-Trivedi test using AER package
2dispersiontest(poisson, trafo = 2)  # quadratic form
3
4# Also check deviance/df ratio
5cat("Deviance:", deviance(poisson), "\n")
6cat("df:", poisson$df.residual, "\n")
7cat("Deviance/df:", deviance(poisson) / poisson$df.residual, "\n")
8cat("(Values >> 1 indicate overdispersion)\n")

RequiresAER

Expected output:

Test	Value
Cameron-Trivedi alpha estimate	~0.5–1.0
Cameron-Trivedi t-statistic	> 5.0
Cameron-Trivedi p-value	< 0.001
Poisson deviance	~3,000–6,000
Degrees of freedom	339
Deviance/df ratio	~8–15 (well above 1)
Conclusion	REJECT Poisson: significant overdispersion

The deviance/df ratio far exceeds 1, confirming the DGP's overdispersion (true alpha = 0.8). The Poisson variance assumption Var(Y) = E(Y) is severely violated.

Concept Check

The deviance/df ratio from the Poisson model is much greater than 1. What are the consequences of using Poisson regression when the data are overdispersed?

The coefficient estimates are biased.The coefficient estimates are consistent but the standard errors are too small, leading to spuriously significant results.The model cannot be estimated at all.The predicted values will be negative for some observations.

Step 4: Negative Binomial Regression

1# Negative Binomial (NB2) regression
2negbin <- glm.nb(patents ~ log_rd + log_sales + sector_pharma +
3                sector_tech + sector_mfg + age, data = df)
4
5summary(negbin)
6cat("\nTheta (1/alpha):", negbin$theta, "\n")
7cat("Alpha (overdispersion):", 1/negbin$theta, "\n")
8
9# Compare models
10modelsummary(list("Poisson" = poisson, "NegBin" = negbin),
11           stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
12           gof_map = c("nobs", "AIC", "BIC", "logLik"))
13
14# LR test comparing NegBin to Poisson
15lr_stat <- 2 * (logLik(negbin) - logLik(poisson))
16cat("\nLR test (NegBin vs Poisson):", lr_stat, "\n")
17cat("p-value:", pchisq(as.numeric(lr_stat), df = 1, lower.tail = FALSE) / 2, "\n")
18cat("(halved for boundary test: alpha=0 is on the parameter space boundary)\n")

Requiresmodelsummary

Expected output:

Variable	Poisson Coef	Poisson SE	NegBin Coef	NegBin SE	True
Intercept	~-1.00	~0.15	~-1.00	~0.28	-1.0
log_rd	~0.60	~0.02	~0.60	~0.05	0.6
log_sales	~0.05	~0.02	~0.05	~0.04	0.05
sector_pharma	~0.30	~0.04	~0.30	~0.10	0.3
sector_tech	~0.20	~0.04	~0.20	~0.09	0.2
sector_mfg	~0.00	~0.04	~0.00	~0.09	0.0
age	~0.005	~0.001	~0.005	~0.003	0.005

Comparison	Poisson	NegBin
Log-likelihood	~-2500	~-1600
AIC	~5014	~3216
Alpha (dispersion)	—	~0.5–1.2 (true: 0.8)
SE on log_rd	~0.02	~0.05
SE ratio (NegBin/Poisson)	—	~2.5x

The LR test (Poisson vs. NegBin) at the bottom of the NegBin output should strongly reject the Poisson restriction (p < 0.001).

Step 5: Zero-Inflated Model

Many firms hold zero patents, possibly for structural reasons (they do not engage in patentable R&D at all). A zero-inflated model distinguishes "structural zeros" from "sampling zeros."

1# First-time setup: install.packages(c("pscl"))
2library(pscl)
3
4# Zero-Inflated Poisson (ZIP)
5zip_model <- zeroinfl(patents ~ log_rd + log_sales + sector_pharma +
6                     sector_tech + sector_mfg + age |
7                     log_rd + log_sales,  # inflation equation
8                    data = df, dist = "poisson")
9summary(zip_model)
10
11# Zero-Inflated Negative Binomial (ZINB)
12zinb_model <- zeroinfl(patents ~ log_rd + log_sales + sector_pharma +
13                      sector_tech + sector_mfg + age |
14                      log_rd + log_sales,
15                     data = df, dist = "negbin")
16summary(zinb_model)
17
18# Vuong test: refit Poisson as a single-component model for vuong()
19poisson_plain <- glm(patents ~ log_rd + log_sales + sector_pharma +
20                   sector_tech + sector_mfg + age,
21                   data = df, family = poisson)
22vuong(zip_model, poisson_plain)
23
24# Compare all models by AIC
25cat("\nAIC comparison:\n")
26cat("  Poisson:", AIC(poisson), "\n")
27cat("  NegBin:", AIC(negbin), "\n")
28cat("  ZIP:", AIC(zip_model), "\n")
29cat("  ZINB:", AIC(zinb_model), "\n")

Requirespscl

Expected output:

ZIP Count Equation	Coefficient	SE	z	p-value
Intercept	~-1.0	~0.20	~-5.0	<0.001
log_rd	~0.55	~0.03	~18.0	<0.001
log_sales	~0.05	~0.03	~1.7	~0.09
sector_pharma	~0.25	~0.06	~4.2	<0.001
sector_tech	~0.18	~0.05	~3.6	<0.001
sector_mfg	~0.00	~0.05	~0.0	~0.9
age	~0.005	~0.002	~2.5	0.012

ZIP Inflation (Zero) Equation	Coefficient	SE	z	p-value
Intercept	~1.5	~0.5	~3.0	0.003
log_rd	~-0.6	~0.15	~-4.0	<0.001
log_sales	~-0.3	~0.12	~-2.5	0.012

Model Comparison	AIC
Poisson	~5,014
NegBin	~3,216
ZIP	~3,500

The negative coefficient on log_rd in the inflation equation means firms with more R&D spending are less likely to be "structural zeros" (firms that never patent).

Concept Check

In the zero-inflated model, the inflation equation predicts which observations are 'structural zeros.' What does a negative coefficient on log_rd in the inflation equation mean?

Firms with more R&D are less likely to be in the 'always zero' group — they are more likely to be patent-producing firms.Firms with more R&D produce fewer patents.R&D has no effect on patenting.The model is misspecified because R&D should not appear in the inflation equation.

Step 6: Compare with Published Results

Summary of key comparisons:

Feature	HHG (1984)	Our Simulation
R&D elasticity (Poisson)	~0.67	Check your log_rd coefficient
R&D elasticity (NegBin)	~0.57	Check your log_rd coefficient
Overdispersion detected	Yes (strongly)	Check your deviance/df ratio
NegBin preferred over Poisson	Yes (LR test)	Check your LR test p-value
Variance/mean ratio	>> 1	Check your calculation

The key qualitative result is that (1) Poisson is rejected in favor of NegBin, (2) the R&D elasticity is positive and less than 1, and (3) accounting for overdispersion inflates standard errors but does not dramatically change point estimates.

Extension Exercises

Robust Poisson. Re-estimate the Poisson model with robust (sandwich) standard errors. Compare these standard errors with the NegBin standard errors. When are they similar? The Poisson PPML approach, advocated by Wooldridge, requires only correct specification of the conditional mean.
Panel data. The original HHG (1984) dataset is a panel. Simulate a panel version of the data (firms observed over 8 years) and estimate fixed-effects Poisson and random-effects Negative Binomial models. How do the R&D elasticity estimates change?
Hurdle model. Estimate a hurdle model that separates the decision to patent at all (logit) from how many patents to produce (truncated Poisson or NegBin). Compare its fit with the zero-inflated model. Which fits better, and why?
Functional form. HHG test different functional forms for the R&D-patents relationship. Try including R&D in levels rather than logs, or adding R&D-squared. Does the elasticity interpretation change? Plot the estimated patent-R&D relationship under different specifications.
Offset term. Add log(sales) as an offset rather than a covariate. This substitution changes the interpretation to patents per unit of sales (patent intensity). How does the R&D coefficient change?

Summary

In this replication lab you learned:

Poisson regression models count data with an exponential conditional mean; coefficients are log incidence rate ratios
Real-world count data often exhibits overdispersion (variance > mean), making the Poisson variance assumption too restrictive
The Negative Binomial model adds a dispersion parameter and produces correctly-sized standard errors
Poisson point estimates remain consistent under overdispersion provided the conditional mean is correctly specified as E[Y|X] = exp(Xbeta), but standard errors are too small — use robust (sandwich) SEs, which remain valid under quasi-maximum likelihood estimation even if the variance function is misspecified, or switch to NegBin
Zero-inflated models distinguish between "structural zeros" (firms that never patent) and "sampling zeros" (firms that could patent but did not in this period)
Our simulated results reproduce the key finding from Hausman et al. (1984): the R&D elasticity of patenting is in the 0.5-0.7 range, and Negative Binomial is often preferred in applied settings over Poisson

Overview#

Step 1: Generate the Simulated Dataset#

Step 2: Estimate the Poisson Model#

Step 3: Test for Overdispersion#

Step 4: Negative Binomial Regression#

Step 5: Zero-Inflated Model#

Step 6: Compare with Published Results#

Extension Exercises#

Summary#

Overview

Step 1: Generate the Simulated Dataset

Step 2: Estimate the Poisson Model

Step 3: Test for Overdispersion

Step 4: Negative Binomial Regression

Step 5: Zero-Inflated Model

Step 6: Compare with Published Results

Extension Exercises

Summary