MethodAtlas
replication120 minutes

Replication Lab: Hausman, Hall & Griliches (1984) Patents and R&D

Replicate the seminal Hausman, Hall & Griliches (1984) analysis of the relationship between R&D spending and patent counts. Estimate Poisson and Negative Binomial models, test for overdispersion, compare incidence rate ratios, and explore zero-inflated specifications.

Overview

Hausman et al.'s (1984) paper "Econometric Models for Count Data with an Application to the Patents-R&D Relationship" (Econometrica, 52(4), 909–938; DOI: 10.2307/1911191) is one of the foundational papers in count data econometrics. The paper develops econometric methods for modeling patent counts as a function of R&D expenditure and other firm characteristics, demonstrating that the Poisson model is often too restrictive and that Negative Binomial models better accommodate the overdispersion observed in patent data.

Key findings from HHG (1984):

  • The R&D elasticity of patenting is approximately 0.5-0.7 (a 10% increase in R&D is associated with a 5-7% increase in patents)
  • Patent counts exhibit significant overdispersion (variance far exceeds the mean)
  • The Negative Binomial model fits the data substantially better than Poisson
  • Firm size and sector matter for patent productivity

What you will learn:

  • How to estimate Poisson regression and interpret coefficients as log incidence rates
  • How to test for overdispersion and when to switch to Negative Binomial
  • How to compute and compare Incidence Rate Ratios (IRRs)
  • How to estimate zero-inflated models for excess zeros
  • How to assess model fit for count data

Prerequisites: OLS regression, maximum likelihood basics.


Step 1: Generate the Simulated Dataset

We simulate firm-level data with patent counts, R&D spending, firm size, and sector indicators.

library(MASS)
library(pscl)
library(modelsummary)
library(AER)

# Simulate firm-level patent data matching HHG (1984) patterns
set.seed(42)
n <- 346

log_rd <- rnorm(n, 3.5, 1.5)
log_sales <- log_rd + rnorm(n, 2.0, 0.8)
sector <- sample(c("pharma", "tech", "manufacturing", "other"),
               n, replace = TRUE, prob = c(0.25, 0.30, 0.30, 0.15))
age <- pmin(pmax(round(rnorm(n, 30, 15)), 3), 100)

sector_pharma <- as.integer(sector == "pharma")
sector_tech <- as.integer(sector == "tech")
sector_mfg <- as.integer(sector == "manufacturing")

log_mu <- -1.0 + 0.6 * log_rd + 0.05 * log_sales +
        0.3 * sector_pharma + 0.2 * sector_tech + 0.005 * age
mu <- exp(log_mu)

alpha <- 0.8
patents <- rnbinom(n, size = 1/alpha, mu = mu)

df <- data.frame(patents, log_rd, log_sales, sector, sector_pharma,
               sector_tech, sector_mfg, age, rd = exp(log_rd))

cat("Mean patents:", mean(df$patents), "\n")
cat("Variance:", var(df$patents), "\n")
cat("Var/Mean ratio:", var(df$patents) / mean(df$patents), "\n")
cat("Zeros:", sum(df$patents == 0), "\n")
summary(df)

Expected output:

StatisticValue
Sample size346 firms
Mean patents~10–20
Variance of patents~200–600
Var/Mean ratio~15–40 (severe overdispersion)
Zeros~30–60 (9–17%)
Max patents~100–250

Sample data preview (first 5 rows):

firmpatentslog_rdlog_salessectorage
083.245.85tech32
101.123.01other8
2455.187.42pharma41
332.894.76mfg25
4224.516.23tech38

Summary statistics by sector:

SectorNMean PatentsMean log_rdMean age
pharma~87~18~3.5~30
tech~104~15~3.5~30
manufacturing~104~12~3.5~30
other~52~10~3.5~30

Step 2: Estimate the Poisson Model

# Poisson regression
poisson <- glm(patents ~ log_rd + log_sales + sector_pharma +
             sector_tech + sector_mfg + age,
             data = df, family = poisson)

summary(poisson)

# Incidence Rate Ratios
cat("\nIncidence Rate Ratios:\n")
print(exp(cbind(IRR = coef(poisson), confint(poisson))))

cat("\nR&D elasticity:", coef(poisson)["log_rd"], "\n")

Expected output:

VariableCoefficientSEzp-valueIRR
Intercept~-1.0~0.15~-6.7<0.001
log_rd~0.60~0.02~30.0<0.001~1.82
log_sales~0.05~0.02~2.50.012~1.05
sector_pharma~0.30~0.04~7.5<0.001~1.35
sector_tech~0.20~0.04~5.0<0.001~1.22
sector_mfg~0.00~0.04~0.0~0.9~1.00
age~0.005~0.001~5.0<0.001~1.005

The R&D elasticity (coefficient on log_rd) of approximately 0.60 is within the 0.5–0.7 range reported in HHG (1984). The Poisson standard errors are too small due to overdispersion.


Step 3: Test for Overdispersion

# Cameron-Trivedi test using AER package
dispersiontest(poisson, trafo = 2)  # quadratic form

# Also check deviance/df ratio
cat("Deviance:", deviance(poisson), "\n")
cat("df:", poisson$df.residual, "\n")
cat("Deviance/df:", deviance(poisson) / poisson$df.residual, "\n")
cat("(Values >> 1 indicate overdispersion)\n")
RequiresAER

Expected output:

TestValue
Cameron-Trivedi alpha estimate~0.5–1.0
Cameron-Trivedi t-statistic> 5.0
Cameron-Trivedi p-value< 0.001
Poisson deviance~3,000–6,000
Degrees of freedom339
Deviance/df ratio~8–15 (well above 1)
ConclusionREJECT Poisson: significant overdispersion

The deviance/df ratio far exceeds 1, confirming the DGP's overdispersion (true alpha = 0.8). The Poisson variance assumption Var(Y) = E(Y) is severely violated.

Concept Check

The deviance/df ratio from the Poisson model is much greater than 1. What are the consequences of using Poisson regression when the data are overdispersed?


Step 4: Negative Binomial Regression

# Negative Binomial (NB2) regression
negbin <- glm.nb(patents ~ log_rd + log_sales + sector_pharma +
                sector_tech + sector_mfg + age, data = df)

summary(negbin)
cat("\nTheta (1/alpha):", negbin$theta, "\n")
cat("Alpha (overdispersion):", 1/negbin$theta, "\n")

# Compare models
modelsummary(list("Poisson" = poisson, "NegBin" = negbin),
           stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
           gof_map = c("nobs", "AIC", "BIC", "logLik"))

# LR test comparing NegBin to Poisson
lr_stat <- 2 * (logLik(negbin) - logLik(poisson))
cat("\nLR test (NegBin vs Poisson):", lr_stat, "\n")
cat("p-value:", pchisq(as.numeric(lr_stat), df = 1, lower.tail = FALSE), "\n")
Requiresmodelsummary

Expected output:

VariablePoisson CoefPoisson SENegBin CoefNegBin SETrue
Intercept~-1.00~0.15~-1.00~0.28-1.0
log_rd~0.60~0.02~0.60~0.050.6
log_sales~0.05~0.02~0.05~0.040.05
sector_pharma~0.30~0.04~0.30~0.100.3
sector_tech~0.20~0.04~0.20~0.090.2
sector_mfg~0.00~0.04~0.00~0.090.0
age~0.005~0.001~0.005~0.0030.005
ComparisonPoissonNegBin
Log-likelihood~-2500~-1600
AIC~5014~3216
Alpha (dispersion)~0.5–1.2 (true: 0.8)
SE on log_rd~0.02~0.05
SE ratio (NegBin/Poisson)~2.5x

The LR test (Poisson vs. NegBin) at the bottom of the NegBin output should strongly reject the Poisson restriction (p < 0.001).


Step 5: Zero-Inflated Model

Many firms hold zero patents, possibly for structural reasons (they do not engage in patentable R&D at all). A zero-inflated model distinguishes "structural zeros" from "sampling zeros."

library(pscl)

# Zero-Inflated Poisson (ZIP)
zip_model <- zeroinfl(patents ~ log_rd + log_sales + sector_pharma +
                     sector_tech + sector_mfg + age |
                     log_rd + log_sales,  # inflation equation
                    data = df, dist = "poisson")
summary(zip_model)

# Zero-Inflated Negative Binomial (ZINB)
zinb_model <- zeroinfl(patents ~ log_rd + log_sales + sector_pharma +
                      sector_tech + sector_mfg + age |
                      log_rd + log_sales,
                     data = df, dist = "negbin")
summary(zinb_model)

# Vuong test (ZIP vs Poisson)
vuong(zip_model, poisson)

# Compare all models by AIC
cat("\nAIC comparison:\n")
cat("  Poisson:", AIC(poisson), "\n")
cat("  NegBin:", AIC(negbin), "\n")
cat("  ZIP:", AIC(zip_model), "\n")
cat("  ZINB:", AIC(zinb_model), "\n")
Requirespscl

Expected output:

ZIP Count EquationCoefficientSEzp-value
Intercept~-1.0~0.20~-5.0<0.001
log_rd~0.55~0.03~18.0<0.001
log_sales~0.05~0.03~1.7~0.09
sector_pharma~0.25~0.06~4.2<0.001
sector_tech~0.18~0.05~3.6<0.001
sector_mfg~0.00~0.05~0.0~0.9
age~0.005~0.002~2.50.012
ZIP Inflation (Zero) EquationCoefficientSEzp-value
Intercept~1.5~0.5~3.00.003
log_rd~-0.6~0.15~-4.0<0.001
log_sales~-0.3~0.12~-2.50.012
Model ComparisonAIC
Poisson~5,014
NegBin~3,216
ZIP~3,500

The negative coefficient on log_rd in the inflation equation means firms with more R&D spending are less likely to be "structural zeros" (firms that never patent).

Concept Check

In the zero-inflated model, the inflation equation predicts which observations are 'structural zeros.' What does a negative coefficient on log_rd in the inflation equation mean?


Step 6: Compare with Published Results

Summary of key comparisons:

FeatureHHG (1984)Our Simulation
R&D elasticity (Poisson)~0.67Check your log_rd coefficient
R&D elasticity (NegBin)~0.57Check your log_rd coefficient
Overdispersion detectedYes (strongly)Check your deviance/df ratio
NegBin preferred over PoissonYes (LR test)Check your LR test p-value
Variance/mean ratio>> 1Check your calculation

The key qualitative result is that (1) Poisson is rejected in favor of NegBin, (2) the R&D elasticity is positive and less than 1, and (3) accounting for overdispersion inflates standard errors but does not dramatically change point estimates.


Extension Exercises

  1. Robust Poisson. Re-estimate the Poisson model with robust (sandwich) standard errors. Compare these standard errors with the NegBin standard errors. When are they similar? This approach, advocated by Wooldridge, requires only correct specification of the conditional mean.

  2. Panel data. The original HHG (1984) data is a panel. Simulate a panel version of the data (firms observed over 8 years) and estimate fixed-effects Poisson and random-effects Negative Binomial models. How do the R&D elasticity estimates change?

  3. Hurdle model. Estimate a hurdle model that separates the decision to patent at all (logit) from how many patents to produce (truncated Poisson or NegBin). Compare its fit with the zero-inflated model. Which fits better, and why?

  4. Functional form. HHG test different functional forms for the R&D-patents relationship. Try including R&D in levels rather than logs, or adding R&D-squared. Does the elasticity interpretation change? Plot the estimated patent-R&D relationship under different specifications.

  5. Offset term. Add log(sales) as an offset rather than a covariate. This changes the interpretation to patents per unit of sales (patent intensity). How does the R&D coefficient change?


Summary

In this replication lab you learned:

  • Poisson regression models count data with an exponential conditional mean; coefficients are log incidence rate ratios
  • Real-world count data often exhibits overdispersion (variance > mean), making the Poisson variance assumption too restrictive
  • The Negative Binomial model adds a dispersion parameter and produces correctly-sized standard errors
  • Poisson point estimates remain consistent under overdispersion provided the conditional mean is correctly specified as E[Y|X] = exp(Xbeta), but standard errors are too small — use robust (sandwich) SEs, which remain valid under quasi-maximum likelihood estimation even if the variance function is misspecified, or switch to NegBin
  • Zero-inflated models distinguish between "structural zeros" (firms that never patent) and "sampling zeros" (firms that could patent but did not in this period)
  • Our simulated results reproduce the key finding from HHG (1984): the R&D elasticity of patenting is in the 0.5-0.7 range, and Negative Binomial is often preferred in applied settings over Poisson