MethodAtlas
Tutorial · 90 minutes

Lab: Poisson and Negative Binomial Regression

Model count data step by step. Learn to estimate Poisson and negative binomial regressions, test for overdispersion, compute incidence rate ratios, and choose between count models.

Overview

In this lab you will model firm-level patent counts using Poisson and negative binomial regressions. Patent data is a classic example of count data: non-negative integers with a right-skewed distribution and often more zeros than a Poisson distribution predicts. You will learn to diagnose overdispersion, compare models, and interpret results using incidence rate ratios.

What you will learn:

  • Why OLS is inappropriate for count outcomes
  • How to estimate a Poisson regression and interpret coefficients
  • How to test for overdispersion and estimate a negative binomial model
  • How to compute and interpret incidence rate ratios (IRRs)
  • How to compare Poisson and negative binomial models formally

Prerequisites: Familiarity with OLS regression and maximum likelihood estimation concepts.


Step 1: Simulate and Explore Patent Count Data

We simulate 2,000 firms with R&D spending, firm size, and age, generating patent counts from a negative binomial distribution (so overdispersion is present by design).

library(MASS)
library(modelsummary)
library(AER)

set.seed(42)
n <- 2000

log_rd <- rnorm(n, 3, 1.2)
log_emp <- rnorm(n, 5, 1.5)
firm_age <- round(runif(n, 1, 50))

xb <- -1.0 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age
mu <- exp(xb)

alpha <- 0.8
patents <- rnbinom(n, size = 1/alpha, mu = mu)

df <- data.frame(patents, log_rd, log_emp, firm_age)

cat("Mean:", mean(patents), "\n")
cat("Variance:", var(patents), "\n")
cat("Var/Mean:", var(patents)/mean(patents), "\n")
cat("Zeros:", sum(patents == 0), "\n")

par(mfrow = c(1, 2))
hist(patents, breaks = 50, main = "Patent Counts", xlab = "Patents")
hist(patents[patents > 0], breaks = 30,
   main = "Conditional on Patenting", xlab = "Patents")

Expected output:

Statistic            Value
Sample size          2,000 firms
Mean patents         ~8–15
Variance of patents  ~80–250
Var/Mean ratio       ~8–20 (severe overdispersion)
Zeros                ~150–300 (8–15%)
Max patents          ~80–150
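The size of the Var/Mean ratio follows directly from the NB2 variance function. A quick back-of-the-envelope check (the value mu = 10 is an assumed typical mean, taken from the range in the table above):

```r
# Under NB2, Var(y | mu) = mu + alpha * mu^2.
# At a typical mean of ~10 patents with alpha = 0.8:
mu    <- 10
alpha <- 0.8
v     <- mu + alpha * mu^2  # 10 + 80 = 90
v / mu                      # 9, consistent with the ~8-20 ratio above
```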

Sample data preview (first 5 rows):

firm  patents  log_rd  log_emp  firm_age
0     3        2.87    6.12     24
1     12       4.01    4.55     38
2     0        1.55    3.82     7
3     25       3.92    7.01     45
4     1        2.14    5.43     12

Step 2: Estimate a Poisson Regression

# Poisson regression
poisson <- glm(patents ~ log_rd + log_emp + firm_age,
             data = df, family = poisson())

summary(poisson)

# Interpretation
cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
  round(exp(coef(poisson)["log_rd"]), 4), "\n")
cat("i.e., a", round((exp(coef(poisson)["log_rd"]) - 1) * 100, 1),
  "% increase.\n")

Expected output:

Variable    Coefficient  SE      z       p-value  IRR (exp(coef))
Intercept   ~-1.0        ~0.10   ~-10.0  <0.001   ~0.37
log_rd      ~0.60        ~0.02   ~30.0   <0.001   ~1.82
log_emp     ~0.20        ~0.01   ~20.0   <0.001   ~1.22
firm_age    ~0.01        ~0.001  ~10.0   <0.001   ~1.01

The true DGP coefficients are: intercept = -1.0, log_rd = 0.6, log_emp = 0.2, firm_age = 0.01. The Poisson estimates should be close to these values. The IRR for log_rd of approximately 1.82 means a 1-unit increase in log(R&D) is associated with an 82% increase in expected patents. However, the Poisson standard errors are too small because they ignore overdispersion.
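The IRR-to-percent conversion described above is simple arithmetic and can be checked directly (using the true DGP coefficient of 0.6):

```r
# IRR = exp(coefficient); percent change = (IRR - 1) * 100
beta <- 0.60                 # Poisson coefficient on log_rd (true DGP value)
irr  <- exp(beta)
round(irr, 2)                # 1.82
round((irr - 1) * 100, 1)    # 82.2
```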


Step 3: Test for Overdispersion

# Overdispersion test from the AER package.
# trafo = 2 tests against the NB2 alternative Var(y) = mu + alpha * mu^2,
# matching the DGP, so the estimated alpha should be near 0.8.
disp_test <- dispersiontest(poisson, trafo = 2)
print(disp_test)

# Manual check: Pearson chi-squared / df
cat("\nPearson chi2 / df:", sum(residuals(poisson, type = "pearson")^2) /
  poisson$df.residual, "\n")
cat("(Should be ~1 under correct Poisson specification)\n")

Expected output:

Test                            Value
Cameron-Trivedi alpha estimate  ~0.5–1.0
Cameron-Trivedi t-statistic     > 5.0 (highly significant)
Cameron-Trivedi p-value         < 0.001
Pearson chi-squared / df        ~5–15 (well above 1)
Conclusion                      Reject H0: significant overdispersion detected

The Pearson chi-squared / df ratio far exceeds 1 (the Poisson expectation), confirming the variance is many times larger than the mean. This result is expected because the DGP uses a negative binomial with alpha = 0.8.
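What dispersiontest() does under the hood can be sketched with the Cameron-Trivedi auxiliary regression. The snippet below re-simulates the Step 1 data so it runs on its own:

```r
# Under NB2, Var(y) = mu + alpha * mu^2, so E[((y - mu)^2 - y) / mu] = alpha * mu.
# Regressing ((y - mu)^2 - y) / mu on mu with no intercept estimates alpha.
set.seed(42)
n        <- 2000
log_rd   <- rnorm(n, 3, 1.2)
log_emp  <- rnorm(n, 5, 1.5)
firm_age <- round(runif(n, 1, 50))
mu_true  <- exp(-1 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age)
patents  <- rnbinom(n, size = 1 / 0.8, mu = mu_true)

m   <- glm(patents ~ log_rd + log_emp + firm_age, family = poisson())
muh <- fitted(m)
z   <- ((patents - muh)^2 - patents) / muh
aux <- lm(z ~ 0 + muh)
coef(aux)   # slope estimates alpha (true value 0.8; expect sampling noise)
```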

Concept Check

Your Poisson regression has a Pearson chi-squared / df ratio of 8.5. What does this tell you?


Step 4: Estimate a Negative Binomial Model

The negative binomial model adds a parameter to capture overdispersion.

# Negative binomial regression
negbin <- glm.nb(patents ~ log_rd + log_emp + firm_age, data = df)
summary(negbin)

# Compare coefficients
cat("\n=== Coefficient Comparison ===\n")
cat("Poisson coef on log_rd:", round(coef(poisson)["log_rd"], 4),
  " SE:", round(summary(poisson)$coefficients["log_rd", "Std. Error"], 4), "\n")
cat("NegBin coef on log_rd:", round(coef(negbin)["log_rd"], 4),
  " SE:", round(summary(negbin)$coefficients["log_rd", "Std. Error"], 4), "\n")

cat("\nTheta (1/alpha):", negbin$theta, "\n")
cat("Notice: NegBin SEs are larger (correctly reflecting overdispersion)\n")

Expected output:

Variable    Poisson Coef  Poisson SE  NegBin Coef  NegBin SE  True
Intercept   ~-1.00        ~0.10       ~-1.00       ~0.18      -1.0
log_rd      ~0.60         ~0.02       ~0.60        ~0.04      0.6
log_emp     ~0.20         ~0.01       ~0.20        ~0.03      0.2
firm_age    ~0.01         ~0.001      ~0.01        ~0.002     0.01

Model comparison     Poisson   NegBin
Log-likelihood       ~-8500    ~-6500
AIC                  ~17008    ~13010
Alpha (dispersion)   (n/a)     ~0.5–1.2 (true: 0.8)

Notice that the point estimates are nearly identical, but the NegBin standard errors are roughly 1.5–2.5 times larger than the Poisson SEs, correctly reflecting the extra variability from overdispersion.
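The rough size of this inflation can be anticipated from the dispersion factor: quasi-Poisson scales the Poisson SEs by the square root of the Pearson chi-squared / df ratio (a heuristic, since the NB weighting differs, but it gives the right order of magnitude):

```r
# Quasi-Poisson SE = sqrt(phi) * Poisson SE, where phi = Pearson chi2 / df.
phi <- c(5, 10, 15)   # dispersion factors in the range seen in Step 3
round(sqrt(phi), 1)   # 2.2, 3.2, 3.9: approximate SE inflation factors
```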


Step 5: Incidence Rate Ratios

Exponentiating the coefficients gives incidence rate ratios (IRRs), which are the natural way to interpret count model results.

# Incidence Rate Ratios
irr <- exp(coef(negbin))
irr_ci <- exp(confint(negbin))

cat("=== Incidence Rate Ratios ===\n")
irr_table <- cbind(IRR = irr, irr_ci)
print(round(irr_table, 4))

cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
  round(irr["log_rd"], 3), "\n")

Expected output:

Variable  IRR (exp(coef))  95% CI Lower  95% CI Upper  Interpretation
log_rd    ~1.82            ~1.68         ~1.97         +82% patents per unit log(R&D)
log_emp   ~1.22            ~1.15         ~1.30         +22% patents per unit log(emp)
firm_age  ~1.01            ~1.006        ~1.014        +1% patents per year of age

A 1-unit increase in log(R&D) — equivalent to a roughly 2.7-fold increase in R&D spending — is associated with 82% more expected patents. Since the covariate is in logs, the coefficient (0.60) is approximately the elasticity: a 1% increase in R&D spending is associated with a 0.60% increase in expected patents.
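The elasticity interpretation can be verified numerically (again using the true DGP coefficient of 0.6):

```r
beta <- 0.60
# A 1% increase in R&D multiplies expected patents by 1.01^beta:
round(1.01^beta, 5)   # 1.00599, i.e. about +0.6%
# A full 1-unit increase in log(R&D) is an e-fold (~2.72x) increase in R&D,
# and corresponds to the full IRR:
round(exp(1), 2)      # 2.72
round(exp(beta), 2)   # 1.82
```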


Step 6: Model Comparison and Diagnostics

# Likelihood ratio test
# Likelihood ratio test of alpha = 0 (Poisson nested in NegBin).
# alpha = 0 lies on the boundary of the parameter space, so the
# p-value is halved (a 50:50 mixture of chi2(0) and chi2(1)).
lr_stat <- as.numeric(2 * (logLik(negbin) - logLik(poisson)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2
cat("LR test stat:", lr_stat, "  p-value:", p_value, "\n")

# AIC/BIC
cat("\nAIC - Poisson:", AIC(poisson), " NegBin:", AIC(negbin), "\n")
cat("BIC - Poisson:", BIC(poisson), " NegBin:", BIC(negbin), "\n")

# Rootogram (requires countreg or topmodels)
# install.packages("topmodels", repos = "https://R-Forge.R-project.org")
# library(topmodels)
# rootogram(negbin)

Expected output:

Test                                   Value
LR test statistic (Poisson vs NegBin)  > 500
LR p-value (one-sided)                 < 0.001
AIC (Poisson)                          ~17,000
AIC (NegBin)                           ~13,000
BIC (Poisson)                          ~17,030
BIC (NegBin)                           ~13,040
Conclusion                             NegBin strongly preferred

Concept Check

A colleague suggests using OLS on patent counts (or log(patents+1)) instead of a count model. What are the main problems with this approach?


Step 7: Exercises

Try these on your own:

  1. Zero-inflated model. If you suspect there is a separate process determining whether a firm patents at all (e.g., some firms never patent regardless of R&D), estimate a zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB). Use statsmodels.discrete.count_model.ZeroInflatedPoisson (Python), pscl::zeroinfl (R), or zinb (Stata).

  2. Exposure variable. Add an offset for firm size: offset(log_emp). This offset models the patent rate (patents per employee) instead of the patent count. How do the coefficients change?

  3. Robust Poisson. Re-estimate the Poisson model with robust (sandwich) standard errors. Compare these SEs to the negative binomial SEs. This comparison is the "quasi-Poisson" approach used widely in trade and innovation economics.

  4. Prediction. For a firm with log_rd = 4, log_emp = 6, and firm_age = 20, compute the predicted patent count and its 95% prediction interval.
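A minimal sketch for Exercise 3, assuming the sandwich and lmtest packages are installed (the data are re-simulated here as in Step 1 so the snippet is self-contained):

```r
library(sandwich)   # robust covariance estimators
library(lmtest)     # coeftest()

set.seed(42)
n        <- 2000
log_rd   <- rnorm(n, 3, 1.2)
log_emp  <- rnorm(n, 5, 1.5)
firm_age <- round(runif(n, 1, 50))
mu       <- exp(-1 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age)
patents  <- rnbinom(n, size = 1 / 0.8, mu = mu)

m <- glm(patents ~ log_rd + log_emp + firm_age, family = poisson())

# Classical (too small) vs robust sandwich SEs:
coeftest(m)                    # model-based Poisson SEs
coeftest(m, vcov = sandwich)   # heteroskedasticity-robust SEs
```

Under overdispersion the robust SEs should be noticeably larger than the model-based ones, while the point estimates are identical.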


Summary

In this lab you learned:

  • Count data require specialized models because OLS and log-OLS can produce biased and inconsistent estimates
  • Poisson regression assumes Var(Y|X) = E(Y|X); test this with the variance-to-mean ratio or the Cameron-Trivedi test
  • When overdispersion is present, the negative binomial model provides correct standard errors by estimating an extra dispersion parameter
  • Incidence rate ratios (IRRs = exp(beta)) are the natural way to report results from count models
  • Poisson with robust standard errors (quasi-Poisson) is a practical middle ground: consistency requires only correct specification of the conditional mean E[Y|X] = exp(Xbeta), and the sandwich standard errors remain valid under quasi-maximum likelihood even if the variance function is misspecified
  • Comparing observed vs. predicted count distributions (rootograms) is a valuable visual diagnostic