Lab: Poisson and Negative Binomial Regression
Model count data: estimate Poisson and negative binomial regressions, test overdispersion, compute incidence rate ratios, and choose between count models.
- Languages
- Python, R, Stata
- Dataset
- Simulated patent count data
Overview
In this lab you will model firm-level patent counts using Poisson and negative binomial regressions. Patent data are a classic example of count data: non-negative integers with a right-skewed distribution and often more zeros than a Poisson distribution predicts. You will learn to diagnose overdispersion, compare models, and interpret results using incidence rate ratios.
What you will learn:
- Why OLS is inappropriate for count outcomes
- How to estimate a Poisson regression and interpret coefficients
- How to test for overdispersion and estimate a negative binomial model
- How to compute and interpret incidence rate ratios (IRRs)
- How to compare Poisson and negative binomial models formally
Prerequisites: Familiarity with OLS regression and maximum likelihood estimation concepts.
Step 1: Simulate and Explore Patent Count Data
We simulate 2,000 firms with R&D spending, firm size, and age, generating patent counts from a negative binomial distribution (so overdispersion is present by design).
# First-time setup: install.packages(c("MASS", "modelsummary", "AER"))
library(MASS)
library(modelsummary)
library(AER)
set.seed(42)
n <- 2000
log_rd <- rnorm(n, 3, 1.2)
log_emp <- rnorm(n, 5, 1.5)
firm_age <- round(runif(n, 1, 50))
xb <- -1.0 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age
mu <- exp(xb)
alpha <- 0.8
patents <- rnbinom(n, size = 1/alpha, mu = mu)
df <- data.frame(patents, log_rd, log_emp, firm_age)
cat("Mean:", mean(patents), "\n")
cat("Variance:", var(patents), "\n")
cat("Var/Mean:", var(patents)/mean(patents), "\n")
cat("Zeros:", sum(patents == 0), "\n")
par(mfrow = c(1, 2))
hist(patents, breaks = 50, main = "Patent Counts", xlab = "Patents")
hist(patents[patents > 0], breaks = 30,
main = "Conditional on Patenting", xlab = "Patents")Expected output:
| Statistic | Value |
|---|---|
| Sample size | 2,000 firms |
| Mean patents | ~8–15 |
| Variance of patents | ~80–250 |
| Var/Mean ratio | ~8–20 (severe overdispersion) |
| Zeros | ~150–300 (8–15%) |
| Max patents | ~80–150 |
Sample data preview (first 5 rows):
| firm | patents | log_rd | log_emp | firm_age |
|---|---|---|---|---|
| 0 | 3 | 2.87 | 6.12 | 24 |
| 1 | 12 | 4.01 | 4.55 | 38 |
| 2 | 0 | 1.55 | 3.82 | 7 |
| 3 | 25 | 3.92 | 7.01 | 45 |
| 4 | 1 | 2.14 | 5.43 | 12 |
Step 2: Estimate a Poisson Regression
# Poisson regression
poisson <- glm(patents ~ log_rd + log_emp + firm_age,
data = df, family = poisson())
summary(poisson)
# Interpretation
cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
round(exp(coef(poisson)["log_rd"]), 4), "\n")
cat("i.e., a", round((exp(coef(poisson)["log_rd"]) - 1) * 100, 1),
"% increase.\n")Expected output:
| Variable | Coefficient | SE | z | p-value | IRR (exp(coef)) |
|---|---|---|---|---|---|
| Intercept | ~-1.0 | ~0.10 | ~-10.0 | <0.001 | — |
| log_rd | ~0.60 | ~0.02 | ~30.0 | <0.001 | ~1.82 |
| log_emp | ~0.20 | ~0.01 | ~20.0 | <0.001 | ~1.22 |
| firm_age | ~0.01 | ~0.001 | ~10.0 | <0.001 | ~1.01 |
The true DGP coefficients are: intercept = -1.0, log_rd = 0.6, log_emp = 0.2, firm_age = 0.01. The Poisson estimates should be close to these values. The IRR for log_rd of approximately 1.82 means a 1-unit increase in log(R&D) is associated with an 82% increase in expected patents. However, the Poisson standard errors are too small because they ignore overdispersion.
Step 3: Test for Overdispersion
# Overdispersion test from AER package
disp_test <- dispersiontest(poisson, trafo = 1)
print(disp_test)
# Manual check: Pearson chi-squared / df
cat("\nPearson chi2 / df:", sum(residuals(poisson, type = "pearson")^2) /
poisson$df.residual, "\n")
cat("(Should be ~1 under correct Poisson specification)\n")Expected output:
| Test | Value |
|---|---|
| Cameron-Trivedi alpha estimate | ~0.5–1.0 |
| Cameron-Trivedi t-statistic | > 5.0 (highly significant) |
| Cameron-Trivedi p-value | < 0.001 |
| Pearson chi-squared / df | ~5–15 (well above 1) |
| Conclusion | Reject H0: significant overdispersion detected |
The Pearson chi-squared / df ratio far exceeds 1 (the Poisson expectation), confirming the variance is many times larger than the mean. The overdispersion is expected because the DGP uses a negative binomial with alpha = 0.8.
Your Poisson regression has a Pearson chi-squared / df ratio of 8.5. What does this tell you?
Step 4: Estimate a Negative Binomial Model
The negative binomial model adds a parameter to capture overdispersion.
# Negative binomial regression
negbin <- glm.nb(patents ~ log_rd + log_emp + firm_age, data = df)
summary(negbin)
# Compare coefficients
cat("\n=== Coefficient Comparison ===\n")
cat("Poisson coef on log_rd:", round(coef(poisson)["log_rd"], 4),
" SE:", round(summary(poisson)$coefficients["log_rd", "Std. Error"], 4), "\n")
cat("NegBin coef on log_rd:", round(coef(negbin)["log_rd"], 4),
" SE:", round(summary(negbin)$coefficients["log_rd", "Std. Error"], 4), "\n")
cat("\nTheta (1/alpha):", negbin$theta, "\n")
cat("Notice: NegBin SEs are larger (correctly reflecting overdispersion)\n")Expected output:
| Variable | Poisson Coef | Poisson SE | NegBin Coef | NegBin SE | True |
|---|---|---|---|---|---|
| Intercept | ~-1.00 | ~0.10 | ~-1.00 | ~0.18 | -1.0 |
| log_rd | ~0.60 | ~0.02 | ~0.60 | ~0.04 | 0.6 |
| log_emp | ~0.20 | ~0.01 | ~0.20 | ~0.03 | 0.2 |
| firm_age | ~0.01 | ~0.001 | ~0.01 | ~0.002 | 0.01 |
| Model Comparison | Poisson | NegBin |
|---|---|---|
| Log-likelihood | ~-8500 | ~-6500 |
| AIC | ~17008 | ~13010 |
| Alpha (dispersion) | — | ~0.5–1.2 (true: 0.8) |
Notice that the point estimates are nearly identical, but the NegBin standard errors are roughly 1.5–2.5 times larger than the Poisson SEs, correctly reflecting the extra variability from overdispersion.
Step 5: Incidence Rate Ratios
Exponentiating the coefficients gives incidence rate ratios (IRRs), which are the natural way to interpret count model results.
# Incidence Rate Ratios
irr <- exp(coef(negbin))
irr_ci <- exp(confint(negbin))
cat("=== Incidence Rate Ratios ===\n")
irr_table <- cbind(IRR = irr, irr_ci)
print(round(irr_table, 4))
cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
round(irr["log_rd"], 3), "\n")Expected output:
| Variable | IRR (exp(coef)) | 95% CI Lower | 95% CI Upper | Interpretation |
|---|---|---|---|---|
| log_rd | ~1.82 | ~1.68 | ~1.97 | +82% patents per unit log(R&D) |
| log_emp | ~1.22 | ~1.15 | ~1.30 | +22% patents per unit log(emp) |
| firm_age | ~1.01 | ~1.006 | ~1.014 | +1% patents per year of age |
A 1-unit increase in log(R&D) — equivalent to a roughly 2.7-fold increase in R&D spending — is associated with 82% more expected patents. Since the covariate is in logs, the coefficient (0.60) is approximately the elasticity: a 1% increase in R&D spending is associated with a 0.60% increase in expected patents.
Step 6: Model Comparison and Diagnostics
# First-time setup: install.packages(c("topmodels"))
# Likelihood ratio test
lr_stat <- 2 * (logLik(negbin) - logLik(poisson))
p_value <- pchisq(as.numeric(lr_stat), df = 1, lower.tail = FALSE) / 2
cat("LR test stat:", lr_stat, " p-value:", p_value, "\n")
# AIC/BIC
cat("\nAIC - Poisson:", AIC(poisson), " NegBin:", AIC(negbin), "\n")
cat("BIC - Poisson:", BIC(poisson), " NegBin:", BIC(negbin), "\n")
# Rootogram (requires countreg or topmodels)
# install.packages("topmodels", repos = "https://R-Forge.R-project.org")
# library(topmodels)
# rootogram(negbin)Expected output:
| Test | Value |
|---|---|
| LR test statistic (Poisson vs NegBin) | > 500 |
| LR p-value (one-sided) | < 0.001 |
| AIC — Poisson | ~17,000 |
| AIC — NegBin | ~13,000 |
| BIC — Poisson | ~17,030 |
| BIC — NegBin | ~13,040 |
| Conclusion | NegBin strongly preferred |
A colleague suggests using OLS on patent counts (or log(patents+1)) instead of a count model. What are the main problems with this approach?
Step 7: Exercises
-
Zero-inflated model. If you suspect there is a separate process determining whether a firm patents at all (e.g., some firms never patent regardless of R&D), estimate a zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB). Use
statsmodels.discrete.count_model.ZeroInflatedPoisson(Python),pscl::zeroinfl(R), orzinb(Stata). -
Exposure variable. Add an offset for firm size:
offset(log_emp). This offset models the patent rate (patents per employee) instead of the patent count. How do the coefficients change? -
Robust Poisson. Re-estimate the Poisson model with robust (sandwich) standard errors. Compare these SEs to the negative binomial SEs. This comparison is the "quasi-Poisson" approach used widely in trade and innovation economics.
-
Prediction. For a firm with log_rd = 4, log_emp = 6, and firm_age = 20, compute the predicted patent count and its 95% prediction interval.
Summary
In this lab you learned:
- Count data require specialized models because OLS and log-OLS can produce biased and inconsistent estimates
- Poisson regression assumes Var(Y|X) = E(Y|X); test this with the variance-to-mean ratio or the Cameron-Trivedi test
- When overdispersion is present, the negative binomial model provides correct standard errors by estimating an extra dispersion parameter
- Incidence rate ratios (IRRs = exp(beta)) are the natural way to report results from count models
- Poisson with robust standard errors (quasi-Poisson) is a practical middle ground: Poisson regression requires correct specification of the conditional mean E[Y|X] = exp(Xbeta) for consistency, though standard errors remain valid under quasi-maximum likelihood estimation even if the variance function is misspecified
- Comparing observed vs. predicted count distributions (rootograms) is a valuable visual diagnostic