Lab: Poisson and Negative Binomial Regression
Model count data step by step. Learn to estimate Poisson and negative binomial regressions, test for overdispersion, compute incidence rate ratios, and choose between count models.
Overview
In this lab you will model firm-level patent counts using Poisson and negative binomial regressions. Patent data is a classic example of count data: non-negative integers with a right-skewed distribution and often more zeros than a Poisson distribution predicts. You will learn to diagnose overdispersion, compare models, and interpret results using incidence rate ratios.
What you will learn:
- Why OLS is inappropriate for count outcomes
- How to estimate a Poisson regression and interpret coefficients
- How to test for overdispersion and estimate a negative binomial model
- How to compute and interpret incidence rate ratios (IRRs)
- How to compare Poisson and negative binomial models formally
Prerequisites: Familiarity with OLS regression and maximum likelihood estimation concepts.
Step 1: Simulate and Explore Patent Count Data
We simulate 2,000 firms with R&D spending, firm size, and age, generating patent counts from a negative binomial distribution (so overdispersion is present by design).
```r
library(MASS)          # glm.nb()
library(modelsummary)  # regression tables
library(AER)           # dispersiontest()

set.seed(42)
n <- 2000

# Firm characteristics
log_rd   <- rnorm(n, 3, 1.2)     # log R&D spending
log_emp  <- rnorm(n, 5, 1.5)     # log employment (firm size)
firm_age <- round(runif(n, 1, 50))

# Linear predictor and conditional mean
xb <- -1.0 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age
mu <- exp(xb)

# Draw counts from a negative binomial so overdispersion is present by design
alpha <- 0.8                     # dispersion parameter
patents <- rnbinom(n, size = 1/alpha, mu = mu)
df <- data.frame(patents, log_rd, log_emp, firm_age)

cat("Mean:", mean(patents), "\n")
cat("Variance:", var(patents), "\n")
cat("Var/Mean:", var(patents)/mean(patents), "\n")
cat("Zeros:", sum(patents == 0), "\n")

par(mfrow = c(1, 2))
hist(patents, breaks = 50, main = "Patent Counts", xlab = "Patents")
hist(patents[patents > 0], breaks = 30,
     main = "Conditional on Patenting", xlab = "Patents")
```

Expected output:
| Statistic | Value |
|---|---|
| Sample size | 2,000 firms |
| Mean patents | ~8–15 |
| Variance of patents | ~80–250 |
| Var/Mean ratio | ~8–20 (severe overdispersion) |
| Zeros | ~150–300 (8–15%) |
| Max patents | ~80–150 |
Sample data preview (first 5 rows):
| firm | patents | log_rd | log_emp | firm_age |
|---|---|---|---|---|
| 1 | 3 | 2.87 | 6.12 | 24 |
| 2 | 12 | 4.01 | 4.55 | 38 |
| 3 | 0 | 1.55 | 3.82 | 7 |
| 4 | 25 | 3.92 | 7.01 | 45 |
| 5 | 1 | 2.14 | 5.43 | 12 |
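Before modeling, it is worth checking the overview's claim that the data contain more zeros than a Poisson with the same means would predict. A minimal sketch, assuming the simulated `df` and `mu` from the code above:

```r
# Observed share of zero-patent firms vs. the average P(Y = 0)
# implied by a Poisson with each firm's true mean mu
obs_zeros  <- mean(df$patents == 0)
pois_zeros <- mean(dpois(0, lambda = mu))
cat("Observed zero share:", round(obs_zeros, 3),
    "vs Poisson-implied:", round(pois_zeros, 3), "\n")
```

With the negative binomial data-generating process used here, the observed share should be noticeably larger than the Poisson-implied one.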
Step 2: Estimate a Poisson Regression
```r
# Poisson regression (`poisson` here names the fitted model; the family
# function poisson() on the right-hand side is resolved before assignment)
poisson <- glm(patents ~ log_rd + log_emp + firm_age,
               data = df, family = poisson())
summary(poisson)

# Interpretation: exp(coefficient) multiplies the expected count
cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
    round(exp(coef(poisson)["log_rd"]), 4), "\n")
cat("i.e., a", round((exp(coef(poisson)["log_rd"]) - 1) * 100, 1),
    "% increase.\n")
```

Expected output:
| Variable | Coefficient | SE | z | p-value | IRR (exp(coef)) |
|---|---|---|---|---|---|
| Intercept | ~-1.0 | ~0.10 | ~-10.0 | <0.001 | — |
| log_rd | ~0.60 | ~0.02 | ~30.0 | <0.001 | ~1.82 |
| log_emp | ~0.20 | ~0.01 | ~20.0 | <0.001 | ~1.22 |
| firm_age | ~0.01 | ~0.001 | ~10.0 | <0.001 | ~1.01 |
The true DGP coefficients are: intercept = -1.0, log_rd = 0.6, log_emp = 0.2, firm_age = 0.01. The Poisson estimates should be close to these values. The IRR for log_rd of approximately 1.82 means a 1-unit increase in log(R&D) is associated with an 82% increase in expected patents. However, the Poisson standard errors are too small because they ignore overdispersion.
Step 3: Test for Overdispersion
```r
# Cameron-Trivedi overdispersion test (AER); trafo = 1 tests the
# alternative Var(Y|X) = mu + alpha * mu against equidispersion
disp_test <- dispersiontest(poisson, trafo = 1)
print(disp_test)

# Manual check: Pearson chi-squared / residual df
cat("\nPearson chi2 / df:", sum(residuals(poisson, type = "pearson")^2) /
      poisson$df.residual, "\n")
cat("(Should be ~1 under correct Poisson specification)\n")
```

Expected output:
| Test | Value |
|---|---|
| Cameron-Trivedi alpha estimate | ~0.5–1.0 |
| Cameron-Trivedi t-statistic | > 5.0 (highly significant) |
| Cameron-Trivedi p-value | < 0.001 |
| Pearson chi-squared / df | ~5–15 (well above 1) |
| Conclusion | Reject H0: significant overdispersion detected |
The Pearson chi-squared / df ratio far exceeds 1 (the Poisson expectation), confirming the variance is many times larger than the mean. This result is expected because the DGP uses a negative binomial with alpha = 0.8.
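As a cross-check on the Pearson ratio, R's built-in `quasipoisson` family estimates essentially the same dispersion factor directly. A sketch, assuming the `df` data frame from Step 1:

```r
# Quasi-Poisson: same point estimates as Poisson, SEs inflated by
# sqrt(dispersion); the dispersion estimate should be close to the
# Pearson chi2 / df ratio computed manually above
qpois <- glm(patents ~ log_rd + log_emp + firm_age,
             data = df, family = quasipoisson())
summary(qpois)$dispersion
```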
Your Poisson regression has a Pearson chi-squared / df ratio of 8.5. What does this tell you?
Step 4: Estimate a Negative Binomial Model
The negative binomial model adds a parameter to capture overdispersion.
```r
# Negative binomial regression (MASS); theta = 1/alpha is estimated by ML
negbin <- glm.nb(patents ~ log_rd + log_emp + firm_age, data = df)
summary(negbin)

# Compare coefficients and standard errors
cat("\n=== Coefficient Comparison ===\n")
cat("Poisson coef on log_rd:", round(coef(poisson)["log_rd"], 4),
    " SE:", round(summary(poisson)$coefficients["log_rd", "Std. Error"], 4), "\n")
cat("NegBin coef on log_rd:", round(coef(negbin)["log_rd"], 4),
    " SE:", round(summary(negbin)$coefficients["log_rd", "Std. Error"], 4), "\n")
cat("\nTheta (1/alpha):", negbin$theta, "\n")
cat("Notice: NegBin SEs are larger (correctly reflecting overdispersion)\n")
```

Expected output:
| Variable | Poisson Coef | Poisson SE | NegBin Coef | NegBin SE | True |
|---|---|---|---|---|---|
| Intercept | ~-1.00 | ~0.10 | ~-1.00 | ~0.18 | -1.0 |
| log_rd | ~0.60 | ~0.02 | ~0.60 | ~0.04 | 0.6 |
| log_emp | ~0.20 | ~0.01 | ~0.20 | ~0.03 | 0.2 |
| firm_age | ~0.01 | ~0.001 | ~0.01 | ~0.002 | 0.01 |

| Model Comparison | Poisson | NegBin |
|---|---|---|
| Log-likelihood | ~-8500 | ~-6500 |
| AIC | ~17008 | ~13010 |
| Alpha (dispersion) | — | ~0.5–1.2 (true: 0.8) |
Notice that the point estimates are nearly identical, but the NegBin standard errors are roughly 1.5–2.5 times larger than the Poisson SEs, correctly reflecting the extra variability from overdispersion.
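Since `modelsummary` is already loaded in Step 1, a side-by-side comparison of the two fits can be produced as a single table. A sketch using the `poisson` and `negbin` objects:

```r
# One regression table: coefficients with SEs in parentheses,
# plus fit statistics (N, log-likelihood, AIC, BIC) underneath
modelsummary(list(Poisson = poisson, NegBin = negbin))
```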
Step 5: Incidence Rate Ratios
Exponentiating the coefficients gives incidence rate ratios (IRRs), which are the natural way to interpret count model results.
```r
# Incidence rate ratios: exponentiate coefficients and CI bounds
irr <- exp(coef(negbin))
irr_ci <- exp(confint(negbin))   # profile-likelihood intervals

cat("=== Incidence Rate Ratios ===\n")
irr_table <- cbind(IRR = irr, irr_ci)
print(round(irr_table, 4))

cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
    round(irr["log_rd"], 3), "\n")
```

Expected output:
| Variable | IRR (exp(coef)) | 95% CI Lower | 95% CI Upper | Interpretation |
|---|---|---|---|---|
| log_rd | ~1.82 | ~1.68 | ~1.97 | +82% patents per unit log(R&D) |
| log_emp | ~1.22 | ~1.15 | ~1.30 | +22% patents per unit log(emp) |
| firm_age | ~1.01 | ~1.006 | ~1.014 | +1% patents per year of age |
A 1-unit increase in log(R&D) — equivalent to a roughly 2.7-fold increase in R&D spending — is associated with 82% more expected patents. Since the covariate is in logs, the coefficient (0.60) is approximately the elasticity: a 1% increase in R&D spending is associated with a 0.60% increase in expected patents.
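The elasticity reading can be verified numerically. A sketch using the `negbin` fit from Step 4:

```r
# A 1% increase in R&D moves log_rd by log(1.01); the implied percent
# change in expected patents is exp(beta * log(1.01)) - 1, roughly beta%
beta <- coef(negbin)["log_rd"]
cat("Percent change in expected patents:",
    round((exp(beta * log(1.01)) - 1) * 100, 3), "\n")
```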
Step 6: Model Comparison and Diagnostics
```r
# Likelihood ratio test of Poisson vs. negative binomial. The p-value is
# halved because the null (alpha = 0) lies on the boundary of the parameter
# space, so the LR statistic follows a 50:50 mixture of chi-squared(0) and
# chi-squared(1) under H0.
lr_stat <- as.numeric(2 * (logLik(negbin) - logLik(poisson)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2
cat("LR test stat:", lr_stat, " p-value:", p_value, "\n")

# Information criteria
cat("\nAIC - Poisson:", AIC(poisson), " NegBin:", AIC(negbin), "\n")
cat("BIC - Poisson:", BIC(poisson), " NegBin:", BIC(negbin), "\n")

# Rootogram (requires the countreg or topmodels package)
# install.packages("topmodels", repos = "https://R-Forge.R-project.org")
# library(topmodels)
# rootogram(negbin)
```

Expected output:
| Test | Value |
|---|---|
| LR test statistic (Poisson vs NegBin) | ~4,000 (far beyond any conventional critical value) |
| LR p-value (one-sided) | < 0.001 |
| AIC — Poisson | ~17,000 |
| AIC — NegBin | ~13,000 |
| BIC — Poisson | ~17,030 |
| BIC — NegBin | ~13,040 |
| Conclusion | NegBin strongly preferred |
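If installing `topmodels` is inconvenient, a rough version of the observed-versus-predicted comparison behind a rootogram can be computed by hand. A sketch using the `negbin` fit from Step 4:

```r
# Observed frequencies vs. average NegBin-predicted probabilities for 0..15
counts <- 0:15
obs  <- sapply(counts, function(k) mean(df$patents == k))
pred <- sapply(counts, function(k)
  mean(dnbinom(k, size = negbin$theta, mu = fitted(negbin))))
round(cbind(count = counts, observed = obs, predicted = pred), 3)
```

If the model fits well, the observed and predicted columns should track each other closely, including at zero.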
A colleague suggests using OLS on patent counts (or log(patents+1)) instead of a count model. What are the main problems with this approach?
Step 7: Exercises
Try these on your own:
- **Zero-inflated model.** If you suspect there is a separate process determining whether a firm patents at all (e.g., some firms never patent regardless of R&D), estimate a zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB). Use `statsmodels.discrete.count_model.ZeroInflatedPoisson` (Python), `pscl::zeroinfl` (R), or `zinb` (Stata).
- **Exposure variable.** Add an offset for firm size: `offset(log_emp)`. This offset models the patent rate (patents per employee) instead of the patent count. How do the coefficients change?
- **Robust Poisson.** Re-estimate the Poisson model with robust (sandwich) standard errors and compare them to the negative binomial SEs. This Poisson pseudo-maximum-likelihood (PPML) approach is used widely in trade and innovation economics.
- **Prediction.** For a firm with `log_rd = 4`, `log_emp = 6`, and `firm_age = 20`, compute the predicted patent count and its 95% prediction interval.
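For the robust Poisson exercise, one possible starting point uses the `sandwich` and `lmtest` packages (an additional install); this is a sketch, not the only way to do it:

```r
# Poisson point estimates with heteroskedasticity-robust (sandwich) SEs
library(sandwich)
library(lmtest)
coeftest(poisson, vcov = vcovHC(poisson, type = "HC0"))
```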
Summary
In this lab you learned:
- Count data require specialized models because OLS and log-OLS can produce biased and inconsistent estimates
- Poisson regression assumes Var(Y|X) = E(Y|X); test this with the variance-to-mean ratio or the Cameron-Trivedi test
- When overdispersion is present, the negative binomial model provides correct standard errors by estimating an extra dispersion parameter
- Incidence rate ratios (IRRs = exp(beta)) are the natural way to report results from count models
- Poisson with robust (sandwich) standard errors, the Poisson pseudo-maximum-likelihood approach, is a practical middle ground: the point estimates are consistent whenever the conditional mean E[Y|X] = exp(Xbeta) is correctly specified, and the robust standard errors remain valid even if the variance function is misspecified
- Comparing observed vs. predicted count distributions (rootograms) is a valuable visual diagnostic