Lab·tutorial·7 min read

tutorial90 minutes

Lab: Poisson and Negative Binomial Regression

Model count data: estimate Poisson and negative binomial regressions, test overdispersion, compute incidence rate ratios, and choose between count models.

Method: Poisson / Negative Binomial
Languages: Python, R, Stata
Dataset: Simulated patent count data

Overview

In this lab you will model firm-level patent counts using Poisson and negative binomial regressions. Patent data are a classic example of count data: non-negative integers with a right-skewed distribution and often more zeros than a Poisson distribution predicts. You will learn to diagnose overdispersion, compare models, and interpret results using incidence rate ratios.

What you will learn:

Why OLS is inappropriate for count outcomes
How to estimate a Poisson regression and interpret coefficients
How to test for overdispersion and estimate a negative binomial model
How to compute and interpret incidence rate ratios (IRRs)
How to compare Poisson and negative binomial models formally

Prerequisites: Familiarity with OLS regression and maximum likelihood estimation concepts.

Step 1: Simulate and Explore Patent Count Data

We simulate 2,000 firms with R&D spending, firm size, and age, generating patent counts from a negative binomial distribution (so overdispersion is present by design).

1# First-time setup: install.packages(c("MASS", "modelsummary", "AER"))
2library(MASS)
3library(modelsummary)
4library(AER)
5
6set.seed(42)
7n <- 2000
8
9log_rd <- rnorm(n, 3, 1.2)
10log_emp <- rnorm(n, 5, 1.5)
11firm_age <- round(runif(n, 1, 50))
12
13xb <- -1.0 + 0.6 * log_rd + 0.2 * log_emp + 0.01 * firm_age
14mu <- exp(xb)
15
16alpha <- 0.8
17patents <- rnbinom(n, size = 1/alpha, mu = mu)
18
19df <- data.frame(patents, log_rd, log_emp, firm_age)
20
21cat("Mean:", mean(patents), "\n")
22cat("Variance:", var(patents), "\n")
23cat("Var/Mean:", var(patents)/mean(patents), "\n")
24cat("Zeros:", sum(patents == 0), "\n")
25
26par(mfrow = c(1, 2))
27hist(patents, breaks = 50, main = "Patent Counts", xlab = "Patents")
28hist(patents[patents > 0], breaks = 30,
29   main = "Conditional on Patenting", xlab = "Patents")

RequiresMASS modelsummary AER

Expected output:

Statistic	Value
Sample size	2,000 firms
Mean patents	~8–15
Variance of patents	~80–250
Var/Mean ratio	~8–20 (severe overdispersion)
Zeros	~150–300 (8–15%)
Max patents	~80–150

Sample data preview (first 5 rows):

firm	patents	log_rd	log_emp	firm_age
0	3	2.87	6.12	24
1	12	4.01	4.55	38
2	0	1.55	3.82	7
3	25	3.92	7.01	45
4	1	2.14	5.43	12

Step 2: Estimate a Poisson Regression

1# Poisson regression
2poisson <- glm(patents ~ log_rd + log_emp + firm_age,
3             data = df, family = poisson())
4
5summary(poisson)
6
7# Interpretation
8cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
9  round(exp(coef(poisson)["log_rd"]), 4), "\n")
10cat("i.e., a", round((exp(coef(poisson)["log_rd"]) - 1) * 100, 1),
11  "% increase.\n")

Expected output:

Variable	Coefficient	SE	z	p-value	IRR (exp(coef))
Intercept	~-1.0	~0.10	~-10.0	<0.001	—
log_rd	~0.60	~0.02	~30.0	<0.001	~1.82
log_emp	~0.20	~0.01	~20.0	<0.001	~1.22
firm_age	~0.01	~0.001	~10.0	<0.001	~1.01

The true DGP coefficients are: intercept = -1.0, log_rd = 0.6, log_emp = 0.2, firm_age = 0.01. The Poisson estimates should be close to these values. The IRR for log_rd of approximately 1.82 means a 1-unit increase in log(R&D) is associated with an 82% increase in expected patents. However, the Poisson standard errors are too small because they ignore overdispersion.

Step 3: Test for Overdispersion

1# Overdispersion test from AER package
2disp_test <- dispersiontest(poisson, trafo = 1)
3print(disp_test)
4
5# Manual check: Pearson chi-squared / df
6cat("\nPearson chi2 / df:", sum(residuals(poisson, type = "pearson")^2) /
7  poisson$df.residual, "\n")
8cat("(Should be ~1 under correct Poisson specification)\n")

RequiresAER

Expected output:

Test	Value
Cameron-Trivedi alpha estimate	~0.5–1.0
Cameron-Trivedi t-statistic	> 5.0 (highly significant)
Cameron-Trivedi p-value	< 0.001
Pearson chi-squared / df	~5–15 (well above 1)
Conclusion	Reject H0: significant overdispersion detected

The Pearson chi-squared / df ratio far exceeds 1 (the Poisson expectation), confirming the variance is many times larger than the mean. The overdispersion is expected because the DGP uses a negative binomial with alpha = 0.8.

Concept Check

Your Poisson regression has a Pearson chi-squared / df ratio of 8.5. What does this tell you?

The model fits the data perfectly.The Poisson assumption that Var(Y) = E(Y) is violated — there is substantial overdispersion, meaning the Poisson standard errors are too small.The coefficients are biased.You need to add more covariates.

Step 4: Estimate a Negative Binomial Model

The negative binomial model adds a parameter to capture overdispersion.

1# Negative binomial regression
2negbin <- glm.nb(patents ~ log_rd + log_emp + firm_age, data = df)
3summary(negbin)
4
5# Compare coefficients
6cat("\n=== Coefficient Comparison ===\n")
7cat("Poisson coef on log_rd:", round(coef(poisson)["log_rd"], 4),
8  " SE:", round(summary(poisson)$coefficients["log_rd", "Std. Error"], 4), "\n")
9cat("NegBin coef on log_rd:", round(coef(negbin)["log_rd"], 4),
10  " SE:", round(summary(negbin)$coefficients["log_rd", "Std. Error"], 4), "\n")
11
12cat("\nTheta (1/alpha):", negbin$theta, "\n")
13cat("Notice: NegBin SEs are larger (correctly reflecting overdispersion)\n")

Expected output:

Variable	Poisson Coef	Poisson SE	NegBin Coef	NegBin SE	True
Intercept	~-1.00	~0.10	~-1.00	~0.18	-1.0
log_rd	~0.60	~0.02	~0.60	~0.04	0.6
log_emp	~0.20	~0.01	~0.20	~0.03	0.2
firm_age	~0.01	~0.001	~0.01	~0.002	0.01

Model Comparison	Poisson	NegBin
Log-likelihood	~-8500	~-6500
AIC	~17008	~13010
Alpha (dispersion)	—	~0.5–1.2 (true: 0.8)

Notice that the point estimates are nearly identical, but the NegBin standard errors are roughly 1.5–2.5 times larger than the Poisson SEs, correctly reflecting the extra variability from overdispersion.

Step 5: Incidence Rate Ratios

Exponentiating the coefficients gives incidence rate ratios (IRRs), which are the natural way to interpret count model results.

1# Incidence Rate Ratios
2irr <- exp(coef(negbin))
3irr_ci <- exp(confint(negbin))
4
5cat("=== Incidence Rate Ratios ===\n")
6irr_table <- cbind(IRR = irr, irr_ci)
7print(round(irr_table, 4))
8
9cat("\nA 1-unit increase in log(R&D) multiplies expected patents by",
10  round(irr["log_rd"], 3), "\n")

Expected output:

Variable	IRR (exp(coef))	95% CI Lower	95% CI Upper	Interpretation
log_rd	~1.82	~1.68	~1.97	+82% patents per unit log(R&D)
log_emp	~1.22	~1.15	~1.30	+22% patents per unit log(emp)
firm_age	~1.01	~1.006	~1.014	+1% patents per year of age

A 1-unit increase in log(R&D) — equivalent to a roughly 2.7-fold increase in R&D spending — is associated with 82% more expected patents. Since the covariate is in logs, the coefficient (0.60) is approximately the elasticity: a 1% increase in R&D spending is associated with a 0.60% increase in expected patents.

Step 6: Model Comparison and Diagnostics

1# First-time setup: install.packages(c("topmodels"))
2# Likelihood ratio test
3lr_stat <- 2 * (logLik(negbin) - logLik(poisson))
4p_value <- pchisq(as.numeric(lr_stat), df = 1, lower.tail = FALSE) / 2
5cat("LR test stat:", lr_stat, "  p-value:", p_value, "\n")
6
7# AIC/BIC
8cat("\nAIC - Poisson:", AIC(poisson), " NegBin:", AIC(negbin), "\n")
9cat("BIC - Poisson:", BIC(poisson), " NegBin:", BIC(negbin), "\n")
10
11# Rootogram (requires countreg or topmodels)
12# install.packages("topmodels", repos = "https://R-Forge.R-project.org")
13# library(topmodels)
14# rootogram(negbin)

Expected output:

Test	Value
LR test statistic (Poisson vs NegBin)	> 500
LR p-value (one-sided)	< 0.001
AIC — Poisson	~17,000
AIC — NegBin	~13,000
BIC — Poisson	~17,030
BIC — NegBin	~13,040
Conclusion	NegBin strongly preferred

Concept Check

A colleague suggests using OLS on patent counts (or log(patents+1)) instead of a count model. What are the main problems with this approach?

OLS cannot handle non-negative outcomes.OLS assumes normally distributed errors, which is inappropriate for counts. log(y+1) introduces bias and is not a consistent estimator of the conditional mean parameters.OLS requires more data.It does not matter — OLS and Poisson give identical results.

Step 7: Exercises

Zero-inflated model. If you suspect there is a separate process determining whether a firm patents at all (e.g., some firms never patent regardless of R&D), estimate a zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB). Use statsmodels.discrete.count_model.ZeroInflatedPoisson (Python), pscl::zeroinfl (R), or zinb (Stata).
Exposure variable. Add an offset for firm size: offset(log_emp). This offset models the patent rate (patents per employee) instead of the patent count. How do the coefficients change?
Robust Poisson. Re-estimate the Poisson model with robust (sandwich) standard errors. Compare these SEs to the negative binomial SEs. This comparison is the "quasi-Poisson" approach used widely in trade and innovation economics.
Prediction. For a firm with log_rd = 4, log_emp = 6, and firm_age = 20, compute the predicted patent count and its 95% prediction interval.

Expected output

If your code runs correctly, expect to see:

Poisson coefficient on log_rd: Around 0.55–0.65 (true value: 0.60), interpreted as an IRR of approximately 1.7–1.9 (a doubling of R&D increases patents by 70–90%)
Poisson coefficient on log_emp: Around 0.15–0.25 (true value: 0.20)
Overdispersion test: Cameron-Trivedi or deviance/df ratio well above 1 (true alpha = 0.8), rejecting equidispersion
NegBin coefficients: Similar point estimates to Poisson but with larger standard errors (reflecting the overdispersion correction)
NegBin dispersion parameter (alpha): Around 0.5–1.2 (true value: 0.8)
LR test (Poisson vs. NegBin): Strongly rejects Poisson in favor of NegBin (p < 0.001)
Rootogram: Poisson underpredicts zeros and overpredicts middle counts; NegBin fits observed distribution better
Sample size: 2,000 firms throughout

Summary

In this lab you learned:

Count data require specialized models because OLS and log-OLS can produce biased and inconsistent estimates
Poisson regression assumes Var(Y|X) = E(Y|X); test this with the variance-to-mean ratio or the Cameron-Trivedi test
When overdispersion is present, the negative binomial model provides correct standard errors by estimating an extra dispersion parameter
Incidence rate ratios (IRRs = exp(beta)) are the natural way to report results from count models
Poisson with robust standard errors (quasi-Poisson) is a practical middle ground: Poisson regression requires correct specification of the conditional mean E[Y|X] = exp(Xbeta) for consistency, though standard errors remain valid under quasi-maximum likelihood estimation even if the variance function is misspecified
Comparing observed vs. predicted count distributions (rootograms) is a valuable visual diagnostic

Overview#

Step 1: Simulate and Explore Patent Count Data#

Step 2: Estimate a Poisson Regression#

Step 3: Test for Overdispersion#

Step 4: Estimate a Negative Binomial Model#

Step 5: Incidence Rate Ratios#

Step 6: Model Comparison and Diagnostics#

Step 7: Exercises#

Summary#

Overview

Step 1: Simulate and Explore Patent Count Data

Step 2: Estimate a Poisson Regression

Step 3: Test for Overdispersion

Step 4: Estimate a Negative Binomial Model

Step 5: Incidence Rate Ratios

Step 6: Model Comparison and Diagnostics

Step 7: Exercises

Summary