Lab: Cox Proportional Hazard Model from Scratch
Implement the Cox proportional hazard model step by step. Simulate survival data with right-censoring, fit the Cox PH model, interpret hazard ratios, test the proportional hazards assumption with Schoenfeld residuals, and plot Kaplan-Meier curves.
Overview
Survival analysis studies the time until an event occurs — CEO departure, firm failure, patent citation, or policy adoption. The Cox proportional hazard (PH) model is the workhorse estimator because it models how covariates shift the hazard rate without requiring assumptions about the baseline hazard shape.
What you will learn:
- How to simulate survival data with right-censoring
- How to plot and interpret Kaplan-Meier survival curves
- How to fit the Cox PH model and interpret hazard ratios
- Why right-censoring must be handled properly (and what goes wrong if you ignore it)
- How to test the proportional hazards assumption using Schoenfeld residuals
- How to compare survival curves across groups
Prerequisites: OLS regression, basic probability and distributions.
Step 1: Simulate Survival Data with Censoring
We simulate CEO tenure duration. Some CEOs are still in office when we observe them (right-censored). Ignoring censoring biases duration estimates downward.
library(survival)
library(survminer)
set.seed(42)
n <- 3000
# CEO and firm characteristics
founder <- rbinom(n, 1, 0.25) # Founder CEO (25%)
firm_size <- rnorm(n, mean = 7, sd = 1) # Log assets
board_ind <- runif(n, 0.3, 0.9) # Board independence (fraction)
prior_exp <- rpois(n, lambda = 2) # Prior CEO experience (count)
# True hazard ratios (HR > 1 means higher risk of departure)
# Founders stay longer (HR < 1), larger firms retain CEOs (HR < 1),
# more independent boards increase turnover (HR > 1)
true_hr_founder <- 0.5 # Founders have 50% lower hazard
true_hr_firm_size <- 0.85 # Larger firms: 15% lower hazard per unit
true_hr_board_ind <- 2.0 # More independent boards: 2x hazard
# Generate survival times from Weibull distribution
# The Cox model does not assume a specific distribution, but we need one for simulation
shape <- 1.5 # shape > 1: increasing hazard (CEOs more likely to leave over time)
scale <- 10 # baseline median tenure ~ 10 years
# Linear predictor (log-hazard scale)
lp <- log(true_hr_founder) * founder +
log(true_hr_firm_size) * firm_size +
log(true_hr_board_ind) * board_ind
# Weibull survival times with covariates
U <- runif(n)
tenure_true <- scale * (-log(U) * exp(-lp))^(1/shape)
# Right-censoring: observation window is 15 years
censor_time <- 15
tenure_obs <- pmin(tenure_true, censor_time)
event <- as.integer(tenure_true <= censor_time) # 1 = departed, 0 = censored
df <- data.frame(tenure = tenure_obs, event, founder, firm_size,
board_ind, prior_exp)
cat("Sample size:", n, "\n")
cat("Events (departures):", sum(event), "\n")
cat("Censored:", sum(1 - event), "\n")
cat("Censoring rate:", round(mean(1 - event), 3), "\n")
cat("\nMean observed tenure:", round(mean(tenure_obs), 2), "years\n")
cat("Mean true tenure (if fully observed):", round(mean(tenure_true), 2), "years\n")Expected output:
| Statistic | Value |
|---|---|
| Sample size | 3,000 |
| Events (departures) | ~2,200–2,500 |
| Censored | ~500–800 |
| Censoring rate | ~0.18–0.27 |
| Mean observed tenure | ~6–8 years |
| Mean true tenure | ~8–10 years |
The mean observed tenure is shorter than the true mean because censored observations truncate the long tenures. Ignoring censoring would systematically underestimate CEO tenure.
Step 2: Kaplan-Meier Survival Curves
Before fitting the Cox model, visualize the survival function using the nonparametric Kaplan-Meier estimator. This step makes no assumptions about the hazard shape.
# Overall KM curve
surv_obj <- Surv(df$tenure, df$event)
km_fit <- survfit(surv_obj ~ 1)
cat("Median survival time:", summary(km_fit)$table["median"], "years\n")
# KM curves by founder status
km_founder <- survfit(surv_obj ~ df$founder)
ggsurvplot(km_founder,
data = df,
legend.labs = c("Non-founder", "Founder"),
xlab = "Tenure (years)",
ylab = "Survival probability",
title = "CEO Tenure: Founder vs. Non-founder",
risk.table = TRUE)
# Log-rank test: is there a statistically significant difference?
lr_test <- survdiff(surv_obj ~ df$founder)
cat("\nLog-rank test p-value:", 1 - pchisq(lr_test$chisq, df = 1), "\n")Expected output:
| Group | Median Tenure (years) |
|---|---|
| All CEOs | ~7–9 |
| Founders | ~10–13 |
| Non-founders | ~6–8 |
The Kaplan-Meier curves should show that founder CEOs have substantially longer tenure (higher survival probability at every time point). The log-rank test should strongly reject the null of equal survival curves.
A CEO is still in office at the end of your 15-year observation window. In your dataset, this observation is right-censored. What does this tell us about this CEO's true tenure?
Step 3: Fit the Cox Proportional Hazard Model
The Cox PH model estimates how covariates shift the hazard rate multiplicatively, without specifying the baseline hazard function.
# Fit Cox PH model
cox_model <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind + prior_exp,
data = df)
summary(cox_model)
# Extract hazard ratios
cat("\n=== Hazard Ratios ===\n")
hr <- exp(coef(cox_model))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "founder", hr["founder"], true_hr_founder))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "firm_size", hr["firm_size"], true_hr_firm_size))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "board_ind", hr["board_ind"], true_hr_board_ind))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "prior_exp", hr["prior_exp"], 1.0))Expected output:
| Variable | Estimated HR | True HR | Interpretation |
|---|---|---|---|
| founder | ~0.50 | 0.50 | Founders have 50% lower departure hazard |
| firm_size | ~0.85 | 0.85 | Larger firms: 15% lower hazard per log-asset unit |
| board_ind | ~2.00 | 2.00 | Independent boards double the departure hazard |
| prior_exp | ~1.00 | 1.00 | No true effect (should be insignificant) |
The Cox model recovers hazard ratios close to the true values. A hazard ratio below 1 means lower risk of the event (longer survival); above 1 means higher risk (shorter survival).
Step 4: Interpret Hazard Ratios Carefully
# Predicted survival curves for specific profiles
# Profile 1: Non-founder, average firm, average board
# Profile 2: Founder, average firm, average board
new_data <- data.frame(
founder = c(0, 1),
firm_size = rep(mean(df$firm_size), 2),
board_ind = rep(mean(df$board_ind), 2),
prior_exp = rep(median(df$prior_exp), 2)
)
surv_pred <- survfit(cox_model, newdata = new_data)
cat("Predicted median tenure:\n")
cat(" Non-founder:", summary(surv_pred)$table[1, "median"], "years\n")
cat(" Founder: ", summary(surv_pred)$table[2, "median"], "years\n")
# Predicted 5-year and 10-year survival probabilities
s5 <- summary(surv_pred, times = 5)$surv
s10 <- summary(surv_pred, times = 10)$surv
cat("\n5-year survival probability:\n")
cat(" Non-founder:", round(s5[1], 3), "\n")
cat(" Founder: ", round(s5[2], 3), "\n")
cat("\n10-year survival probability:\n")
cat(" Non-founder:", round(s10[1], 3), "\n")
cat(" Founder: ", round(s10[2], 3), "\n")Expected output:
| Profile | Median Tenure | 5-year Survival | 10-year Survival |
|---|---|---|---|
| Non-founder (avg) | ~6–8 years | ~0.55–0.65 | ~0.20–0.35 |
| Founder (avg) | ~10–13 years | ~0.75–0.85 | ~0.45–0.60 |
Founders have substantially longer predicted tenure and higher survival probabilities at every time horizon, consistent with the hazard ratio of 0.50.
Step 5: Test the Proportional Hazards Assumption
The key assumption of the Cox model is that hazard ratios are constant over time. If the effect of a covariate changes over time, the PH assumption is violated.
# Schoenfeld residual test for PH assumption
ph_test <- cox.zph(cox_model)
print(ph_test)
cat("\nInterpretation:\n")
cat("p < 0.05 means the PH assumption is violated for that variable.\n")
cat("A non-significant global test means the PH assumption holds overall.\n")
# Plot Schoenfeld residuals over time for founder
# If PH holds, the smoothed line should be approximately flat
plot(ph_test[1], main = "Schoenfeld Residuals: Founder")
abline(h = coef(cox_model)["founder"], col = "red", lty = 2)Expected output:
| Variable | Test Statistic | p-value | PH holds? |
|---|---|---|---|
| founder | ~0.5–2.0 | > 0.05 | Yes |
| firm_size | ~0.5–2.0 | > 0.05 | Yes |
| board_ind | ~0.5–2.0 | > 0.05 | Yes |
| prior_exp | ~0.1–1.0 | > 0.05 | Yes |
| GLOBAL | ~2.0–5.0 | > 0.05 | Yes |
Since we simulated data with constant hazard ratios, the PH assumption should hold (all p-values > 0.05). The Schoenfeld residual plots should show approximately flat smoothed lines.
The Schoenfeld residual test shows a statistically significant result (p = 0.01) for the variable 'board_ind'. What does this mean and what should you do?
Step 6: What Happens If You Ignore Censoring
Let us demonstrate the bias that occurs when censoring is mishandled.
# Wrong approach 1: Treat censored as events (code as departed at censoring time)
df_wrong1 <- df
df_wrong1$event <- 1 # Pretend all CEOs departed
cox_wrong1 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
data = df_wrong1)
# Wrong approach 2: Drop censored observations
df_wrong2 <- df[df$event == 1, ]
cox_wrong2 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
data = df_wrong2)
# Correct approach
cox_correct <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
data = df)
cat("=== Hazard Ratios: Different Approaches ===\n")
cat(sprintf("%-20s %10s %10s %10s %10s\n",
"Variable", "Correct", "No Censor", "Drop Cens", "True HR"))
vars <- c("founder", "firm_size", "board_ind")
true_hrs <- c(true_hr_founder, true_hr_firm_size, true_hr_board_ind)
for (i in seq_along(vars)) {
cat(sprintf("%-20s %10.3f %10.3f %10.3f %10.3f\n",
vars[i],
exp(coef(cox_correct)[vars[i]]),
exp(coef(cox_wrong1)[vars[i]]),
exp(coef(cox_wrong2)[vars[i]]),
true_hrs[i]))
}Expected output:
| Variable | Correct HR | Ignore Censoring | Drop Censored | True HR |
|---|---|---|---|---|
| founder | ~0.50 | Biased | Biased | 0.50 |
| firm_size | ~0.85 | Biased | Biased | 0.85 |
| board_ind | ~2.00 | Biased | Biased | 2.00 |
Both wrong approaches produce biased hazard ratios. Treating censored observations as events underestimates the protective effect of founder status (biases the HR toward 1). Dropping censored observations creates selection bias by removing the longest-surviving CEOs.
Step 7: Model Diagnostics
# Concordance index (C-statistic)
# Measures discrimination: probability that a subject with higher
# predicted risk actually experiences the event sooner
cat("Concordance (C-index):", cox_model$concordance["concordance"], "\n")
cat("(0.5 = random, 1.0 = perfect discrimination)\n")
# Martingale residuals: check for nonlinearity
# Plot against continuous covariates; should show no pattern
mart_resid <- residuals(cox_model, type = "martingale")
plot(df$firm_size, mart_resid,
xlab = "Firm Size (log assets)", ylab = "Martingale Residual",
main = "Check for Nonlinearity: Firm Size")
lines(lowess(df$firm_size, mart_resid), col = "red", lwd = 2)
abline(h = 0, lty = 2)
# Deviance residuals: check for outliers
dev_resid <- residuals(cox_model, type = "deviance")
cat("\nDeviance residual range:", range(dev_resid), "\n")
cat("Observations with |deviance resid| > 3:", sum(abs(dev_resid) > 3), "\n")Expected output:
| Diagnostic | Value | Interpretation |
|---|---|---|
| C-index | ~0.65–0.75 | Reasonable discrimination |
| PH test (global) | p > 0.05 | PH assumption holds |
| Deviance residual range | ~[-2.5, 3.0] | No extreme outliers |
Step 8: Exercises
Interpreting Hazard Ratios
You estimate a Cox PH model of CEO tenure. The hazard ratio for 'board_ind' (board independence, ranging from 0 to 1) is 2.0, and the hazard ratio for 'founder' (binary, 0 or 1) is 0.5.
-
Time-varying covariates. Modify the simulation so that board independence increases midway through a CEO's tenure. Fit a model with time-varying board_ind.
-
Stratified Cox model. If the PH assumption fails for founder status, stratify the model by founder. This specification allows each group to have its own baseline hazard.
-
Parametric alternatives. Fit a Weibull or exponential model to the same data. Compare the hazard ratios with the Cox model. Since we simulated from a Weibull, the parametric model should be more efficient.
-
Competing risks. In practice, a CEO can depart for different reasons (forced out, retirement, moving to another firm). Modify the simulation to include competing risks and estimate cause-specific hazard models.
✓Key Takeaways
- Survival analysis studies the time until an event occurs, properly handling right-censoring (observations where the event has not yet occurred)
- The Kaplan-Meier estimator provides nonparametric survival curve estimates; the log-rank test compares curves across groups
- The Cox PH model estimates how covariates multiplicatively shift the hazard rate without assuming a specific baseline hazard shape
- Hazard ratios (HR) are the key output: HR < 1 means lower risk (longer survival), HR > 1 means higher risk (shorter survival)
- The proportional hazards assumption — that HRs are constant over time — should be tested using Schoenfeld residuals
- Ignoring censoring (treating censored as events or dropping them) produces biased estimates
- The concordance index (C-index) measures how well the model discriminates between individuals who experience the event sooner versus later
- Always report the number of events, censoring rate, and results of the PH assumption test