MethodAtlas
Lab·tutorial·6 min read
tutorial90 minutes

Lab: Cox Proportional Hazard Model from Scratch

Implement the Cox proportional hazard model step by step. Simulate survival data with right-censoring, fit the Cox PH model, interpret hazard ratios, test the proportional hazards assumption with Schoenfeld residuals, and plot Kaplan-Meier curves.

LanguagesPython, R, Stata
DatasetCEO tenure duration (simulated)

Overview

Survival analysis studies the time until an event occurs — CEO departure, firm failure, patent citation, or policy adoption. The Cox proportional hazard (PH) model is the workhorse estimator because it models how covariates shift the hazard rate without requiring assumptions about the baseline hazard shape.

What you will learn:

  • How to simulate survival data with right-censoring
  • How to plot and interpret Kaplan-Meier survival curves
  • How to fit the Cox PH model and interpret hazard ratios
  • Why right-censoring must be handled properly (and what goes wrong if you ignore it)
  • How to test the proportional hazards assumption using Schoenfeld residuals
  • How to compare survival curves across groups

Prerequisites: OLS regression, basic probability and distributions.


Step 1: Simulate Survival Data with Censoring

We simulate CEO tenure duration. Some CEOs are still in office when we observe them (right-censored). Ignoring censoring biases duration estimates downward.

library(survival)
library(survminer)

set.seed(42)
n <- 3000

# CEO and firm characteristics
founder   <- rbinom(n, 1, 0.25)       # Founder CEO (25%)
firm_size <- rnorm(n, mean = 7, sd = 1) # Log assets
board_ind <- runif(n, 0.3, 0.9)       # Board independence (fraction)
prior_exp <- rpois(n, lambda = 2)      # Prior CEO experience (count)

# True hazard ratios (HR > 1 means higher risk of departure)
# Founders stay longer (HR < 1), larger firms retain CEOs (HR < 1),
# more independent boards increase turnover (HR > 1)
true_hr_founder   <- 0.5   # Founders have 50% lower hazard
true_hr_firm_size <- 0.85  # Larger firms: 15% lower hazard per unit
true_hr_board_ind <- 2.0   # More independent boards: 2x hazard

# Generate survival times from Weibull distribution
# The Cox model does not assume a specific distribution, but we need one for simulation
shape <- 1.5  # shape > 1: increasing hazard (CEOs more likely to leave over time)
scale <- 10   # baseline median tenure ~ 10 years

# Linear predictor (log-hazard scale)
lp <- log(true_hr_founder) * founder +
    log(true_hr_firm_size) * firm_size +
    log(true_hr_board_ind) * board_ind

# Weibull survival times with covariates
U <- runif(n)
tenure_true <- scale * (-log(U) * exp(-lp))^(1/shape)

# Right-censoring: observation window is 15 years
censor_time <- 15
tenure_obs  <- pmin(tenure_true, censor_time)
event       <- as.integer(tenure_true <= censor_time)  # 1 = departed, 0 = censored

df <- data.frame(tenure = tenure_obs, event, founder, firm_size,
               board_ind, prior_exp)

cat("Sample size:", n, "\n")
cat("Events (departures):", sum(event), "\n")
cat("Censored:", sum(1 - event), "\n")
cat("Censoring rate:", round(mean(1 - event), 3), "\n")
cat("\nMean observed tenure:", round(mean(tenure_obs), 2), "years\n")
cat("Mean true tenure (if fully observed):", round(mean(tenure_true), 2), "years\n")

Expected output:

StatisticValue
Sample size3,000
Events (departures)~2,200–2,500
Censored~500–800
Censoring rate~0.18–0.27
Mean observed tenure~6–8 years
Mean true tenure~8–10 years

The mean observed tenure is shorter than the true mean because censored observations truncate the long tenures. Ignoring censoring would systematically underestimate CEO tenure.


Step 2: Kaplan-Meier Survival Curves

Before fitting the Cox model, visualize the survival function using the nonparametric Kaplan-Meier estimator. This step makes no assumptions about the hazard shape.

# Overall KM curve
surv_obj <- Surv(df$tenure, df$event)
km_fit <- survfit(surv_obj ~ 1)
cat("Median survival time:", summary(km_fit)$table["median"], "years\n")

# KM curves by founder status
km_founder <- survfit(surv_obj ~ df$founder)
ggsurvplot(km_founder,
         data = df,
         legend.labs = c("Non-founder", "Founder"),
         xlab = "Tenure (years)",
         ylab = "Survival probability",
         title = "CEO Tenure: Founder vs. Non-founder",
         risk.table = TRUE)

# Log-rank test: is there a statistically significant difference?
lr_test <- survdiff(surv_obj ~ df$founder)
cat("\nLog-rank test p-value:", 1 - pchisq(lr_test$chisq, df = 1), "\n")

Expected output:

GroupMedian Tenure (years)
All CEOs~7–9
Founders~10–13
Non-founders~6–8

The Kaplan-Meier curves should show that founder CEOs have substantially longer tenure (higher survival probability at every time point). The log-rank test should strongly reject the null of equal survival curves.

Concept Check

A CEO is still in office at the end of your 15-year observation window. In your dataset, this observation is right-censored. What does this tell us about this CEO's true tenure?


Step 3: Fit the Cox Proportional Hazard Model

The Cox PH model estimates how covariates shift the hazard rate multiplicatively, without specifying the baseline hazard function.

# Fit Cox PH model
cox_model <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind + prior_exp,
                 data = df)
summary(cox_model)

# Extract hazard ratios
cat("\n=== Hazard Ratios ===\n")
hr <- exp(coef(cox_model))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "founder", hr["founder"], true_hr_founder))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "firm_size", hr["firm_size"], true_hr_firm_size))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "board_ind", hr["board_ind"], true_hr_board_ind))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "prior_exp", hr["prior_exp"], 1.0))

Expected output:

VariableEstimated HRTrue HRInterpretation
founder~0.500.50Founders have 50% lower departure hazard
firm_size~0.850.85Larger firms: 15% lower hazard per log-asset unit
board_ind~2.002.00Independent boards double the departure hazard
prior_exp~1.001.00No true effect (should be insignificant)

The Cox model recovers hazard ratios close to the true values. A hazard ratio below 1 means lower risk of the event (longer survival); above 1 means higher risk (shorter survival).


Step 4: Interpret Hazard Ratios Carefully

# Predicted survival curves for specific profiles
# Profile 1: Non-founder, average firm, average board
# Profile 2: Founder, average firm, average board
new_data <- data.frame(
founder   = c(0, 1),
firm_size = rep(mean(df$firm_size), 2),
board_ind = rep(mean(df$board_ind), 2),
prior_exp = rep(median(df$prior_exp), 2)
)

surv_pred <- survfit(cox_model, newdata = new_data)

cat("Predicted median tenure:\n")
cat("  Non-founder:", summary(surv_pred)$table[1, "median"], "years\n")
cat("  Founder:    ", summary(surv_pred)$table[2, "median"], "years\n")

# Predicted 5-year and 10-year survival probabilities
s5 <- summary(surv_pred, times = 5)$surv
s10 <- summary(surv_pred, times = 10)$surv
cat("\n5-year survival probability:\n")
cat("  Non-founder:", round(s5[1], 3), "\n")
cat("  Founder:    ", round(s5[2], 3), "\n")
cat("\n10-year survival probability:\n")
cat("  Non-founder:", round(s10[1], 3), "\n")
cat("  Founder:    ", round(s10[2], 3), "\n")

Expected output:

ProfileMedian Tenure5-year Survival10-year Survival
Non-founder (avg)~6–8 years~0.55–0.65~0.20–0.35
Founder (avg)~10–13 years~0.75–0.85~0.45–0.60

Founders have substantially longer predicted tenure and higher survival probabilities at every time horizon, consistent with the hazard ratio of 0.50.


Step 5: Test the Proportional Hazards Assumption

The key assumption of the Cox model is that hazard ratios are constant over time. If the effect of a covariate changes over time, the PH assumption is violated.

# Schoenfeld residual test for PH assumption
ph_test <- cox.zph(cox_model)
print(ph_test)

cat("\nInterpretation:\n")
cat("p < 0.05 means the PH assumption is violated for that variable.\n")
cat("A non-significant global test means the PH assumption holds overall.\n")

# Plot Schoenfeld residuals over time for founder
# If PH holds, the smoothed line should be approximately flat
plot(ph_test[1], main = "Schoenfeld Residuals: Founder")
abline(h = coef(cox_model)["founder"], col = "red", lty = 2)

Expected output:

VariableTest Statisticp-valuePH holds?
founder~0.5–2.0> 0.05Yes
firm_size~0.5–2.0> 0.05Yes
board_ind~0.5–2.0> 0.05Yes
prior_exp~0.1–1.0> 0.05Yes
GLOBAL~2.0–5.0> 0.05Yes

Since we simulated data with constant hazard ratios, the PH assumption should hold (all p-values > 0.05). The Schoenfeld residual plots should show approximately flat smoothed lines.

Concept Check

The Schoenfeld residual test shows a statistically significant result (p = 0.01) for the variable 'board_ind'. What does this mean and what should you do?


Step 6: What Happens If You Ignore Censoring

Let us demonstrate the bias that occurs when censoring is mishandled.

# Wrong approach 1: Treat censored as events (code as departed at censoring time)
df_wrong1 <- df
df_wrong1$event <- 1  # Pretend all CEOs departed
cox_wrong1 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                  data = df_wrong1)

# Wrong approach 2: Drop censored observations
df_wrong2 <- df[df$event == 1, ]
cox_wrong2 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                  data = df_wrong2)

# Correct approach
cox_correct <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                   data = df)

cat("=== Hazard Ratios: Different Approaches ===\n")
cat(sprintf("%-20s %10s %10s %10s %10s\n",
  "Variable", "Correct", "No Censor", "Drop Cens", "True HR"))
vars <- c("founder", "firm_size", "board_ind")
true_hrs <- c(true_hr_founder, true_hr_firm_size, true_hr_board_ind)
for (i in seq_along(vars)) {
cat(sprintf("%-20s %10.3f %10.3f %10.3f %10.3f\n",
    vars[i],
    exp(coef(cox_correct)[vars[i]]),
    exp(coef(cox_wrong1)[vars[i]]),
    exp(coef(cox_wrong2)[vars[i]]),
    true_hrs[i]))
}

Expected output:

VariableCorrect HRIgnore CensoringDrop CensoredTrue HR
founder~0.50BiasedBiased0.50
firm_size~0.85BiasedBiased0.85
board_ind~2.00BiasedBiased2.00

Both wrong approaches produce biased hazard ratios. Treating censored observations as events underestimates the protective effect of founder status (biases the HR toward 1). Dropping censored observations creates selection bias by removing the longest-surviving CEOs.


Step 7: Model Diagnostics

# Concordance index (C-statistic)
# Measures discrimination: probability that a subject with higher
# predicted risk actually experiences the event sooner
cat("Concordance (C-index):", cox_model$concordance["concordance"], "\n")
cat("(0.5 = random, 1.0 = perfect discrimination)\n")

# Martingale residuals: check for nonlinearity
# Plot against continuous covariates; should show no pattern
mart_resid <- residuals(cox_model, type = "martingale")
plot(df$firm_size, mart_resid,
   xlab = "Firm Size (log assets)", ylab = "Martingale Residual",
   main = "Check for Nonlinearity: Firm Size")
lines(lowess(df$firm_size, mart_resid), col = "red", lwd = 2)
abline(h = 0, lty = 2)

# Deviance residuals: check for outliers
dev_resid <- residuals(cox_model, type = "deviance")
cat("\nDeviance residual range:", range(dev_resid), "\n")
cat("Observations with |deviance resid| > 3:", sum(abs(dev_resid) > 3), "\n")

Expected output:

DiagnosticValueInterpretation
C-index~0.65–0.75Reasonable discrimination
PH test (global)p > 0.05PH assumption holds
Deviance residual range~[-2.5, 3.0]No extreme outliers

Step 8: Exercises

Guided Exercise

Interpreting Hazard Ratios

You estimate a Cox PH model of CEO tenure. The hazard ratio for 'board_ind' (board independence, ranging from 0 to 1) is 2.0, and the hazard ratio for 'founder' (binary, 0 or 1) is 0.5.

A 0.1 increase in board independence multiplies the departure hazard by what factor? (Hint: HR^0.1)

By what percentage is a founder's departure hazard lower than a non-founder's? (answer as a number, e.g., 50)

If the median tenure for non-founders is 7 years, would you expect the median tenure for founders to be (a) exactly 14 years, (b) more than 7 but not necessarily 14, or (c) less than 7? (answer a, b, or c)

  1. Time-varying covariates. Modify the simulation so that board independence increases midway through a CEO's tenure. Fit a model with time-varying board_ind.

  2. Stratified Cox model. If the PH assumption fails for founder status, stratify the model by founder. This specification allows each group to have its own baseline hazard.

  3. Parametric alternatives. Fit a Weibull or exponential model to the same data. Compare the hazard ratios with the Cox model. Since we simulated from a Weibull, the parametric model should be more efficient.

  4. Competing risks. In practice, a CEO can depart for different reasons (forced out, retirement, moving to another firm). Modify the simulation to include competing risks and estimate cause-specific hazard models.


Key Takeaways

  • Survival analysis studies the time until an event occurs, properly handling right-censoring (observations where the event has not yet occurred)
  • The Kaplan-Meier estimator provides nonparametric survival curve estimates; the log-rank test compares curves across groups
  • The Cox PH model estimates how covariates multiplicatively shift the hazard rate without assuming a specific baseline hazard shape
  • Hazard ratios (HR) are the key output: HR < 1 means lower risk (longer survival), HR > 1 means higher risk (shorter survival)
  • The proportional hazards assumption — that HRs are constant over time — should be tested using Schoenfeld residuals
  • Ignoring censoring (treating censored as events or dropping them) produces biased estimates
  • The concordance index (C-index) measures how well the model discriminates between individuals who experience the event sooner versus later
  • Always report the number of events, censoring rate, and results of the PH assumption test