Lab·tutorial·6 min read

Tutorial · 90 minutes

Lab: Cox Proportional Hazard Model from Scratch

Implement the Cox proportional hazard model step by step. Simulate survival data with right-censoring, fit the Cox PH model, interpret hazard ratios, test the proportional hazards assumption with Schoenfeld residuals, and plot Kaplan-Meier curves.

Method: Cox Proportional Hazard Model

Languages: Python, R, Stata

Dataset: CEO tenure duration (simulated)

Overview

Survival analysis studies the time until an event occurs — CEO departure, firm failure, patent citation, or policy adoption. The Cox proportional hazard (PH) model is the workhorse estimator because it models how covariates shift the hazard rate without requiring assumptions about the baseline hazard shape.
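In symbols, the model described above is usually written as follows. The notation is chosen here for illustration and does not come from the lab's code: $h_0(t)$ is the unspecified baseline hazard, $x_i$ is the covariate vector for subject $i$, and $\beta$ is the coefficient vector.

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(x_i^\top \beta\right),
\qquad
\mathrm{HR}_j = \frac{h(t \mid x_{ij} + 1)}{h(t \mid x_{ij})} = \exp(\beta_j)
```

The baseline hazard $h_0(t)$ cancels in the ratio, so the hazard ratio for a one-unit change in covariate $j$ does not depend on $t$. That time-invariance is the proportional hazards assumption.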

What you will learn:

How to simulate survival data with right-censoring
How to plot and interpret Kaplan-Meier survival curves
How to fit the Cox PH model and interpret hazard ratios
Why right-censoring must be handled properly (and what goes wrong if you ignore it)
How to test the proportional hazards assumption using Schoenfeld residuals
How to compare survival curves across groups

Prerequisites: OLS regression, basic probability and distributions.

Step 1: Simulate Survival Data with Censoring

We simulate CEO tenure duration. Some CEOs are still in office when we observe them (right-censored). Ignoring censoring biases duration estimates downward.

library(survival)
library(survminer)

set.seed(42)
n <- 3000

# CEO and firm characteristics
founder   <- rbinom(n, 1, 0.25)         # Founder CEO (25%)
firm_size <- rnorm(n, mean = 7, sd = 1) # Log assets
board_ind <- runif(n, 0.3, 0.9)         # Board independence (fraction)
prior_exp <- rpois(n, lambda = 2)       # Prior CEO experience (count, no true effect)

# True hazard ratios (HR > 1 means higher risk of departure)
# Founders stay longer (HR < 1), larger firms retain CEOs (HR < 1),
# more independent boards increase turnover (HR > 1)
true_hr_founder   <- 0.5   # Founders have 50% lower hazard
true_hr_firm_size <- 0.85  # Larger firms: 15% lower hazard per unit
true_hr_board_ind <- 2.0   # Board independence 0 -> 1: 2x hazard

# Generate survival times from a Weibull distribution
# The Cox model does not assume a specific distribution, but we need one for simulation
shape <- 1.5  # shape > 1: increasing hazard (CEOs more likely to leave over time)
scale <- 10   # Weibull scale; baseline median = scale * log(2)^(1/shape), about 7.8 years

# Linear predictor (log-hazard scale)
# Continuous covariates are centered at their population means (7 and 0.6) so that
# the baseline describes a non-founder at an average firm. Centering shifts only the
# baseline hazard; the hazard ratios are unchanged.
lp <- log(true_hr_founder) * founder +
  log(true_hr_firm_size) * (firm_size - 7) +
  log(true_hr_board_ind) * (board_ind - 0.6)

# Weibull survival times with covariates (inverse-transform sampling)
U <- runif(n)
tenure_true <- scale * (-log(U) * exp(-lp))^(1/shape)

# Right-censoring: observation window is 15 years
censor_time <- 15
tenure_obs  <- pmin(tenure_true, censor_time)
event       <- as.integer(tenure_true <= censor_time)  # 1 = departed, 0 = censored

df <- data.frame(tenure = tenure_obs, event, founder, firm_size,
                 board_ind, prior_exp)

cat("Sample size:", n, "\n")
cat("Events (departures):", sum(event), "\n")
cat("Censored:", sum(1 - event), "\n")
cat("Censoring rate:", round(mean(1 - event), 3), "\n")
cat("\nMean observed tenure:", round(mean(tenure_obs), 2), "years\n")
cat("Mean true tenure (if fully observed):", round(mean(tenure_true), 2), "years\n")

Expected output:

Statistic	Value
Sample size	3,000
Events (departures)	~2,200–2,500
Censored	~500–800
Censoring rate	~0.18–0.27
Mean observed tenure	~8–10 years
Mean true tenure	~10–11 years

The mean observed tenure is shorter than the true mean because censored observations truncate the long tenures. Ignoring censoring would systematically underestimate CEO tenure.
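The inverse-transform step and the effect of censoring on the mean can also be checked outside R. The following is a minimal standard-library Python sketch, not the lab's own code. It assumes the same Weibull shape (1.5), scale (10), and 15-year window, but keeps only the founder covariate for brevity; the function name `simulate_tenure` is hypothetical.

```python
import math
import random

def simulate_tenure(n, shape=1.5, scale=10.0, hr_founder=0.5,
                    censor_time=15.0, seed=42):
    """Draw Weibull proportional-hazards times and apply administrative censoring."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        founder = 1 if rng.random() < 0.25 else 0
        lp = math.log(hr_founder) * founder      # shift on the log-hazard scale
        u = 1.0 - rng.random()                   # uniform on (0, 1], avoids log(0)
        # Invert S(t) = exp(-(t / scale)^shape * exp(lp)) at S(t) = u
        t_true = scale * (-math.log(u) * math.exp(-lp)) ** (1.0 / shape)
        t_obs = min(t_true, censor_time)
        event = 1 if t_true <= censor_time else 0
        rows.append((t_true, t_obs, event, founder))
    return rows

rows = simulate_tenure(3000)
mean_true = sum(r[0] for r in rows) / len(rows)
mean_obs = sum(r[1] for r in rows) / len(rows)
censor_rate = 1 - sum(r[2] for r in rows) / len(rows)
print(f"mean true tenure:     {mean_true:.2f}")
print(f"mean observed tenure: {mean_obs:.2f}")
print(f"censoring rate:       {censor_rate:.3f}")
```

Every censored tenure is recorded as 15 years even though the true value is longer, so the observed mean cannot exceed the true mean.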

Step 2: Kaplan-Meier Survival Curves

Before fitting the Cox model, visualize the survival function using the nonparametric Kaplan-Meier estimator. This step makes no assumptions about the hazard shape.

# Overall KM curve
km_fit <- survfit(Surv(tenure, event) ~ 1, data = df)
cat("Median survival time:", summary(km_fit)$table["median"], "years\n")

# KM curves by founder status
km_founder <- survfit(Surv(tenure, event) ~ founder, data = df)
ggsurvplot(km_founder,
           data = df,
           legend.labs = c("Non-founder", "Founder"),
           xlab = "Tenure (years)",
           ylab = "Survival probability",
           title = "CEO Tenure: Founder vs. Non-founder",
           risk.table = TRUE)

# Log-rank test: is there a statistically significant difference?
lr_test <- survdiff(Surv(tenure, event) ~ founder, data = df)
# lower.tail = FALSE keeps very small p-values from rounding to exactly 0
cat("\nLog-rank test p-value:",
    pchisq(lr_test$chisq, df = 1, lower.tail = FALSE), "\n")

Expected output:

Group	Median Tenure (years)
All CEOs	~7–9
Founders	~10–13
Non-founders	~6–8

The Kaplan-Meier curves should show that founder CEOs have substantially longer tenure (higher survival probability at every time point). The log-rank test should strongly reject the null of equal survival curves.
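To see what the Kaplan-Meier estimator computes, here is a minimal from-scratch sketch of the product-limit formula in standard-library Python. It is an illustration, not the `survfit` implementation, and the function name `kaplan_meier` is hypothetical. At each distinct event time the running survival probability is multiplied by one minus the share of at-risk subjects who experience the event. Censored subjects leave the risk set without triggering a drop.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set
        n_at_risk -= removed
    return curve

# Toy data: the subject observed for 2 years is censored (event = 0)
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored subject at year 2 counts in the denominator at year 1 but not at year 3, which is how the estimator uses the partial information that this tenure lasted at least 2 years.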

Concept Check

A CEO is still in office at the end of your 15-year observation window. In your dataset, this observation is right-censored. What does this tell us about this CEO's true tenure?

The CEO's true tenure is exactly 15 years.
We know the CEO's true tenure is at least 15 years, but we do not know the exact value.
The observation should be dropped from the analysis because we do not know the outcome.
The CEO's tenure should be coded as 15 years with the event indicator set to 1 (departed).

Step 3: Fit the Cox Proportional Hazard Model

The Cox PH model estimates how covariates shift the hazard rate multiplicatively, without specifying the baseline hazard function.
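The baseline hazard can stay unspecified because estimation uses the partial likelihood. At each event time it compares the covariates of the subject who departed with those of everyone still at risk. The sketch below is a minimal standard-library Python illustration for a single covariate, fitted by Newton-Raphson. It assumes no tied event times and is not how `coxph` is implemented internally. The function name `cox_fit_1d` and the simulation settings (2,000 subjects, founder as the only covariate) are assumptions made for the example.

```python
import math
import random

def cox_fit_1d(times, events, x, n_iter=25, tol=1e-9):
    """Newton-Raphson on the Cox partial log-likelihood for one covariate.

    Assumes continuous data with no tied event times.
    """
    # Descending time order: the subjects processed so far form the risk set
    order = sorted(range(len(times)), key=lambda i: -times[i])
    beta = 0.0
    for _ in range(n_iter):
        s0 = s1 = s2 = 0.0      # risk-set sums of w, w*x, w*x^2 with w = exp(beta*x)
        score = info = 0.0
        for i in order:
            w = math.exp(beta * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if events[i] == 1:
                mean = s1 / s0                  # weighted mean of x in the risk set
                score += x[i] - mean            # first derivative contribution
                info += s2 / s0 - mean * mean   # negative second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Simulate Weibull PH data with a true founder hazard ratio of 0.5
rng = random.Random(1)
times, events, founder = [], [], []
for _ in range(2000):
    f = 1 if rng.random() < 0.25 else 0
    u = 1.0 - rng.random()
    t = 10.0 * (-math.log(u) * math.exp(-math.log(0.5) * f)) ** (1 / 1.5)
    times.append(min(t, 15.0))
    events.append(1 if t <= 15.0 else 0)
    founder.append(f)

beta_hat = cox_fit_1d(times, events, founder)
print(f"estimated HR for founder: {math.exp(beta_hat):.3f} (true: 0.500)")
```

Censored subjects never contribute a term of their own, but they stay in the risk set sums until their censoring time. That is how the partial likelihood uses them without treating them as departures.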

# Fit Cox PH model
cox_model <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind + prior_exp,
                   data = df)
summary(cox_model)

# Extract hazard ratios
cat("\n=== Hazard Ratios ===\n")
hr <- exp(coef(cox_model))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "founder", hr["founder"], true_hr_founder))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "firm_size", hr["firm_size"], true_hr_firm_size))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "board_ind", hr["board_ind"], true_hr_board_ind))
cat(sprintf("%-12s HR = %.3f (true: %.3f)\n", "prior_exp", hr["prior_exp"], 1.0))

Expected output:

Variable	Estimated HR	True HR	Interpretation
founder	~0.50	0.50	Founders have 50% lower departure hazard
firm_size	~0.85	0.85	Larger firms: 15% lower hazard per log-asset unit
board_ind	~2.00	2.00	Independent boards double the departure hazard
prior_exp	~1.00	1.00	No true effect (should be insignificant)

The Cox model recovers hazard ratios close to the true values. A hazard ratio below 1 means lower risk of the event (longer survival); above 1 means higher risk (shorter survival).
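A hazard ratio always refers to a one-unit change in its covariate, so the unit matters when reading the table. Board independence is a fraction, so "one unit" means moving from 0 to 1, a change no firm in the simulated data makes. The short Python sketch below rescales a hazard ratio to a more realistic change. The helper name `rescale_hr` is hypothetical, and the inputs are the true values used in the simulation.

```python
import math

hr_board_ind = 2.0    # hazard ratio for a 0 -> 1 change in board independence
hr_firm_size = 0.85   # hazard ratio per one-unit increase in log assets

def rescale_hr(hr_per_unit, delta):
    """Hazard ratio implied by a change of `delta` units in the covariate."""
    return math.exp(math.log(hr_per_unit) * delta)

hr_board_10pt = rescale_hr(hr_board_ind, 0.10)   # 10 percentage points
hr_size_2units = rescale_hr(hr_firm_size, 2.0)   # two log-asset units
print(f"board_ind +0.10: HR = {hr_board_10pt:.3f}")
print(f"firm_size +2:    HR = {hr_size_2units:.3f}")
```

A 10-percentage-point increase in board independence raises the departure hazard by about 7% (2^0.1 ≈ 1.07), and two additional log-asset units lower it by about 28% (0.85^2 ≈ 0.72).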

Step 4: Interpret Hazard Ratios Carefully

# Predicted survival curves for specific profiles
# Profile 1: Non-founder, average firm, average board
# Profile 2: Founder, average firm, average board
new_data <- data.frame(
  founder   = c(0, 1),
  firm_size = rep(mean(df$firm_size), 2),
  board_ind = rep(mean(df$board_ind), 2),
  prior_exp = rep(median(df$prior_exp), 2)
)

surv_pred <- survfit(cox_model, newdata = new_data)

cat("Predicted median tenure:\n")
cat("  Non-founder:", summary(surv_pred)$table[1, "median"], "years\n")
cat("  Founder:    ", summary(surv_pred)$table[2, "median"], "years\n")

# Predicted 5-year and 10-year survival probabilities
s5 <- summary(surv_pred, times = 5)$surv
s10 <- summary(surv_pred, times = 10)$surv
cat("\n5-year survival probability:\n")
cat("  Non-founder:", round(s5[1], 3), "\n")
cat("  Founder:    ", round(s5[2], 3), "\n")
cat("\n10-year survival probability:\n")
cat("  Non-founder:", round(s10[1], 3), "\n")
cat("  Founder:    ", round(s10[2], 3), "\n")

Expected output:

Profile	Median Tenure	5-year Survival	10-year Survival
Non-founder (avg)	~6–8 years	~0.55–0.65	~0.20–0.35
Founder (avg)	~10–13 years	~0.75–0.85	~0.45–0.60

Founders have substantially longer predicted tenure and higher survival probabilities at every time horizon, consistent with the hazard ratio of 0.50.
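A hazard ratio is not a ratio of survival probabilities or of median tenures. Under proportional hazards the cumulative hazard is multiplied by the HR, which means the survival curves are related by S_founder(t) = S_non-founder(t)^HR when all other covariates are equal. The Python sketch below applies that identity to two illustrative baseline values. These are made-up inputs, not output from the fitted model, and the function name `survival_under_hr` is hypothetical.

```python
def survival_under_hr(baseline_survival, hazard_ratio):
    """S1(t) = S0(t) ** HR under proportional hazards."""
    return baseline_survival ** hazard_ratio

# Illustrative non-founder survival probabilities (assumed, not estimated)
for s0 in (0.60, 0.30):
    s1 = survival_under_hr(s0, 0.5)
    print(f"non-founder S(t) = {s0:.2f} -> founder S(t) = {s1:.3f}")
```

With an HR of 0.5 the founder survival probability is the square root of the non-founder one. The gap in survival probability therefore varies with the horizon even though the hazard ratio stays fixed.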

Step 5: Test the Proportional Hazards Assumption

The key assumption of the Cox model is that hazard ratios are constant over time. If the effect of a covariate changes over time, the PH assumption is violated.

# Schoenfeld residual test for PH assumption
ph_test <- cox.zph(cox_model)
print(ph_test)

cat("\nInterpretation:\n")
cat("p < 0.05 is evidence that the PH assumption is violated for that variable.\n")
cat("A non-significant global test gives no evidence against PH; it does not prove PH holds.\n")

# Plot Schoenfeld residuals over time for founder
# If PH holds, the smoothed line should be approximately flat
plot(ph_test[1], main = "Schoenfeld Residuals: Founder")
abline(h = coef(cox_model)["founder"], col = "red", lty = 2)

Expected output:

Variable	Test Statistic	p-value	PH holds?
founder	~0.5–2.0	> 0.05	Yes
firm_size	~0.5–2.0	> 0.05	Yes
board_ind	~0.5–2.0	> 0.05	Yes
prior_exp	~0.1–1.0	> 0.05	Yes
GLOBAL	~2.0–5.0	> 0.05	Yes

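Since we simulated data with constant hazard ratios, the PH assumption holds by construction, so the tests should generally be non-significant (p > 0.05). With five tests at the 5% level, an occasional small p-value can still occur by chance. The Schoenfeld residual plots should show approximately flat smoothed lines.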
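A Schoenfeld residual is defined at each event time. It is the covariate value of the subject who experienced the event minus the risk-weighted average of that covariate among everyone still at risk. If the coefficient is constant over time, these residuals show no trend against time. The standard-library Python sketch below computes the unscaled residuals for one covariate. It assumes no tied event times, and the function name `schoenfeld_residuals` is hypothetical. `cox.zph` works with a scaled version of these residuals and a transformed time axis, so this sketch shows the idea rather than reproducing its output.

```python
import math

def schoenfeld_residuals(times, events, x, beta):
    """Unscaled Schoenfeld residuals for one covariate (no tied event times).

    Returns [(event_time, residual)] sorted by time.
    """
    out = []
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # residuals are defined at event times only
        # Risk set: everyone still under observation at t_i
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk]
        expected = sum(w * x[j] for w, j in zip(weights, risk)) / sum(weights)
        out.append((t_i, x[i] - expected))
    return sorted(out)

# Toy data, evaluated at beta = 0 so the risk-set mean is a plain average
resid = schoenfeld_residuals([1, 2, 3], [1, 1, 1], [1, 0, 1], beta=0.0)
for t, r in resid:
    print(f"t = {t}: residual = {r:+.3f}")
```

In the toy data the residuals are +1/3, -1/2, and 0. The last subject is alone in the risk set, so the observed and expected covariate values coincide.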

Concept Check

The Schoenfeld residual test shows a statistically significant result (p = 0.01) for the variable 'board_ind'. What does this mean and what should you do?

The board independence variable should be dropped from the model.
The effect of board independence on CEO departure hazard is not constant over time. You should consider a time-varying coefficient, stratification, or splitting the observation period.
The Cox model is completely invalid and you should use a different method.
The result is due to multiple testing and can be safely ignored.

Step 6: What Happens If You Ignore Censoring

Let us demonstrate the bias that occurs when censoring is mishandled.

# Wrong approach 1: Treat censored as events (code as departed at censoring time)
df_wrong1 <- df
df_wrong1$event <- 1  # Pretend all CEOs departed
cox_wrong1 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                    data = df_wrong1)

# Wrong approach 2: Drop censored observations
df_wrong2 <- df[df$event == 1, ]
cox_wrong2 <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                    data = df_wrong2)

# Correct approach
cox_correct <- coxph(Surv(tenure, event) ~ founder + firm_size + board_ind,
                     data = df)

cat("=== Hazard Ratios: Different Approaches ===\n")
cat(sprintf("%-20s %10s %10s %10s %10s\n",
            "Variable", "Correct", "No Censor", "Drop Cens", "True HR"))
vars <- c("founder", "firm_size", "board_ind")
true_hrs <- c(true_hr_founder, true_hr_firm_size, true_hr_board_ind)
for (i in seq_along(vars)) {
  cat(sprintf("%-20s %10.3f %10.3f %10.3f %10.3f\n",
              vars[i],
              exp(coef(cox_correct)[vars[i]]),
              exp(coef(cox_wrong1)[vars[i]]),
              exp(coef(cox_wrong2)[vars[i]]),
              true_hrs[i]))
}

Expected output:

Variable	Correct HR	Ignore Censoring	Drop Censored	True HR
founder	~0.50	Biased	Biased	0.50
firm_size	~0.85	Biased	Biased	0.85
board_ind	~2.00	Biased	Biased	2.00

Both wrong approaches produce biased hazard ratios. Treating censored observations as events underestimates the protective effect of founder status (biases the HR toward 1). Dropping censored observations creates selection bias by removing the longest-surviving CEOs.

Step 7: Model Diagnostics

# Concordance index (C-statistic)
# Measures discrimination: probability that a subject with higher
# predicted risk actually experiences the event sooner
cat("Concordance (C-index):", cox_model$concordance["concordance"], "\n")
cat("(0.5 = random, 1.0 = perfect discrimination)\n")

# Martingale residuals: check for nonlinearity
# Plot against continuous covariates; should show no pattern
mart_resid <- residuals(cox_model, type = "martingale")
plot(df$firm_size, mart_resid,
     xlab = "Firm Size (log assets)", ylab = "Martingale Residual",
     main = "Check for Nonlinearity: Firm Size")
lines(lowess(df$firm_size, mart_resid), col = "red", lwd = 2)
abline(h = 0, lty = 2)

# Deviance residuals: check for outliers
dev_resid <- residuals(cox_model, type = "deviance")
cat("\nDeviance residual range:", range(dev_resid), "\n")
cat("Observations with |deviance resid| > 3:", sum(abs(dev_resid) > 3), "\n")

Expected output:

Diagnostic	Value	Interpretation
C-index	~0.65–0.75	Reasonable discrimination
PH test (global)	p > 0.05	PH assumption holds
Deviance residual range	~[-2.5, 3.0]	No extreme outliers
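The concordance index can be computed by hand on a small example. The Python sketch below is a minimal version of Harrell's C. The function name `concordance_index` and the toy inputs are assumptions for illustration, and the `survival` package may treat ties and weighting differently in its own computation. A pair of subjects is comparable only when the one with the shorter time had an observed event; a pair whose shorter time is censored carries no information about ordering.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i has an observed event strictly before j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties in predicted risk count half
    return concordant / comparable

# Toy data: the subject observed for 6 years is censored
c = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                      risk_scores=[0.9, 0.3, 0.5, 0.1])
print(c)  # 0.8
```

Five pairs are comparable in this example and four are ordered correctly, giving C = 0.8. The one discordant pair is the subject who departed at year 4 with a lower risk score than the subject still in office at year 6.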

Step 8: Exercises

Guided Exercise

Interpreting Hazard Ratios

You estimate a Cox PH model of CEO tenure. The hazard ratio for 'board_ind' (board independence, ranging from 0 to 1) is 2.0, and the hazard ratio for 'founder' (binary, 0 or 1) is 0.5.

Time-varying covariates. Modify the simulation so that board independence increases midway through a CEO's tenure. Fit a model with time-varying board_ind.
Stratified Cox model. If the PH assumption fails for founder status, stratify the model by founder. This specification allows each group to have its own baseline hazard.
Parametric alternatives. Fit a Weibull or exponential model to the same data. Compare the hazard ratios with the Cox model. Since we simulated from a Weibull, the parametric model should be more efficient.
Competing risks. In practice, a CEO can depart for different reasons (forced out, retirement, moving to another firm). Modify the simulation to include competing risks and estimate cause-specific hazard models.

Key Takeaways

Survival analysis studies the time until an event occurs, properly handling right-censoring (observations where the event has not yet occurred)
The Kaplan-Meier estimator provides nonparametric survival curve estimates; the log-rank test compares curves across groups
The Cox PH model estimates how covariates multiplicatively shift the hazard rate without assuming a specific baseline hazard shape
Hazard ratios (HR) are the key output: HR < 1 means lower risk (longer survival), HR > 1 means higher risk (shorter survival)
The proportional hazards assumption — that HRs are constant over time — should be tested using Schoenfeld residuals
Ignoring censoring (treating censored as events or dropping them) produces biased estimates
The concordance index (C-index) measures how well the model discriminates between individuals who experience the event sooner versus later
Always report the number of events, censoring rate, and results of the PH assumption test
