27 Survival analysis
27.1 Overview
Survival analysis focuses on describing for a given individual or group of individuals, a defined point of event called the failure (occurrence of a disease, cure from a disease, death, relapse after response to treatment…) that occurs after a period of time called failure time (or follow-up time in cohort/population-based studies) during which individuals are observed. To determine the failure time, it is then necessary to define a time of origin (that can be the inclusion date, the date of diagnosis…).
The target of inference for survival analysis is then the time between an origin and an event. In current medical research, it is widely used in clinical studies to assess the effect of a treatment for instance, or in cancer epidemiology to assess a large variety of cancer survival measures.
It is usually expressed through the survival probability which is the probability that the event of interest has not occurred by a duration t.
Censoring: Censoring occurs when at the end of follow-up, some of the individuals have not had the event of interest, and thus their true time to event is unknown. We will mostly focus on right censoring here but for more details on censoring and survival analysis in general, you can see references.
27.2 Preparation
Load packages
To run survival analyses in R, one the most widely used package is the survival package. We first install it and then load it as well as the other packages that will be used in this section:
In this handbook we emphasize p_load()
from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library()
from base R. See the page on R basics for more information on R packages.
This page explores survival analyses using the linelist used in most of the previous pages and on which we apply some changes to have a proper survival data.
Import dataset
We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import()
function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).
# import linelist
<- rio::import("linelist_cleaned.rds") linelist_case_data
Data management and transformation
In short, survival data can be described as having the following three characteristics:
- the dependent variable or response is the waiting time until the occurrence of a well-defined event,
- observations are censored, in the sense that for some units the event of interest has not occurred at the time the data are analyzed, and
- there are predictors or explanatory variables whose effect on the waiting time we wish to assess or control.
Thus, we will create different variables needed to respect that structure and run the survival analysis.
We define:
- a new data frame
linelist_surv
for this analysis
- our event of interest as being “death” (hence our survival probability will be the probability of being alive after a certain time after the time of origin),
- the follow-up time (
futime
) as the time between the time of onset and the time of outcome in days, - censored patients as those who recovered or for whom the final outcome is not known ie the event “death” was not observed (
event=0
).
CAUTION: Since in a real cohort study, the information on the time of origin and the end of the follow-up is known given individuals are observed, we will remove observations where the date of onset or the date of outcome is unknown. Also the cases where the date of onset is later than the date of outcome will be removed since they are considered as wrong.
TIP: Given that filtering to greater than (>) or less than (<) a date can remove rows with missing values, applying the filter on the wrong dates will also remove the rows with missing dates.
We then use case_when()
to create a column age_cat_small
in which there are only 3 age categories.
#create a new data called linelist_surv from the linelist_case_data
<- linelist_case_data %>%
linelist_surv
::filter(
dplyr# remove observations with wrong or missing dates of onset or date of outcome
> date_onset) %>%
date_outcome
::mutate(
dplyr# create the event var which is 1 if the patient died and 0 if he was right censored
event = ifelse(is.na(outcome) | outcome == "Recover", 0, 1),
# create the var on the follow-up time in days
futime = as.double(date_outcome - date_onset),
# create a new age category variable with only 3 strata levels
age_cat_small = dplyr::case_when(
< 5 ~ "0-4",
age_years >= 5 & age_years < 20 ~ "5-19",
age_years >= 20 ~ "20+"),
age_years
# previous step created age_cat_small var as character.
# now convert it to factor and specify the levels.
# Note that the NA values remain NA's and are not put in a level "unknown" for example,
# since in the next analyses they have to be removed.
age_cat_small = fct_relevel(age_cat_small, "0-4", "5-19", "20+")
)
TIP: We can verify the new columns we have created by doing a summary on the futime
and a cross-tabulation between event
and outcome
from which it was created. Besides this verification it is a good habit to communicate the median follow-up time when interpreting survival analysis results.
summary(linelist_surv$futime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 6.00 10.00 11.98 16.00 64.00
# cross tabulate the new event var and the outcome var from which it was created
# to make sure the code did what it was intended to
%>%
linelist_surv tabyl(outcome, event)
outcome 0 1
Death 0 1952
Recover 1547 0
<NA> 1040 0
Now we cross-tabulate the new age_cat_small var and the old age_cat col to ensure correct assingments
%>%
linelist_surv tabyl(age_cat_small, age_cat)
age_cat_small 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
0-4 834 0 0 0 0 0 0 0 0
5-19 0 852 717 575 0 0 0 0 0
20+ 0 0 0 0 862 554 69 5 0
<NA> 0 0 0 0 0 0 0 0 71
Now we review the 10 first observations of the linelist_surv
data looking at specific variables (including those newly created).
%>%
linelist_surv select(case_id, age_cat_small, date_onset, date_outcome, outcome, event, futime) %>%
head(10)
case_id age_cat_small date_onset date_outcome outcome event futime
1 8689b7 0-4 2014-05-13 2014-05-18 Recover 0 5
2 11f8ea 20+ 2014-05-16 2014-05-30 Recover 0 14
3 893f25 0-4 2014-05-21 2014-05-29 Recover 0 8
4 be99c8 5-19 2014-05-22 2014-05-24 Recover 0 2
5 07e3e8 5-19 2014-05-27 2014-06-01 Recover 0 5
6 369449 0-4 2014-06-02 2014-06-07 Death 1 5
7 f393b4 20+ 2014-06-05 2014-06-18 Recover 0 13
8 1389ca 20+ 2014-06-05 2014-06-09 Death 1 4
9 2978ac 5-19 2014-06-06 2014-06-15 Death 1 9
10 fc15ef 5-19 2014-06-16 2014-07-09 Recover 0 23
We can also cross-tabulate the columns age_cat_small
and gender
to have more details on the distribution of this new column by gender. We use tabyl()
and the adorn functions from janitor as described in the Descriptive tables page.
%>%
linelist_surv tabyl(gender, age_cat_small, show_na = F) %>%
adorn_totals(where = "both") %>%
adorn_percentages() %>%
adorn_pct_formatting() %>%
adorn_ns(position = "front")
gender 0-4 5-19 20+ Total
f 482 (22.4%) 1,184 (54.9%) 490 (22.7%) 2,156 (100.0%)
m 325 (15.0%) 880 (40.6%) 960 (44.3%) 2,165 (100.0%)
Total 807 (18.7%) 2,064 (47.8%) 1,450 (33.6%) 4,321 (100.0%)
27.3 Basics of survival analysis
Building a surv-type object
We will first use Surv()
from survival to build a survival object from the follow-up time and event columns.
The result of such a step is to produce an object of type Surv that condenses the time information and whether the event of interest (death) was observed. This object will ultimately be used in the right-hand side of subsequent model formulae (see documentation).
# Use Suv() syntax for right-censored data
<- Surv(time = linelist_surv$futime,
survobj event = linelist_surv$event)
To review, here are the first 10 rows of the linelist_surv
data, viewing only some important columns.
%>%
linelist_surv select(case_id, date_onset, date_outcome, futime, outcome, event) %>%
head(10)
case_id date_onset date_outcome futime outcome event
1 8689b7 2014-05-13 2014-05-18 5 Recover 0
2 11f8ea 2014-05-16 2014-05-30 14 Recover 0
3 893f25 2014-05-21 2014-05-29 8 Recover 0
4 be99c8 2014-05-22 2014-05-24 2 Recover 0
5 07e3e8 2014-05-27 2014-06-01 5 Recover 0
6 369449 2014-06-02 2014-06-07 5 Death 1
7 f393b4 2014-06-05 2014-06-18 13 Recover 0
8 1389ca 2014-06-05 2014-06-09 4 Death 1
9 2978ac 2014-06-06 2014-06-15 9 Death 1
10 fc15ef 2014-06-16 2014-07-09 23 Recover 0
And here are the first 10 elements of survobj
. It prints as essentially a vector of follow-up time, with “+” to represent if an observation was right-censored. See how the numbers align above and below.
#print the 50 first elements of the vector to see how it presents
head(survobj, 10)
[1] 5+ 14+ 8+ 2+ 5+ 5 13+ 4 9 23+
Running initial analyses
We then start our analysis using the survfit()
function to produce a survfit object, which fits the default calculations for Kaplan Meier (KM) estimates of the overall (marginal) survival curve, which are in fact a step function with jumps at observed event times. The final survfit object contains one or more survival curves and is created using the Surv object as a response variable in the model formula.
NOTE: The Kaplan-Meier estimate is a nonparametric maximum likelihood estimate (MLE) of the survival function. . (see resources for more information).
The summary of this survfit object will give what is called a life table. For each time step of the follow-up (time
) where an event happened (in ascending order):
- the number of people who were at risk of developing the event (people who did not have the event yet nor were censored:
n.risk
)
- those who did develop the event (
n.event
)
- and from the above: the probability of not developing the event (probability of not dying, or of surviving past that specific time)
- finally, the standard error and the confidence interval for that probability are derived and displayed
We fit the KM estimates using the formula where the previously Surv object “survobj” is the response variable. “~ 1” precises we run the model for the overall survival.
# fit the KM estimates using a formula where the Surv object "survobj" is the response variable.
# "~ 1" signifies that we run the model for the overall survival
<- survival::survfit(survobj ~ 1)
linelistsurv_fit
#print its summary for more details
summary(linelistsurv_fit)
Call: survfit(formula = survobj ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 4539 30 0.993 0.00120 0.991 0.996
2 4500 69 0.978 0.00217 0.974 0.982
3 4394 149 0.945 0.00340 0.938 0.952
4 4176 194 0.901 0.00447 0.892 0.910
5 3899 214 0.852 0.00535 0.841 0.862
6 3592 210 0.802 0.00604 0.790 0.814
7 3223 179 0.757 0.00656 0.745 0.770
8 2899 167 0.714 0.00700 0.700 0.728
9 2593 145 0.674 0.00735 0.660 0.688
10 2311 109 0.642 0.00761 0.627 0.657
11 2081 119 0.605 0.00788 0.590 0.621
12 1843 89 0.576 0.00809 0.560 0.592
13 1608 55 0.556 0.00823 0.540 0.573
14 1448 43 0.540 0.00837 0.524 0.556
15 1296 31 0.527 0.00848 0.511 0.544
16 1152 48 0.505 0.00870 0.488 0.522
17 1002 29 0.490 0.00886 0.473 0.508
18 898 21 0.479 0.00900 0.462 0.497
19 798 7 0.475 0.00906 0.457 0.493
20 705 4 0.472 0.00911 0.454 0.490
21 626 13 0.462 0.00932 0.444 0.481
22 546 8 0.455 0.00948 0.437 0.474
23 481 5 0.451 0.00962 0.432 0.470
24 436 4 0.447 0.00975 0.428 0.466
25 378 4 0.442 0.00993 0.423 0.462
26 336 3 0.438 0.01010 0.419 0.458
27 297 1 0.436 0.01017 0.417 0.457
29 235 1 0.435 0.01030 0.415 0.455
38 73 1 0.429 0.01175 0.406 0.452
While using summary()
we can add the option times
and specify certain times at which we want to see the survival information
#print its summary at specific times
summary(linelistsurv_fit, times = c(5,10,20,30,60))
Call: survfit(formula = survobj ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 3899 656 0.852 0.00535 0.841 0.862
10 2311 810 0.642 0.00761 0.627 0.657
20 705 446 0.472 0.00911 0.454 0.490
30 210 39 0.435 0.01030 0.415 0.455
60 2 1 0.429 0.01175 0.406 0.452
We can also use the print()
function. The print.rmean = TRUE
argument is used to obtain the mean survival time and its standard error (se).
NOTE: The restricted mean survival time (RMST) is a specific survival measure more and more used in cancer survival analysis and which is often defined as the area under the survival curve, given we observe patients up to restricted time T (more details in Resources section).
# print linelistsurv_fit object with mean survival time and its se.
print(linelistsurv_fit, print.rmean = TRUE)
Call: survfit(formula = survobj ~ 1)
n events rmean* se(rmean) median 0.95LCL 0.95UCL
[1,] 4539 1952 33.1 0.539 17 16 18
* restricted mean with upper limit = 64
TIP: We can create the surv object directly in the survfit()
function and save a line of code. This will then look like: linelistsurv_quick <- survfit(Surv(futime, event) ~ 1, data=linelist_surv)
.
Cumulative hazard
Besides the summary()
function, we can also use the str()
function that gives more details on the structure of the survfit()
object. It is a list of 16 elements.
Among these elements is an important one: cumhaz
, which is a numeric vector. This could be plotted to allow show the cumulative hazard, with the hazard being the instantaneous rate of event occurrence (see references).
str(linelistsurv_fit)
List of 16
$ n : int 4539
$ time : num [1:59] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:59] 4539 4500 4394 4176 3899 ...
$ n.event : num [1:59] 30 69 149 194 214 210 179 167 145 109 ...
$ n.censor : num [1:59] 9 37 69 83 93 159 145 139 137 121 ...
$ surv : num [1:59] 0.993 0.978 0.945 0.901 0.852 ...
$ std.err : num [1:59] 0.00121 0.00222 0.00359 0.00496 0.00628 ...
$ cumhaz : num [1:59] 0.00661 0.02194 0.05585 0.10231 0.15719 ...
$ std.chaz : num [1:59] 0.00121 0.00221 0.00355 0.00487 0.00615 ...
$ type : chr "right"
$ logse : logi TRUE
$ conf.int : num 0.95
$ conf.type: chr "log"
$ lower : num [1:59] 0.991 0.974 0.938 0.892 0.841 ...
$ upper : num [1:59] 0.996 0.982 0.952 0.91 0.862 ...
$ call : language survfit(formula = survobj ~ 1)
- attr(*, "class")= chr "survfit"
Plotting Kaplan-Meir curves
Once the KM estimates are fitted, we can visualize the probability of being alive through a given time using the basic plot()
function that draws the “Kaplan-Meier curve”. In other words, the curve below is a conventional illustration of the survival experience in the whole patient group.
We can quickly verify the follow-up time min and max on the curve.
An easy way to interpret is to say that at time zero, all the participants are still alive and survival probability is then 100%. This probability decreases over time as patients die. The proportion of participants surviving past 60 days of follow-up is around 40%.
plot(linelistsurv_fit,
xlab = "Days of follow-up", # x-axis label
ylab="Survival Probability", # y-axis label
main= "Overall survival curve" # figure title
)
The confidence interval of the KM survival estimates are also plotted by default and can be dismissed by adding the option conf.int = FALSE
to the plot()
command.
Since the event of interest is “death”, drawing a curve describing the complements of the survival proportions will lead to drawing the cumulative mortality proportions. This can be done with lines()
, which adds information to an existing plot.
# original plot
plot(
linelistsurv_fit,xlab = "Days of follow-up",
ylab = "Survival Probability",
mark.time = TRUE, # mark events on the curve: a "+" is printed at every event
conf.int = FALSE, # do not plot the confidence interval
main = "Overall survival curve and cumulative mortality"
)
# draw an additional curve to the previous plot
lines(
linelistsurv_fit,lty = 3, # use different line type for clarity
fun = "event", # draw the cumulative events instead of the survival
mark.time = FALSE,
conf.int = FALSE
)
# add a legend to the plot
legend(
"topright", # position of legend
legend = c("Survival", "Cum. Mortality"), # legend text
lty = c(1, 3), # line types to use in the legend
cex = .85, # parametes that defines size of legend text
bty = "n" # no box type to be drawn for the legend
)
27.4 Comparison of survival curves
To compare the survival within different groups of our observed participants or patients, we might need to first look at their respective survival curves and then run tests to evaluate the difference between independent groups. This comparison can concern groups based on gender, age, treatment, comorbidity…
Log rank test
The log rank test is a popular test that compares the entire survival experience between two or more independent groups and can be thought of as a test of whether the survival curves are identical (overlapping) or not (null hypothesis of no difference in survival between the groups). The survdiff()
function of the survival package allows running the log-rank test when we specify rho = 0
(which is the default). The test results gives a chi-square statistic along with a p-value since the log rank statistic is approximately distributed as a chi-square test statistic.
We first try to compare the survival curves by gender group. For this, we first try to visualize it (check whether the two survival curves are overlapping). A new survfit object will be created with a slightly different formula. Then the survdiff object will be created.
By supplying ~ gender
as the right side of the formula, we no longer plot the overall survival but instead by gender.
# create the new survfit object based on gender
<- survfit(Surv(futime, event) ~ gender, data = linelist_surv) linelistsurv_fit_sex
Now we can plot the survival curves by gender. Have a look at the order of the strata levels in the gender column before defining your colors and legend.
# set colors
<- c("lightgreen", "darkgreen")
col_sex
# create plot
plot(
linelistsurv_fit_sex,col = col_sex,
xlab = "Days of follow-up",
ylab = "Survival Probability")
# add legend
legend(
"topright",
legend = c("Female","Male"),
col = col_sex,
lty = 1,
cex = .9,
bty = "n")
And now we can compute the test of the difference between the survival curves using survdiff()
#compute the test of the difference between the survival curves
::survdiff(
survivalSurv(futime, event) ~ gender,
data = linelist_surv
)
Call:
survival::survdiff(formula = Surv(futime, event) ~ gender, data = linelist_surv)
n=4321, 218 observations deleted due to missingness.
N Observed Expected (O-E)^2/E (O-E)^2/V
gender=f 2156 924 909 0.255 0.524
gender=m 2165 929 944 0.245 0.524
Chisq= 0.5 on 1 degrees of freedom, p= 0.5
We see that the survival curve for women and the one for men overlap and the log-rank test does not give evidence of a survival difference between women and men.
Some other R packages allow illustrating survival curves for different groups and testing the difference all at once. Using the ggsurvplot()
function from the survminer package, we can also include in our curve the printed risk tables for each group, as well the p-value from the log-rank test.
CAUTION: survminer functions require that you specify the survival object and again specify the data used to fit the survival object. Remember to do this to avoid non-specific error messages.
::ggsurvplot(
survminer
linelistsurv_fit_sex, data = linelist_surv, # again specify the data used to fit linelistsurv_fit_sex
conf.int = FALSE, # do not show confidence interval of KM estimates
surv.scale = "percent", # present probabilities in the y axis in %
break.time.by = 10, # present the time axis with an increment of 10 days
xlab = "Follow-up days",
ylab = "Survival Probability",
pval = T, # print p-value of Log-rank test
pval.coord = c(40,.91), # print p-value at these plot coordinates
risk.table = T, # print the risk table at bottom
legend.title = "Gender", # legend characteristics
legend.labs = c("Female","Male"),
font.legend = 10,
palette = "Dark2", # specify color palette
surv.median.line = "hv", # draw horizontal and vertical lines to the median survivals
ggtheme = theme_light() # simplify plot background
)
We may also want to test for differences in survival by the source of infection (source of contamination).
In this case, the Log rank test gives enough evidence of a difference in the survival probabilities at alpha= 0.005
. The survival probabilities for patients that were infected at funerals are higher than the survival probabilities for patients that got infected in other places, suggesting a survival benefit.
<- survfit(
linelistsurv_fit_source Surv(futime, event) ~ source,
data = linelist_surv
)
# plot
ggsurvplot(
linelistsurv_fit_source,data = linelist_surv,
size = 1, linetype = "strata", # line types
conf.int = T,
surv.scale = "percent",
break.time.by = 10,
xlab = "Follow-up days",
ylab= "Survival Probability",
pval = T,
pval.coord = c(40,.91),
risk.table = T,
legend.title = "Source of \ninfection",
legend.labs = c("Funeral", "Other"),
font.legend = 10,
palette = c("#E7B800","#3E606F"),
surv.median.line = "hv",
ggtheme = theme_light()
)
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Did you mean to use `annotate()`?
All aesthetics have length 1, but the data has 2 rows.
ℹ Did you mean to use `annotate()`?
All aesthetics have length 1, but the data has 2 rows.
ℹ Did you mean to use `annotate()`?
All aesthetics have length 1, but the data has 2 rows.
ℹ Did you mean to use `annotate()`?
27.5 Cox regression analysis
Cox proportional hazards regression is one of the most popular regression techniques for survival analysis. Other models can also be used since the Cox model requires important assumptions that need to be verified for an appropriate use such as the proportional hazards assumption: see references.
In a Cox proportional hazards regression model, the measure of effect is the hazard rate (HR), which is the risk of failure (or the risk of death in our example), given that the participant has survived up to a specific time. Usually, we are interested in comparing independent groups with respect to their hazards, and we use a hazard ratio, which is analogous to an odds ratio in the setting of multiple logistic regression analysis. The cox.ph()
function from the survival package is used to fit the model. The function cox.zph()
from survival package may be used to test the proportional hazards assumption for a Cox regression model fit.
NOTE: A probability must lie in the range 0 to 1. However, the hazard represents the expected number of events per one unit of time.
- If the hazard ratio for a predictor is close to 1 then that predictor does not affect survival,
- if the HR is less than 1, then the predictor is protective (i.e., associated with improved survival),
- and if the HR is greater than 1, then the predictor is associated with increased risk (or decreased survival).
Fitting a Cox model
We can first fit a model to assess the effect of age and gender on the survival. By just printing the model, we have the information on:
- the estimated regression coefficients
coef
which quantifies the association between the predictors and the outcome, - their exponential (for interpretability,
exp(coef)
) which produces the hazard ratio, - their standard error
se(coef)
, - the z-score: how many standard errors is the estimated coefficient away from 0,
- and the p-value: the probability that the estimated coefficient could be 0.
The summary()
function applied to the cox model object gives more information, such as the confidence interval of the estimated HR and the different test scores.
The effect of the first covariate gender
is presented in the first row. genderm
(male) is printed, implying that the first strata level (“f”), i.e the female group, is the reference group for the gender. Thus the interpretation of the test parameter is that of men compared to women. The p-value indicates there was not enough evidence of an effect of the gender on the expected hazard or of an association between gender and all-cause mortality.
The same lack of evidence is noted regarding age-group.
#fitting the cox model
<- survival::coxph(
linelistsurv_cox_sexage Surv(futime, event) ~ gender + age_cat_small,
data = linelist_surv
)
#printing the model fitted
linelistsurv_cox_sexage
Call:
survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small,
data = linelist_surv)
coef exp(coef) se(coef) z p
genderm -0.03149 0.96900 0.04767 -0.661 0.509
age_cat_small5-19 0.09400 1.09856 0.06454 1.456 0.145
age_cat_small20+ 0.05032 1.05161 0.06953 0.724 0.469
Likelihood ratio test=2.8 on 3 df, p=0.4243
n= 4321, number of events= 1853
(218 observations deleted due to missingness)
#summary of the model
summary(linelistsurv_cox_sexage)
Call:
survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small,
data = linelist_surv)
n= 4321, number of events= 1853
(218 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
genderm -0.03149 0.96900 0.04767 -0.661 0.509
age_cat_small5-19 0.09400 1.09856 0.06454 1.456 0.145
age_cat_small20+ 0.05032 1.05161 0.06953 0.724 0.469
exp(coef) exp(-coef) lower .95 upper .95
genderm 0.969 1.0320 0.8826 1.064
age_cat_small5-19 1.099 0.9103 0.9680 1.247
age_cat_small20+ 1.052 0.9509 0.9176 1.205
Concordance= 0.514 (se = 0.007 )
Likelihood ratio test= 2.8 on 3 df, p=0.4
Wald test = 2.78 on 3 df, p=0.4
Score (logrank) test = 2.78 on 3 df, p=0.4
It was interesting to run the model and look at the results but a first look to verify whether the proportional hazards assumptions is respected could help saving time.
<- survival::cox.zph(linelistsurv_cox_sexage)
test_ph_sexage test_ph_sexage
chisq df p
gender 0.454 1 0.50
age_cat_small 0.838 2 0.66
GLOBAL 1.399 3 0.71
NOTE: A second argument called method can be specified when computing the cox model, that determines how ties are handled. The default is “efron”, and the other options are “breslow” and “exact”.
In another model we add more risk factors such as the source of infection and the number of days between date of onset and admission. This time, we first verify the proportional hazards assumption before going forward.
In this model, we have included a continuous predictor (days_onset_hosp
). In this case we interpret the parameter estimates as the increase in the expected log of the relative hazard for each one unit increase in the predictor, holding other predictors constant. We first verify the proportional hazards assumption.
#fit the model
<- coxph(
linelistsurv_cox Surv(futime, event) ~ gender + age_years+ source + days_onset_hosp,
data = linelist_surv
)
#test the proportional hazard model
<- cox.zph(linelistsurv_cox)
linelistsurv_ph_test linelistsurv_ph_test
chisq df p
gender 0.45062 1 0.50
age_years 0.00199 1 0.96
source 1.79622 1 0.18
days_onset_hosp 31.66167 1 1.8e-08
GLOBAL 34.08502 4 7.2e-07
The graphical verification of this assumption may be performed with the function ggcoxzph()
from the survminer package.
::ggcoxzph(linelistsurv_ph_test) survminer
The model results indicate there is a negative association between onset to admission duration and all-cause mortality. The expected hazard is 0.9 times lower in a person who who is one day later admitted than another, holding gender constant. Or in a more straightforward explanation, a one unit increase in the duration of onset to admission is associated with a 10.7% (coef *100
) decrease in the risk of death.
Results show also a positive association between the source of infection and the all-cause mortality. Which is to say there is an increased risk of death (1.21x) for patients that got a source of infection other than funerals.
#print the summary of the model
summary(linelistsurv_cox)
Call:
coxph(formula = Surv(futime, event) ~ gender + age_years + source +
days_onset_hosp, data = linelist_surv)
n= 2772, number of events= 1180
(1767 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
genderm 0.004710 1.004721 0.060827 0.077 0.9383
age_years -0.002249 0.997753 0.002421 -0.929 0.3528
sourceother 0.178393 1.195295 0.084291 2.116 0.0343 *
days_onset_hosp -0.104063 0.901169 0.014245 -7.305 2.77e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderm 1.0047 0.9953 0.8918 1.1319
age_years 0.9978 1.0023 0.9930 1.0025
sourceother 1.1953 0.8366 1.0133 1.4100
days_onset_hosp 0.9012 1.1097 0.8764 0.9267
Concordance= 0.566 (se = 0.009 )
Likelihood ratio test= 71.31 on 4 df, p=1e-14
Wald test = 59.22 on 4 df, p=4e-12
Score (logrank) test = 59.54 on 4 df, p=4e-12
We can verify this relationship with a table:
%>%
linelist_case_data tabyl(days_onset_hosp, outcome) %>%
adorn_percentages() %>%
adorn_pct_formatting()
days_onset_hosp Death Recover NA_
0 44.3% 31.4% 24.3%
1 46.6% 32.2% 21.2%
2 43.0% 32.8% 24.2%
3 45.0% 32.3% 22.7%
4 41.5% 38.3% 20.2%
5 40.0% 36.2% 23.8%
6 32.2% 48.7% 19.1%
7 31.8% 38.6% 29.5%
8 29.8% 38.6% 31.6%
9 30.3% 51.5% 18.2%
10 16.7% 58.3% 25.0%
11 36.4% 45.5% 18.2%
12 18.8% 62.5% 18.8%
13 10.0% 60.0% 30.0%
14 10.0% 50.0% 40.0%
15 28.6% 42.9% 28.6%
16 20.0% 80.0% 0.0%
17 0.0% 100.0% 0.0%
18 0.0% 100.0% 0.0%
22 0.0% 100.0% 0.0%
NA 52.7% 31.2% 16.0%
We would need to consider and investigate why this association exists in the data. One possible explanation could be that patients who live long enough to be admitted later had less severe disease to begin with. Another perhaps more likely explanation is that since we used a simulated fake dataset, this pattern does not reflect reality!
Forest plots
We can then visualize the results of the cox model using the practical forest plots with the ggforest()
function of the survminer package.
ggforest(linelistsurv_cox, data = linelist_surv)
27.6 Time-dependent covariates in survival models
Some of the following sections have been adapted with permission from an excellent introduction to survival analysis in R by Dr. Emily Zabor
In the last section we covered using Cox regression to examine associations between covariates of interest and survival outcomes.But these analyses rely on the covariate being measured at baseline, that is, before follow-up time for the event begins.
What happens if you are interested in a covariate that is measured after follow-up time begins? Or, what if you have a covariate that can change over time?
For example, maybe you are working with clinical data where you repeated measures of hospital laboratory values that can change over time. This is an example of a Time Dependent Covariate. In order to address this you need a special setup, but fortunately the cox model is very flexible and this type of data can also be modeled with tools from the survival package.
Time-dependent covariate setup
Analysis of time-dependent covariates in R requires setup of a special dataset. If interested, see the more detailed paper on this by the author of the survival package Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model.
For this, we’ll use a new dataset from the SemiCompRisks
package named BMT
, which includes data on 137 bone marrow transplant patients. The variables we’ll focus on are:
T1
- time (in days) to death or last follow-up
delta1
- death indicator; 1-Dead, 0-Alive
TA
- time (in days) to acute graft-versus-host disease
deltaA
- acute graft-versus-host disease indicator;- 1 - Developed acute graft-versus-host disease
- 0 - Never developed acute graft-versus-host disease
- 1 - Developed acute graft-versus-host disease
We’ll load this dataset from the survival package using the base R command data()
, which can be used for loading data that is already included in a R package that is loaded. The data frame BMT
will appear in your R environment.
data(BMT, package = "SemiCompRisks")
Add unique patient identifier
There is no unique ID column in the BMT
data, which is needed to create the type of dataset we want. So we use the function rowid_to_column()
from the tidyverse package tibble to create a new id column called my_id
(adds column at start of data frame with sequential row ids, starting at 1). We name the data frame bmt
.
<- rowid_to_column(BMT, "my_id") bmt
The dataset now looks like this:
Expand patient rows
Next, we’ll use the tmerge()
function with the event()
and tdc()
helper functions to create the restructured dataset. Our goal is to restructure the dataset to create a separate row for each patient for each time interval where they have a different value for deltaA
. In this case, each patient can have at most two rows depending on whether they developed acute graft-versus-host disease during the data collection period. We’ll call our new indicator for the development of acute graft-versus-host disease agvhd
.
tmerge()
creates a long dataset with multiple time intervals for the different covariate values for each patientevent()
creates the new event indicator to go with the newly-created time intervalstdc()
creates the time-dependent covariate column,agvhd
, to go with the newly created time intervals
<-
td_dat tmerge(
data1 = bmt %>% select(my_id, T1, delta1),
data2 = bmt %>% select(my_id, T1, delta1, TA, deltaA),
id = my_id,
death = event(T1, delta1),
agvhd = tdc(TA)
)
To see what this does, let’s look at the data for the first 5 individual patients.
The variables of interest in the original data looked like this:
%>%
bmt select(my_id, T1, delta1, TA, deltaA) %>%
filter(my_id %in% seq(1, 5))
my_id T1 delta1 TA deltaA
1 1 2081 0 67 1
2 2 1602 0 1602 0
3 3 1496 0 1496 0
4 4 1462 0 70 1
5 5 1433 0 1433 0
The new dataset for these same patients looks like this:
%>%
td_dat filter(my_id %in% seq(1, 5))
my_id T1 delta1 tstart tstop death agvhd
1 1 2081 0 0 67 0 0
2 1 2081 0 67 2081 0 1
3 2 1602 0 0 1602 0 0
4 3 1496 0 0 1496 0 0
5 4 1462 0 0 70 0 0
6 4 1462 0 70 1462 0 1
7 5 1433 0 0 1433 0 0
Now some of our patients have two rows in the dataset corresponding to intervals where they have a different value of our new variable, agvhd
. For example, Patient 1 now has two rows with a agvhd
value of zero from time 0 to time 67, and a value of 1 from time 67 to time 2081.
Cox regression with time-dependent covariates
Now that we’ve reshaped our data and added the new time-dependent aghvd
variable, let’s fit a simple single variable cox regression model. We can use the same coxph()
function as before, we just need to change our Surv()
function to specify both the start and stop time for each interval using the time1 =
and time2 =
arguments.
= coxph(
bmt_td_model Surv(time = tstart, time2 = tstop, event = death) ~ agvhd,
data = td_dat
)
summary(bmt_td_model)
Call:
coxph(formula = Surv(time = tstart, time2 = tstop, event = death) ~
agvhd, data = td_dat)
n= 163, number of events= 80
coef exp(coef) se(coef) z Pr(>|z|)
agvhd 0.3351 1.3980 0.2815 1.19 0.234
exp(coef) exp(-coef) lower .95 upper .95
agvhd 1.398 0.7153 0.8052 2.427
Concordance= 0.535 (se = 0.024 )
Likelihood ratio test= 1.33 on 1 df, p=0.2
Wald test = 1.42 on 1 df, p=0.2
Score (logrank) test = 1.43 on 1 df, p=0.2
Again, we’ll visualize our cox model results using the ggforest()
function from the survminer package.:
ggforest(bmt_td_model, data = td_dat)
As you can see from the forest plot, confidence interval, and p-value, there does not appear to be a strong association between death and acute graft-versus-host disease in the context of our simple model.
27.7 Resources
Survival Analysis Part I: Basic concepts and first analyses
Survival analysis in infectious disease research: Describing events in time
Chapter on advanced survival models Princeton
Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model