Survival data with directly observed failure times
Interval-censored data
- Modelling
  - Weibull distribution
- Quantifying predictive performance

These notes were written while working with colleagues on constructing and reporting predictive models using survival analysis. They describe how to to fit a survival model to a training dataset and evaluate its predictive performance on a test dataset.

Survival data with directly observed failure times

The likelihood of the hazard rate \(\mu\), given an observation of an event after an interval of length \(T\), can be modelled in three ways:

as an exponential likelihood given observed time to failure: \(\mu T e^{-\mu T}\).
as a Poisson likelihood given observation of one arrival in time \(T\), followed by censoring at first arrival. This is \(\mu T e^{-\mu T}\) as before.
in the limit of large \(N\), as a binomial likelihood given observation of \(N - 1\) subintervals with 0 events and 1 subinterval with an event: \(\left( 1 - p \right)^{N - 1}p\) where \(p = 1 - e^{-\mu T / N}\), ignoring the normalizing constant.

So directly observed failure times can be modelled either as times to failure with exponential likelihood, or as a counting process, in which failures arise from a Poisson arrival process and the counting is censored after the first arrival for each individual. The advantage of the counting process approach is that we can use a generalized linear model with logarithmic link function and Poisson likelihood. There are many advantages to working in the framework of generalized linear models.

The only advantage of using survival regression with survreg is that we can model other distributions of time to failure such as the Weibull distribution. But where we have directly observed failure times, it is usually more useful to model directly how the hazard rate depends on with time since baseline by reformatting the data as short person-time intervals and specifying time since baseline as a time-updated covariate.

The binomial likelihood could be used to fit the model to data with many short time intervals, though this would be rather inefficient. The real usefulness of this formulation is to assess the predictive performance as described below.

Where a cohort is being followed over several years, we will usually format the data as person-time intervals of equal length to allow the dependence of the hazard rate on time since baseline to be modelled by including a time-updated covariate term. The baseline hazard function is the hazard rate as a function of time since follow-up, with all other covariates set to zero.

Cox regression eliminates the baseline hazard function by stratifying the follow-up time into intervals defined by the time of the next event, and maximizing the likelihood conditional on the number of events in each interval. Cox regression is identical to conditional logistic regression.
Cox regression is useful where we are not interested in the form of the baseline hazard function and do not want to have to make any assumptions about it, for instance in analysis of clinical trials. But for risk prediction we need the baseline hazard function, not just the rate ratios.

A common practice in constructing disease prediction models (for instance for the QRISK score for cardiovascular risk) is first to fit a Cox regression model, then to fit the baseline hazard function. There is no obvious advantage to this two-step procedure over fitting a Poisson regression in a single step. The historical basis for this two-step procedure may be that Cox regression was an established statistical procedure when the Framingham risk prediction models were constructed in the 1980s, but generalized linear models (of which Poisson regression is a special case) were not widely taught to statisticians until the 1990s.

Model diagnostics for Cox regression models include a test of the “proportional hazards assumption” that the log rate ratios do not vary with the baseline hazard function. In a Poisson regression model with a time-updated covariate that encodes time since baseline, we can simply test for interaction of this covariate with the other covariates, and include interaction terms where they improve the fit of the model or its predictive performance.

R functions for survival analysis

Three functions can be used to fit a time to failure model as a Poisson arrival process:

glm with family=poisson, counts as outcome and follow-up time encoded as an offset term. An offset term in a linear model is a term for which the coefficient is fixed at 1.
glm with family=quasipoisson, rates as outcome, and follow-up time as weights.

Time to event can be modelled as a dependent variable with exponential likelihood using survival regression

survreg with dist=exponential, and follow-up time as dependent variable.

All three methods give the same parameter estimates. The family=quasipoisson method gives standard errors of parameters that are much larger than the correct value, though the deviance is identical to that obtained with the family=poisson method. The signs of the parameter estimates obtained survreg are reversed because the dependent variable is the waiting time to the first event rather than the arrival rate.

Simulating a survival dataset

For this exercise, we simulate a dataset with exponentially distributed failure times (equivalent to a Poisson arrival process), and exponentially distributed exit times representing censoring from other causes. The model has two covariates: a numeric variable x1 and a categoric variable x2_ with three levels.

Where failure times are directly observed, individuals are censored at whichever is the earler of time at first failure or time at first exit. For interval-censored data, individuals are examined at intervals of fixed length, and remain under observation until the end of the last interval in which they were observed at baseline.

rm(list=ls())

library(caret)
library(pROC)
library(survival)
library(gridExtra)
library(pander)
library(icenReg)
library(data.table)
library(survival)
library(wevid)

hazardrate.poisson <- function(model.poisson, Xmatrix) {
    ## returns hazard rates using coeffs from model.poisson, covariates from matrix Xmatrix, which has been generated by model.matrix to add a column of ones and to recode categoric variables as indicator variables
    beta <- matrix(model.poisson$coefficients, ncol=1)
    xbeta <- as.numeric(Xmatrix %*% beta) # linear predictor
    return(exp(xbeta))
}

to.persontime <- function(y, X, tobs, interval.length) {# returns person-time data frame
    ## for each person we have vector variables y, tobs and data frame X
    survdata <- NULL
    for(i in 1:length(y)) {
        num.intervals <- ceiling(tobs[i] / interval.length)
        tobs.intervals <- rep(interval.length, num.intervals)
        tobs.icens <- tobs.intervals # for interval-censored data, last interval rounded up 
        ## for directly observed failure times, keep exact failure time for last interval
        if(tobs[i] < sum(tobs.intervals)) {
            tobs.intervals[num.intervals] <- tobs[i] %% interval.length
        }
        
        y.interval <- rep(0, num.intervals)
        y.interval[num.intervals] <- y[i]
        Xrow <- X[i, ]
        Xrows <- Xrow[rep(1, num.intervals), ]
        ## ic_par requires l=0 for y=1, r=Inf for y=0
        survdata <- rbind(survdata,
                      data.table(id=rep(i, num.intervals),
                                 tobs=tobs.intervals,
                                 tobs.icens=tobs.icens,
                                 l.time=ifelse(y.interval, 0, tobs.icens),
                                 u.time=ifelse(y.interval,
                                                 tobs.icens, Inf),
                                 y=y.interval,
                                 Xrows))
  }
  return(survdata)                       
}

plot.calibration <- function(observed, expected) {
    riskdecile.group <- cut(expected, quantile(x=expected, probs=seq(0, 1, by=0.1)))
    expected.riskdecile <- tapply(expected, riskdecile.group, sum)
    observed.riskdecile <- tapply(observed, riskdecile.group, sum)
    calib <- data.table(x=expected.riskdecile, y=observed.riskdecile)
    max.xy <- max(c(calib$x, calib$y))

    p <- ggplot(data=calib, aes(x=x, y=y)) + geom_point() +
        geom_abline(intercept=0, slope=1, linetype="dotted") +  
        xlim(c(0, max.xy)) +  
        ylim(c(0, max.xy)) +
        xlab(stringr::str_wrap("Expected events by decile of predicted risk", 35)) +
        ylab("Observed events")
    return(p)
}

plot.calibration.cumulative <- function(table.bins) {
    
    calib <- data.frame(x=1:10, y=as.integer(table.bins[1, ]))
    expected.bin <- 0.1 * sum(as.numeric(table.bins[2, ]))
    p <- ggplot(data=calib, aes(x=x, y=y)) + geom_point() +
        geom_abline(intercept=expected.bin, slope=0, linetype="dotted") +
        geom_ribbon(aes(ymin=qpois(p=0.25, lambda=expected.bin),
                        ymax=qpois(p=0.75, lambda=expected.bin)), fill="grey", alpha=0.3) +
        scale_x_continuous(breaks=1:10) +
        xlab(stringr::str_wrap(
            paste("Ranking of predicted risk grouped into 10 bins of",
                  round(expected.bin, 1), "expected events"),
            40)) +
        ylab("Observed events")
    return(p)
}

table.calibration.cumulative <- function(observed, expected) {
    obsexp <- data.table(observed, expected)
    setkey(obsexp, expected)
    obsexp[, cumulative.exp := cumsum(expected)]
    cumulativeexp.group <- cut(obsexp$cumulative.exp,
                               sum(obsexp$expected) * seq(0, 1, by=0.1))
    observed.group <- tapply(obsexp$observed, cumulativeexp.group, sum)
    calib <- data.frame(x=1:10, y=observed.group)
    expected.bin <- 0.1 * sum(obsexp$expected)
    table.bins <- as.data.frame(rbind(observed.group, rep(expected.bin, 10)))
    colnames(table.bins) <- 1:10
    rownames(table.bins) <- c("Observed", "Expected")
    return(table.bins)
}

test.calibration.table <- function(table.bins) {
    obs <- table.bins[1, ]
    exp <- table.bins[2, ]
    df <- ncol(table.bins)
    teststat.chisq <- sum((obs - exp) ^2 / exp)
    pvalue <- 1 - pchisq(q=teststat.chisq, df=df)
    return(c(teststat.chisq=teststat.chisq, df=df,pvalue=pvalue)) 
}

    
## simulate failure-time dataset with 2 covariates
N <- 2000
alpha <- -5
beta <- matrix(c(alpha, 1.5, -0.25, -0.5), ncol=1)  # coefficients (log rate ratios)
N.years <- 10 # censor follow-up at N.years 


x1 <- rnorm(N) 
x2_ <- as.factor(1 + rbinom(N, 2, 0.2)) # categoric with 3 levels
X <- data.table(x1, x2_)

Xmatrix <- model.matrix(object= ~ x1 + x2_, data=X)

mu <- exp(alpha + as.numeric(Xmatrix %*% beta)) # vector of hazard rates

tfail     <- 1 + rexp(N, rate=mu) # time to failure
tobs.exit <- 1 + rexp(N, rate=0.75 * mean(mu)) # time to exit from other cause
tobs.exit[tobs.exit > N.years * 365.25] <- N.years * 365.25 

direct.obs <- data.table(serialno=1:N, X)

## tobs before fail or exit
direct.obs[, tobs := ifelse(tfail < tobs.exit, tfail, tobs.exit)] 
direct.obs[, y := as.integer(tfail <= tobs.exit)]

cat("Simulation generates dataset of", N, "individuals followed for average",
    sum(direct.obs$tobs) / N, "days with", sum(direct.obs$y),"events\n")

## Simulation generates dataset of 2000 individuals followed for average 2755.583 days with 409 events

## test-train split
N.train <- floor(N * 2 / 3)
direct.obs.train <- direct.obs[1:N.train, ]
direct.obs.test  <- direct.obs[(N.train + 1):N, ]

## poisson model with offset 
poisson.model <- with(direct.obs.train,
                      glm(formula = y ~ offset(log(tobs)) + x1 + x2_,
                          family=poisson))
print(summary(poisson.model))

## 
## Call:
## glm(formula = y ~ offset(log(tobs)) + x1 + x2_, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9726  -0.5649  -0.3277  -0.1198   3.5911  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.04226    0.10350 -97.025   <2e-16 ***
## x1            1.48213    0.07563  19.598   <2e-16 ***
## x2_2         -0.18981    0.13315  -1.426    0.154    
## x2_3         -0.17897    0.31049  -0.576    0.564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1438.5  on 1332  degrees of freedom
## Residual deviance: 1014.2  on 1329  degrees of freedom
## AIC: 1572.2
## 
## Number of Fisher Scoring iterations: 6

## qpoisson model gives same parameter estimates as poisson model with offset, but larger standard errors
qpoisson.model <- with(direct.obs.train,
                       glm(formula = y/tobs ~ x1 + x2_, family=quasipoisson,
                           weights=tobs))
print(summary(qpoisson.model))

## 
## Call:
## glm(formula = y/tobs ~ x1 + x2_, family = quasipoisson, weights = tobs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9726  -0.5649  -0.3277  -0.1198   3.5911  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.0423     0.2539 -39.547  < 2e-16 ***
## x1            1.4821     0.1856   7.988 2.95e-15 ***
## x2_2         -0.1898     0.3267  -0.581    0.561    
## x2_3         -0.1790     0.7618  -0.235    0.814    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 6.019826)
## 
##     Null deviance: 1438.5  on 1332  degrees of freedom
## Residual deviance: 1014.2  on 1329  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 12

Survival regression with exponential likelihood is identical to Poisson regression censored at first event

surv.model <- with(direct.obs.train, survreg(Surv(time=tobs, event=y) ~ x1 + x2_,
                                             dist="exponential"))
print(summary(surv.model))

## 
## Call:
## survreg(formula = Surv(time = tobs, event = y) ~ x1 + x2_, dist = "exponential")
##               Value Std. Error      z      p
## (Intercept) 10.0423     0.1035  97.02 <2e-16
## x1          -1.4821     0.0756 -19.60 <2e-16
## x2_2         0.1898     0.1332   1.43   0.15
## x2_3         0.1790     0.3105   0.58   0.56
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -2674.7   Loglik(intercept only)= -2886.8
##  Chisq= 424.24 on 3 degrees of freedom, p= 1.2e-91 
## Number of Newton-Raphson Iterations: 6 
## n= 1333

## set length of person-time intervals to be used for reformatting the data
interval.length <- 365
interval.short <- 28
interval.shorter <- 7

Survival regression with survreg can also be used to model a Weibull distribution for time to failure. But where we have directly observed failure times, it is usually more useful to model how the hazard rate varies with time since baseline by reformatting the data as short person-time intervals and specifying time since baseline as a time-updated covariate.

survdata.train <- with(direct.obs.train,
                       to.persontime(y, X=data.table(x1, x2_), tobs, interval.length))

survdata.test <- with(direct.obs.test,
                       to.persontime(y, X=data.table(x1, x2_), tobs, interval.length))

poisson.model.surv <- with(survdata.train,
                           glm(formula = y ~ offset(log(tobs)) + x1 + x2_, family=poisson))

print(summary(poisson.model.surv))

## 
## Call:
## glm(formula = y ~ offset(log(tobs)) + x1 + x2_, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8703  -0.2334  -0.1400  -0.0789   4.3283  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.04226    0.10350 -97.024   <2e-16 ***
## x1            1.48213    0.07563  19.598   <2e-16 ***
## x2_2         -0.18981    0.13315  -1.426    0.154    
## x2_3         -0.17897    0.31049  -0.576    0.564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2592.4  on 11089  degrees of freedom
## Residual deviance: 2168.2  on 11086  degrees of freedom
## AIC: 2726.2
## 
## Number of Fisher Scoring iterations: 7

The original training dataset of 1333 individuals is reformatted as 11090 person-time intervals of length up to 365 days. There are 3664361 person-days of follow-up. The model fitted to this reformatted dataset is identical to that fitted to the original dataset. The reduction in deviance from the null model is the same with both approaches, but the null deviance is higher. This is because normalizing constants are ignored.

The fitted values generated by glm() with Poisson likelihood are the expected numbers of failures up to the end of each person time interval, allowing for censoring. On the training data, the sum of the observed values is 275 and the sum of the fitted values is 275; this agreement between observed and fitted averages of the response variable is guaranteed when we maximize the likelihood to fit a model that has a likelihood in the exponential family.

Quantifying predictive performance and calibration

With a survival dataset reformatted as person-time intervals of fixed length, the predictive performance of the model can be evaluated by comparing “case” person-time intervals in which an event occurs with “noncase” person-time intervals in which no event occurs. The problem is that some of these person-time intervals are censored, either at the first event or at exit from other causes. Thus the expected number of events in each person-time interval, calculated allowing for censoring at first event, does not rank individuals by risk. As Uno et al (2011) discuss, the classical C-statistic, which uses only the ranking of risk scores, cannot be used in this situation. Similar problems arise with calibration plots, as discussed by Crowson et al (2016) and Austin et al (2020).

The fundamental problem is that because person-time intervals of unequal length are not exchangeable, we cannot represent the response variable (case / noncase) on these intervals as a mixture of independent and identically-distributed Bernoulli variables conditional on the risk scores.

Demonstrating the problem: discrimination and calibration using fitted values for censored interval lengths

For simplicity I demonstrate this on the training dataset, where the observed and predicted averages of the response variable are guaranteed to agree perfectly. If we use the fitted values \(\lambda\) (predicted numbers of events for each person-time interval) from the model to calculate a C-statistic, we get a rather low value. If we group the fitted values into deciles and plot observed against expected number of events by decile as in the plot below on the left, we see an excess of observed events in the lowest deciles and a deficity of observed events in the highest deciles. This is because in those person-time intervals where events have occurred, the observation time is right-censored so that the expected number of events is lower. The plot below on the right shows this mismatch in a different way, by ranking the person-time intervals by fitted values and grouping them into bins with equal numbers of expected events. The table that generates this plot can also be used to construct a test for miscalibration as a chi-square test with 10 degrees of freedom.

Cstat.censored <- pROC::auc(response=survdata.train$y,
                            predictor=poisson.model.surv$fitted.values)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

cat("C-statistic using fitted values calculated from censored person-time intervals:",
    round(Cstat.censored, 2), "\n")

## C-statistic using fitted values calculated from censored person-time intervals: 0.65

plot.obsexp <- plot.calibration(observed=survdata.train$y,
                                expected=poisson.model.surv$fitted.values)
table.bins <- table.calibration.cumulative(observed=survdata.train$y,
                            expected=poisson.model.surv$fitted.values)
plot.bins <- plot.calibration.cumulative(table.bins)

grid.arrange(plot.obsexp, plot.bins, nrow=1, top="Calibration plot using fitted values")

We can’t use this either to calculate the \(C\)-statistic or for a calibration plot. What happens if we ignore the censoring, and calculate the expected events as if individuals were not censored at first event but survive to the end of the interval? First we need to compute the fitted hazard rates \(\boldsymbol{\mu}\) from the covariate matrix \(\boldsymbol{X}\) and the vector \(\boldsymbol{\beta}\) of fitted coefficients. For this we use the R function model.matrix() which adds a column of ones for the intercept term and recodes each categoric variable with \(K\) levels as \(K - 1\) indicator variables.

Xmatrix.train <- model.matrix(object= ~ x1 + x2_, data=survdata.train)

print(head(survdata.train, 4))

##    id tobs tobs.icens l.time u.time y         x1 x2_
## 1:  1  365        365    365    Inf 0 -0.6611733   1
## 2:  1  365        365    365    Inf 0 -0.6611733   1
## 3:  1  365        365    365    Inf 0 -0.6611733   1
## 4:  1  365        365    365    Inf 0 -0.6611733   1

print(head(Xmatrix.train, 4))

##   (Intercept)         x1 x2_2 x2_3
## 1           1 -0.6611733    0    0
## 2           1 -0.6611733    0    0
## 3           1 -0.6611733    0    0
## 4           1 -0.6611733    0    0

## hazard rates for each individual are calculated using coefficents fitted to training data
survdata.train[, hrate:= hazardrate.poisson(model.poisson=poisson.model.surv,
                                            Xmatrix=Xmatrix.train)]

We then compute the probability of failure by the end of the interval as \(1 - \exp{\left( -\mu t \right) }\)

survdata.train[, p := 1 - exp(-hrate * interval.length)] # probability of event before end of interval

Cstat.uncensored <- with(survdata.train, pROC::auc(response=y, predictor=p))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

cat("C-statistic using fitted values calculated from uncensored person-time intervals:",
    round(Cstat.uncensored, 2), "\n")

## C-statistic using fitted values calculated from uncensored person-time intervals: 0.82

plot.obsexp <- with(survdata.train, plot.calibration(observed=y, expected=p))
table.bins <- with(survdata.train, table.calibration.cumulative(observed=y, expected=p))
plot.bins <- plot.calibration.cumulative(table.bins)
test.calib <- test.calibration.table(table.bins)
cat("Test for miscalibration ignoring censoring within intervals of length", interval.length,
    "gives chi-square statistic", round(test.calib[1], 2),
    "with", test.calib[2], "degrees of freedom, p =", signif(test.calib[3], 1), "\n")

## Test for miscalibration ignoring censoring within intervals of length 365 gives chi-square statistic 11.55 with 10 degrees of freedom, p = 0.3

grid.arrange(plot.obsexp, plot.bins, nrow=1, top=paste("Calibration plot ignoring censoring within intervals of length", interval.length, "days"))

With censoring ignored, the C-statistic and the calibration plot look better, but the sum of the observed events is 275 and the sum of the expected events is 294.3. This discrepancy between observed and expected events is because we have ignored censoring before the end of the person-time intervals.

If we ignore censoring and calculate the probability of an event before the end of a person-time interval of fixed length, the observations are exchangeable. This however inflates the total person-time of observation, so that the total expected events no longer equates to the total observed events; the “calibration in the large” cannot be evaluated.

Complicated methods have been proposed to overcome this problem. A simpler approach is to reformat the dataset with many short person-time intervals. In the limit of infinitesimally short person-time intervals, the Poisson and binomial likelihoods are equivalent and censoring within person-time intervals can be ignored. We can then use standard methods to evaluate the predictive performance and calibration of a binary classifier, as in a case-control study.

There is no need to refit the model: we can just re-use the coefficients from the model fitted to the dataset with longer person-time intervals. This model is correct: the problem is just to quantify how well it performs. We’ll start by reducing the interval length from 365 to 28 days.

## reformat dataset with short time intervals to generate X matrix
survdata.train.short <- with(direct.obs.train,
                             to.persontime(y, X=data.table(x1, x2_), tobs, interval.short))

Xmatrix.train.short <- model.matrix(object= ~ x1 + x2_, data=survdata.train.short)

## use coefficients from model already fitted to calculate hazard rates 
survdata.train.short[, hrate := hazardrate.poisson(model.poisson=poisson.model.surv,
                                                   Xmatrix=Xmatrix.train.short)]

## calculate probability of event before end of interval
survdata.train.short[, p := 1 - exp(-hrate * interval.short)]

survdata.train.short[, prior.probs := rep(mean(survdata.train.short$y),
                                          nrow(survdata.train.short))]

## calculate binomial log-likelihood of null model and fitted model
loglik.null <- with(survdata.train.short,
                    sum(y * log(prior.probs) + (1 - y) * log(1 - prior.probs)))

loglik.train <- with(survdata.train.short,
                     sum(y * log(p) + (1 - y) * log(1 - p)))

cat("Binomial log-likelihood given training data formatted as intervals of length",
    interval.short, "days:", round(loglik.null, 2), "for null model,",
    round(loglik.train, 2), "for full model\n")

## Binomial log-likelihood given training data formatted as intervals of length 28 days: -1971.61 for null model, -1759.25 for full model

plot.obsexp <- with(survdata.train.short, plot.calibration(observed=y, expected=p))
table.bins <- with(survdata.train.short, table.calibration.cumulative(observed=y, expected=p))
plot.bins <- plot.calibration.cumulative(table.bins)
test.calib <- test.calibration.table(table.bins)
cat("Test for miscalibration with intervals of length", interval.short,
    "gives chi-square statistic", round(test.calib[1], 2),
    "with", test.calib[2], "degrees of freedom, p =", signif(test.calib[3], 1), "\n")

## Test for miscalibration with intervals of length 28 gives chi-square statistic 8.58 with 10 degrees of freedom, p = 0.6

grid.arrange(plot.obsexp, plot.bins, nrow=1, top=paste("Calibration plot ignoring censoring within intervals of length", interval.short, "days"))

The training dataset reformatted with intervals of length 28 days still has 3664361 person-days of follow-up but now has 131572 observations. With censoring within short intervals ignored, the sum of the expected events is 275.87 compared with 275 observed events.

Multiplying the difference in binomial log-likelihood between the null model and the full model by minus 2 gives a close approximation to the difference in deviance between the null model and the full model calculated from the Poisson log-likelihood. With short enough person-time intervals, this would be exact.

Model comparison

Model comparison can be based either on the log-likelihood given the test data, or on the log-likelihood given the training data with a penalty for the number of parameters. The Akaike Information Criterion is defined as \(2k - 2 L_{\mathrm{train}}\) where \(k\) is the effective number of parameters and \(L_{\mathrm{train}}\) is the log-likelihood given the training data. In the limit of leave-one-out cross-validation, model choice based on using the Akaike Information Criterion with the training data and model choice based on the test log-likelihood are asymptotically equivalent. One of the advantages of using the test log-likelihood is that it works even when we cannot specify the effective number of parameters: for instance in a hierarchical mixture model.

Differences in the test log-likelihood can be interpreted directly as weight of evidence favouring one model over another. Reviewers however often ask for \(p\)-values to be provided. For a comparison of two nested models differing by \(k\) extra parameters, the asymptotic equivalence of the AIC and the test log-likelihood implies that a difference in test log-likelihood of \(\Delta L\) natural log units is equivalent to a likelihood ratio chi-square test statistic of \(2 \left( \Delta L + k \right)\).

Evaluating discrimination performance

For a binary classifier the discrimination performance is traditionally evaluated as the \(C\)-statistic (area under the ROC curve). A limitation of the \(C\)-statistic is that increments of the statistic are not easily interpretable as quantifying the information gained by adding a new predictor.

I have proposed reporting the expected information for discrimination (expected weight of evidence) instead, and plotting the distributions of weights of evidence favouring case over control status in case and control person-time intervals. One of the advantages of working in this framework is that the incremental contributions of independent variables to the expected information for discrimination are additive. Functions for calculating an plotting the distributions of weights of evidence are available in the R package wevid. This tutorial describes how to install it and how to use it. Below is a demonstration of its use on the test dataset.

## prepare test dataset with short person-time intervals
survdata.test <- with(direct.obs.test,
                                to.persontime(y, data.frame(x1, x2_), tobs, interval.short))
Xmatrix.test <- model.matrix(object= ~ x1 + x2_, data=survdata.test)

## use model fitted to training data to calculate hazard rates on test data
survdata.test[, hrate := hazardrate.poisson(model.poisson=poisson.model.surv,
                                            Xmatrix=Xmatrix.test)]
## calculate event probabilities from hazard rates 
survdata.test[, p := 1 - exp(-hrate * interval.short)]  


## to calculate the expected weight of evidence we need prior and posterior probabilities
## use training data to calculate prior probs 
survdata.test[, prior.probs := rep(mean(survdata.train.short$y), nrow(survdata.test))]

testdata.densities <- with(survdata.test,
                           Wdensities(y=y, posterior.p=p, p=prior.probs))

## Density with 2 mixture components chosen by BIC

pander(summary(testdata.densities), table.style="multiline",
       split.cells=c(5, 5, 5, 5, 5, 5, 5), split.table="Inf",
       caption="Failures directly observed")

Failures directly observed
Cases / controls	Crude C-statistic	Model-based C-statistic	Crude Λ (bits)	Model-based Λ (bits)	Test log-likelihood (nats)	log-likelihood after recalibration (nats)
134 / 66170	0.812	0.811	1.15	1.16	-864.9	-864.5

## calculate binomial log-likelihood of null model and fitted model
loglik.null <- with(survdata.test,
                     sum(y * log(prior.probs) + (1 - y) * log(1 - prior.probs)))

loglik.test <- with(survdata.test,
                     sum(y * log(p) + (1 - y) * log(1 - p)))

cat("Binomial log-likelihood given test data formatted as intervals of length",
    interval.short, "days:", loglik.null, "for null model,", loglik.test, "for full model\n")

## Binomial log-likelihood given test data formatted as intervals of length 28 days: -965.2995 for null model, -864.9292 for full model

Interval-censored data

Interval-censored survival data, when the time to failure is not observed directly but is known only to lie within an interval, often arise in epidemiology. An example is a screening programme for disease, where the screening examinations show only whether disease has arisen since the last examination. Assuming a Poisson likelihood for the hazard rate, the data can be modelled with a generalized linear model with a complementary log-log link function.

Modelling

To fit such a model to training data and evaluate its predictive performance on test data, we have to specify an offset term to allow for the time between examinations. We fit the model to training data with an offset term, use this model to compute the linear predictor in the test data, then calculate the failure probability for each time interval by adding the offset term and inverting the link function.

## fit model with predictor variable x
cloglog.model <- with(survdata.train,
                      glm(formula = y ~ offset(log(tobs.icens)) + x1 + x2_,
                          family=binomial(link=cloglog)))
print(summary(cloglog.model))

## 
## Call:
## glm(formula = y ~ offset(log(tobs.icens)) + x1 + x2_, family = binomial(link = cloglog))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8499  -0.2349  -0.1435  -0.0890   3.7624  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.13393    0.10405 -97.393   <2e-16 ***
## x1            1.50410    0.07751  19.406   <2e-16 ***
## x2_2         -0.18978    0.13336  -1.423    0.155    
## x2_3         -0.18243    0.31148  -0.586    0.558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2576.5  on 11089  degrees of freedom
## Residual deviance: 2144.4  on 11086  degrees of freedom
## AIC: 2152.4
## 
## Number of Fisher Scoring iterations: 7

Weibull distribution

With long intervals, we might wish to examine whether a Weibull distribution might fit the interval-censored data better.

surv.model.ic <- ic_par(Surv(time=l.time, time2=u.time, type="interval2") ~ x1 + x2_,
                        model="ph",
                        dist="exponential",
                        data=survdata.train)
print(summary(surv.model.ic))

## 
## Model:  Cox PH
## Dependency structure assumed: Independence
## Baseline:  exponential 
## Call: ic_par(formula = Surv(time = l.time, time2 = u.time, type = "interval2") ~ 
##     x1 + x2_, data = survdata.train, model = "ph", dist = "exponential")
## 
##           Estimate  Exp(Est) Std.Error  z-value      p
## log_scale  10.4300 3.396e+04   0.10210 102.2000 0.0000
## x1          1.5040 4.500e+00   0.07656  19.6500 0.0000
## x2_2       -0.1898 8.271e-01   0.13280  -1.4290 0.1530
## x2_3       -0.1824 8.332e-01   0.30930  -0.5898 0.5553
## 
## final llk =  -1072.206 
## Iterations =  36

Quantifying predictive performance

To quantify predictive performance with interval-censored survival data, we cannot resort to reformatting the dataset as many short person-time intervals of equal length, because the failure times are not directly observed. Although we cannot therefore use standard methods for evaluating the performance of a binary classifier, we can still calculate the weight of evidence. For this we calculate the prior log odds from a null model containing only the intercept and the offset term for the length of the interval. We subtract this from the posterior log odds obtained from prediction with a model containing the predictor variables, to obtain the weight of evidence. We can then report the average weight of evidence as a summary measure of predictive performance, and the distributions of the weight of evidence in cases and noncases to examine the usefulness of the predictor as a risk stratifier.

## fit null model with cloglog link function 
cloglog.baseline <- with(survdata.train,
                         glm(formula = y ~ offset(log(tobs.icens)),
                             family=binomial(link=cloglog)))
# intercept parameter is log hazard rate
intercept <- summary(cloglog.baseline)$coefficients[1]
prior.failure.prob <- 1 - exp(-exp(intercept) * survdata.test$tobs)

## linear predictor includes the offset term
eta.pred.test <- predict.glm(object=cloglog.model, newdata=survdata.test, type="link")
cloglog.failure.prob <- 1 - exp(-exp(eta.pred.test))

## compute test log-likelihood
test.loglik <- sum(survdata.test$y * log(cloglog.failure.prob) +
                   (1 - survdata.test$y) * log(1 - cloglog.failure.prob))

## compute average weight of evidence
pander(summary(Wdensities(y=survdata.test$y,
                          posterior.p=cloglog.failure.prob,
                          prior.p = prior.failure.prob)),
       table.style="multiline",
       split.cells=c(5, 5, 5, 5, 5, 5, 5), split.table="Inf",
       caption="Interval-censored failure times")

## Density with 2 mixture components chosen by BIC

Interval-censored failure times
Cases / controls	Crude C-statistic	Model-based C-statistic	Crude Λ (bits)	Model-based Λ (bits)	Test log-likelihood (nats)	log-likelihood after recalibration (nats)
134 / 66170	0.812	0.798	1.79	1.04	-866	-864.5

Notes on fitting survival models and evaluating predictive performance (work in progress, being updated)

Paul McKeigue

9 December 2020