gbm
facilitates the creation of boosted Cox proportional
hazards models, a particularly useful feature when dealing with survival
data. The package can handle two types of survival object as the
response, namely a right censored or counting survival object. Both of
these objects can be created using the Surv
command found
in the survival
package.
Right censored survival data consists of a time to event number and
the event indicator - 0 if no event has taken place and 1 if the event
has happened. On the other hand, counting survival data contains start
and stop times along with an event indicator again indicating if an
event has taken place in that period. The data may be organised into
strata and this should be passed to gbm_dist
on creation of
the CoxPHGBMDist
object - see the “Model Specific
Parameters” vignette for more details. The dataset considered here is
provided by the survival
package.
Now to create the underlying boosted model the training parameters
need to be defined and gbmt
called. In this instance the
data has observation ids associated with it and so it is necessary to
create specific GBMTrainParams
objects rather than relying
on the defaults.
# Set-up training parameters
params_right_cens <- training_params(num_trees = 2000, interaction_depth = 3,
id=right_cens$id,
num_train=round(0.5 * length(unique(right_cens$id))) )
params_start_stop <- training_params(num_trees = 2000, interaction_depth = 3,
id=start_stop$id,
num_train=round(0.5 * length(unique(start_stop$id))) )
# Call to gbmt
fit_right_cens <- gbmt(Surv(tstop, status)~ age + sex + inherit +
steroids + propylac, data=right_cens,
distribution = right_cens_dist,
train_params = params_right_cens, cv_folds=10,
keep_gbm_data = TRUE)
fit_start_stop <- gbmt(Surv(tstart, tstop, status)~ age + sex + inherit +
steroids + propylac, data=start_stop,
distribution = start_stop_dist,
train_params = params_start_stop, cv_folds=10,
keep_gbm_data = TRUE)
# Plot performance
best_iter_right <- gbmt_performance(fit_right_cens, method='test')
best_iter_stop_start <- gbmt_performance(fit_start_stop, method='test')
During the fitting process the original strata vector is updated in
the following way. When the data is split into a training and validation
set the strata vector is also split. The strata vector is then updated
so as represent the cumulative count of the number of observations in
each stratum in the training and validation sets. The vector is padded
with NAs
so it is of the same length as the original strata
vector provided and such that the validation set cumulative strata sums
are separated from the training set strata counts by the appropriate
amount.
The original strata vector is stored within the GBMFit
object and can be accessed as follows:
fit$distribution$original_strata_id
. The data in the
original_strata_id
field is used to recreate the correct
strata when performing additional iterations using
gbm_more
.
The ties
and prior_node_coeff_var
parameters may also be specified on construction of the
CoxPHGBMDist
object. The former is a string specifying the
method by which the algorithm deals with tied event times. This may be
set to either “breslow” or “efron” depending on your preference, with
the latter being the default. The role of the
prior_node_coeff_var
parameter is slightly more subtle and
complex. When fitting a boosted tree, the optimal predictions of the
terminal nodes must be set. These predictions determine the predictions
made by the GBMFit
object. The role of
prior_node_coeff_var
is to ensure that the predictions are
finite and it does this by acting as a regularization for the terminal
node predictions. It should be a finite positive double and is by
default set to a 1000. An exact description of its role in the
underlying algorithm is described in the next section.
# Example using Breslow and Efron tie-breaking
# Create data
require(survival)
set.seed(1)
N <- 3000
X1 <- runif(N)
X2 <- runif(N)
X3 <- factor(sample(letters[1:4],N,replace=T))
mu <- c(-1,0,1,2)[as.numeric(X3)]
f <- 0.5*sin(3*X1 + 5*X2^2 + mu/10)
tt.surv <- rexp(N,exp(f))
tt.cens <- rexp(N,0.5)
delta <- as.numeric(tt.surv <= tt.cens)
tt <- apply(cbind(tt.surv,tt.cens),1,min)
# throw in some missing values
X1[sample(1:N,size=100)] <- NA
X3[sample(1:N,size=300)] <- NA
# random weights if you want to experiment with them
w <- rep(1,N)
data <- data.frame(tt=tt,delta=delta,X1=X1,X2=X2,X3=X3)
# Set up distribution objects
cox_breslow <- gbm_dist("CoxPH", ties="breslow", prior_node_coeff_var=100)
cox_efron <- gbm_dist("CoxPH", ties="efron", prior_node_coeff_var=100)
# Define training parameters
params <- training_params(num_trees=3000, interaction_depth=3, min_num_obs_in_node=10,
shrinkage=0.001, bag_fraction=0.5, id=seq(nrow(data)),
num_train=N/2, num_features=3)
# Fit gbm
fit_breslow <- gbmt(Surv(tt, delta)~X1+X2+X3, data=data, distribution=cox_breslow,
weights=w, train_params=params, var_monotone=c(0, 0, 0),
keep_gbm_data=TRUE, cv_folds=5, is_verbose = FALSE)
fit_efron <- gbmt(Surv(tt, delta)~X1+X2+X3, data=data, distribution=cox_efron,
weights=w, train_params=params, var_monotone=c(0, 0, 0),
keep_gbm_data=TRUE, cv_folds=5, is_verbose = FALSE)
# Evaluate fit
plot(gbmt_performance(fit_breslow, method='test'))
legend("topleft", c("training error", "test error", "optimal iteration"),
lty=c(1, 1, 2), col=c("black", "red", "blue"))
plot(gbmt_performance(fit_efron, method='test'))
legend("topleft", c("training error", "test error", "optimal iteration"),
lty=c(1, 1, 2), col=c("black", "red", "blue"))
The gbm
algorithm works to estimate, via tree boosting,
the function \(f(\textbf{x})\), which
maps covariates (\(\textbf{x}\)) to the
response variable y - in this case the event indicator. For
CoxPH
, the algorithm calculates both the partial log
likelihood and martingale residuals (\(\textbf{m}\)) using the following approach.
The algorithm walks backwards in time until it encounters the “stop”
time of an observation. When this happens the weighted risk associated
with that observation, \(\omega_i
e^{f(\textbf{x}_i)}\), is added to the total cumulative hazard:
\(S = \sum \omega_j
e^{f(\textbf{x}_j)}\), which is initialized at \(0\). Continuing backwards in time when we
reach a time before an observation was in the study, that is the
algorithm leaves the associated time segment (start, stop], the
observation’s contribution to the cumulative hazard is subtracted off.
The algorithm is robust to overflow/underflows occurring in \(e^{f(\textbf{x}_i)}\) by subtracting a
constant off of the risk score. This constant drifts to ensure overflow
does not occur.
This algorithm deals with tied event times using either the Breslow or Efron approximations. The method used is specified by the user but in the event of tied deaths, it defaults to the Efron approximation. It also allows for the introduction of strata and start as well as stop times for each observation, see the previous Sections.
As well as calculating the partial log likelihood the algorithm also calculates the martingale residuals. The risk scores are related to the covariate matrix, \(\mathbb{X}\), via: \[ f(\textbf{x}_i) = (\mathbb{X}\boldsymbol{\beta})_i. \qquad (1) \] The derivative of the partial log likelihood, \(l(\boldsymbol{\beta})\), with respect to the parameter vector \(\boldsymbol{\beta}\) is related to the martingale residuals through: \[\frac{\partial}{\partial \boldsymbol{\beta}} l(\boldsymbol{\beta}) = \mathbb{X}^{T} \textbf{m}. \qquad (2) \] Defining the loss function as the negative of the partial log likelihood then using the chain rule in combination with Equation (1) the residuals are given by: \[ z_i = -\frac{\partial}{\partial f(\textbf{x}_{\textit{i}})}\Psi(\textit{y}_{\textit{i}},f(\textbf{x}_\textit{i})) = (\mathbb{X}\mathbb{X}^{T}\textbf{m})_i. \qquad (3)\]
At this point the covariate matrix should only decide what splits the tree will make thus covariate matrix in Equation (3) is free to be set to the identity matrix and so: \[ z_i = \textbf{m}_i. \qquad (4)\]
Finally, the updated implementation calculates the optimal terminal
node predictions in the following way. Looping over the bagged
observations in the terminal node of interest the expected number of
events is given by: \(\sum_i \max(0.0, y_i -
\textbf{m}_i) + 1/c\). The constant \(c\) represents the prior on the baseline
number of events that occur within a given terminal node; it can be set
on construction of the CoxPHGBMDist
through the
prior_node_coeff_var
parameter. From this the terminal node
prediction is given by: \[ \log(\frac{\sum_i
y_i + 1/c}{\sum_i \max(0.0, y_i - \textbf{m}_i) + 1/c}). \qquad
(5)\]
If this prior_node_coeff_var
is set incorrectly, i. e.
to not a finite positive double, then the predictions of the fitted
model can become non-sensical.