The gbm package implements gradient tree boosting for a wide variety of statistical models and loss functions, including some commonly used in statistics, such as the Cox model, that are not usually associated with boosted models. When presented with data, the gbm package offers the user the ability to build predictive models and to explore the influence of different variables on the response, akin to a data mining or exploration task.
To get started a user must: collect their data in a data.frame, select a distribution for the response and specify the training parameters, as described in the following sections. Once these steps have been completed, a user can fit their gbm model by a call to gbmt and subsequently evaluate its performance, make predictions, fit additional trees and plot the relative influence of the various predictor variables.
To begin, this vignette will work with simulated data where the response obeys a Bernoulli distribution and is determined by 3 predictor variables. Every row of the data corresponds to an individual observation, and each observation is assigned a random weight. These weights determine the importance of the associated observation in the fit. At this stage it is also assumed that the fit will be made directly to the response and no offset is supplied. If the offset vector were non-zero, the gbm fit would be made to the observation response + offset rather than to the bare response. Currently the package supports responses which either consist of numerics or factors (i.e. Bernoulli), or are a survival object (more specifically a counting or right-censored survival object) in the case of Cox proportional hazards models. NB: predictor variables may have missing data (specified by NA) but the response must have no missing observations.
# Create some binomial response data - N observations
N <- 1000
X1 <- runif(N)
X2 <- runif(N)
X3 <- factor(sample(letters[1:4], N, replace=TRUE))
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
p <- 1/(1 + exp(-(sin(3*X1) - 4*X2 + mu)))
Y <- rbinom(N, 1, p) # response

# Normalized weights for each observation
w <- rexp(N)
w <- N*w/sum(w)

# Offset is set to 0
offset <- rep(0, N)

# Collate data
data <- data.frame(Y=Y, X1=X1, X2=X2, X3=X3)
The gbm package has a number of distributions available to the user. To view which distributions are currently available, simply call available_distributions().
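For example:
available_distributions()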
## [1] "AdaBoost" "Bernoulli" "CoxPH" "Gamma" "Gaussian" "Huberized"
## [7] "Laplace" "Pairwise" "Poisson" "Quantile" "TDist" "Tweedie"
From the list of available distributions, a default GBMDist object for each distribution can be created using the function gbm_dist and passing the desired distribution's name as follows:
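For example, the Bernoulli distribution object used later in this vignette (bern_dist) can be created with:
# Create a default Bernoulli GBMDist object by passing the distribution's name
bern_dist <- gbm_dist("Bernoulli")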
Certain distributions have distribution-specific parameters, such as the number of degrees of freedom associated with the distribution. These parameters are passed as arguments to gbm_dist; how this is done is described in another vignette entitled “model-specific-parameters”.
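As a purely illustrative sketch (the argument name df used here is an assumption; consult the “model-specific-parameters” vignette for the exact interface), a t-distribution object with a specified number of degrees of freedom might look like:
# Hypothetical: a TDist object with 4 degrees of freedom
# (the df argument name is an assumption - see the "model-specific-parameters" vignette)
t_dist <- gbm_dist("TDist", df = 4)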
Once the data has been imported and the distribution object created, the next task is to specify the training parameters for the gbm package. These parameters are passed to gbmt in the form of a GBMTrainParams object. This S3 object can be created using the training_params constructor, passing the appropriate data to its fields. The named parameters in the constructor are as follows:

- num_trees: the number of iterations to use in the fitting.
- interaction_depth: the maximum number of terminal nodes in each tree. The larger this is, the more likely overfitting is to occur; depths of approximately 3-10 are usually recommended.
- min_num_obs_in_node: the minimum number of observations that a node must have to be considered in fitting a tree. The default of 10 observations is usually adequate, but if your dataset is small this could be reduced.
- shrinkage: acts as regularization for additional iterations. The smaller the shrinkage, generally the better the performance of the fit; however, smaller shrinkage implies that the number of trees may need to be increased to achieve a given performance. In practice the default of 0.001 usually suffices.
- bag_fraction: the fraction of the training observations randomly sub-sampled to fit a tree in each iteration. This has the effect of reducing the variance of the boosted model. The default bag_fraction=0.5 is recommended.
- num_train: the number of observations to be used in training the model. This defaults to the minimum value such that the terminal nodes of a fitted tree can be created.
- id: a sequence of integers specifying the id of each observation in the data. This defaults to seq_len(num_train).
- num_features: the number of features randomly selected, on each iteration, from the data when growing a tree.

Once the parameters have been decided, the appropriate object is created, as sketched below. For further details on these parameters and what they represent, please see The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.
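As an illustration only (the values here are examples rather than recommendations, chosen for the simulated Bernoulli data above), the train_params object used later in this vignette could be constructed like this:
# Illustrative construction of the training parameters object used below;
# the chosen values are examples, not recommendations
train_params <- training_params(num_trees = 2000,
                                interaction_depth = 3,
                                min_num_obs_in_node = 10,
                                shrinkage = 0.005,
                                bag_fraction = 0.5,
                                num_train = N/2,
                                id = seq_len(nrow(data)),
                                num_features = 3)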
Before fitting our boosted model there are some other parameters to consider. The monotonicity of each predictor variable's relationship with the outcome variable can be specified through the var_monotone parameter. In this case the relationship is assumed to be arbitrary and the parameter is left unspecified. The names of the predictor variables may also be specified via var_names.
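For instance, with the three predictors in the Bernoulli example, an unconstrained specification would take the form of a vector with one entry per predictor, following the usual gbm convention of 1 for increasing, -1 for decreasing and 0 for arbitrary; it is not passed to the fit below, where the relationship is left arbitrary:
# Example form of a monotonicity specification: no constraints on X1, X2 or X3
# (1 = increasing, -1 = decreasing, 0 = arbitrary); not used in the fit below
var_monotone <- c(0, 0, 0)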
Other parameters relating to cross-validation of the model may also be provided. Firstly, the number of folds to be used in cross-validation is set by cv_folds; by default no cross-validation is performed by gbmt. An optional vector specifying which fold each observation belongs to can also be supplied through fold_id, and if the distribution selected is Bernoulli the cross-validation can be stratified across the response.
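For example, a custom assignment of the observations to five folds could be built and passed to gbmt via fold_id (this is not done in the fit below, which uses cv_folds alone):
# Randomly assign each of the N observations to one of 5 cross-validation folds
fold_id <- sample(rep(1:5, length.out = N))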
The core algorithm can be parallelized across numerous threads by passing the output of gbmParallel() to the par_details parameter. In this example more than one processing thread is not required, so the default is used.
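As a sketch, if parallel processing were wanted, a parallelization object could be created along these lines (the num_threads argument name is an assumption):
# Hypothetical: request two threads for the core algorithm
# (the num_threads argument name is an assumption)
par_details_2 <- gbmParallel(num_threads = 2)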
With the data defined, the distribution object created, and the training and additional parameters specified, it is now time to fit the model. This is done by passing these objects to gbmt along with the model formula.
# Create a gbm fit
fit <- gbmt(Y ~ X1 + X2 + X3, distribution=bern_dist, data=data, weights=w,
            offset=offset, train_params=train_params,
            cv_folds=5, keep_gbm_data=TRUE)
The keep_gbm_data flag indicates that the returned GBMFit object will contain the data passed to the fit within a GBMData object.
The process outlined above is slightly cumbersome. Thankfully, gbmt has a number of useful default behaviours. Firstly, the distribution will be “guessed” if none is provided: currently it can identify Bernoulli and Cox models, and otherwise it defaults to Gaussian. The weights parameter defaults to uniform across all data rows, and the offset likewise defaults to 0 for every row. The default training parameters specify 2000 trees, an interaction depth of 3, one row per observation, half the available data for training, a bag fraction of 0.5 and all of the predictors when growing each tree. Moreover, the weights are normalized by the routine if the distribution is not “Pairwise”.
# A default fit to gaussian data
# Create data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4], N, replace=TRUE))
X4 <- ordered(sample(letters[1:6], N, replace=TRUE))
X5 <- factor(sample(letters[1:3], N, replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N, 0, sigma)

# Create a bunch of missing values
X1[sample(1:N, size=100)] <- NA
X3[sample(1:N, size=300)] <- NA

# Gaussian data
gauss_data <- data.frame(Y=Y, X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, X6=X6)

# Perform a cross-validated fit
gauss_fit <- gbmt(Y ~ X1 + X2 + X3 + X4 + X5 + X6,
                  data=gauss_data, cv_folds=5, keep_gbm_data=TRUE)
A fitted model contains a number of measures of its performance. Using these measures, the optimal iteration of a gbm fit can be identified. The package offers three methods for identifying the optimal iteration of the fit: 'test', 'cv' and 'OOB'. The 'test' method identifies the optimal iteration using the performance of the fit on a set-aside independent test set; this is only available if the number of observations used to fit the model is less than the number of observations in the original data. The 'cv' method determines the best iteration number using the cross-validation error provided, and finally 'OOB' selects the best iteration using the out-of-bag error estimate. The latter is only available if bag_fraction < 1, and it will under-estimate the optimal number of iterations in the fit. The optimal iteration number, as well as a plot of the performance, can then be obtained through a call to gbmt_performance. In these performance plots the cross-validation error, test set error and out-of-bag improvement are represented by green, red and blue lines respectively. The error on the training set is shown as a black line and the optimal iteration is marked with a dashed blue line.
# Use gaussian fit and determine the performance
# red line is testing error
# green line is cv error
# blue line is OOB error
best_iter_test <- gbmt_performance(gauss_fit, method='test')
best_iter_cv <- gbmt_performance(gauss_fit, method='cv')
best_iter_oob <- gbmt_performance(gauss_fit, method='OOB')
## Warning in best_iter_out_of_bag(gbm_fit_obj): OOB generally underestimates the
## optimal number of iterations although predictive performance is reasonably
## competitive. Using cv_folds>1 when calling gbm usually results in improved
## predictive performance.
After evaluating the performance it may be the case that the optimal number of iterations equals the num_trees parameter that was set. In this instance the user may want to fit additional trees to the original dataset; this will not alter the cross-validation fit but will update the test set and OOB errors.
# Fit additional trees to gaussian fit and recalculate optimal number of iterations
gauss_fit_more <- gbm_more(gauss_fit, num_new_trees = 5000) # This is a large number of new trees
## Warning in gbm_more(gauss_fit, num_new_trees = 5000): gbm_more is incompatible
## with cross-validation; losing cv results.
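The optimal iteration for the extended fit, referenced in the prediction below as best_iter_test_more, can then be recomputed from the test-set performance, for example:
# Recalculate the optimal number of iterations for the extended fit
best_iter_test_more <- gbmt_performance(gauss_fit_more, method='test')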
With the model fitted and the optimal number of iterations
determined, the user can now make predictions on other datasets using
the gbm
package’s predict
functionality.
# Given additional data - of the same shape as our gaussian example
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4], N, replace=TRUE))
X4 <- ordered(sample(letters[1:6], N, replace=TRUE))
X5 <- factor(sample(letters[1:3], N, replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
Y <- X1^1.5 + 2 * (X2^0.5) + mu
Y <- Y + rnorm(N, 0, sigma)
data2 <- data.frame(Y=Y, X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, X6=X6)

# Then predictions can be made
preds <- predict(gauss_fit_more, newdata=data2, n.trees=best_iter_test_more)
gbm also offers a measure of the relative influence of each predictor in the fitted model. The influence of a variable \(x_j\) is determined by summing, across all iterations, the improvements in fit obtained whenever a tree splits on that variable.
# Relative influence of predictors - both examples
print(relative_influence(gauss_fit_more, best_iter_test_more))
## X1 X2 X3 X4 X5 X6
## 26365.267 91251.316 123736.475 4548.206 1395.813 9628.245
## X1 X2 X3
## 1836.329 13560.888 11647.626
As well as calculating the relative influence from the fitted trees, gbm also offers the ability to compute the marginal effects of specific variables and produce the corresponding plots using plot.
# Examine the marginal effects of the first two variables - gaussian example
# Shows the fitted model predictions as a function of 1st two predictors
# with all the others integrated out
plot(gauss_fit_more, var_index = 1:2)
Finally, there are some additional useful functions within the gbm package to remember. A GBMFit object can be summarised and printed using summary and print respectively. Each individually fitted tree can be “prettified” using pretty_gbm_tree, and the loss for a given distribution can be calculated using the loss S3 method along with a GBMDist object and the appropriate data.