XGBoost is a supervised machine learning algorithm: it takes training data and constructs a model that predicts the outcome for new data instances.

XGBoost has gained a lot of popularity in the machine learning community thanks to its ability to train versatile models quickly and with strong predictive performance. It is an implementation of gradient boosted decision trees engineered for speed and performance.

For an in-depth introduction with more technical details, follow this link.

Here we give a gentle introduction to the XGBoost model using the Python scikit-learn package.

By the end of this tutorial you will have learnt:

  • XGBoost installation for python use
  • XGBoost data preparation and model training
  • XGBoost model prediction
  • XGBoost hyperparameter tuning

A decision tree is a non-parametric supervised learning method that predicts the value of a target variable by learning simple decision rules inferred from the data features. It can be used for classification (categorical variables) or regression (continuous variables) applications.

The following figure gives a decision tree for the well known toy problem of predicting whether a tennis game would be played or not, depending on the weather conditions.

[Figure: decision tree for the play-tennis example]

A decision tree can be seen as a graph consisting of a finite, non-empty set of nodes (vertices) and a set of edges. There can be no cycles within the graph and there is exactly one node, called the root, which no edges enter. In the previous example, the root node has the label “Outlook”; the nodes with “No”/”Yes” labels are leaf (or terminal) nodes. Root and other internal nodes hold a test (split-test) over a given data set attribute (or a set of attributes), and the edges correspond to the possible outcomes of the test. Leaf nodes can either hold class labels (classification), continuous values (regression), (non-) linear models (regression), or even models produced by other machine learning algorithms.
Starting from the root, one has to follow the edges according to the results of the tests over the attributes. When reaching a leaf node, the information it contains is responsible for the prediction outcome.

Two important definitions regarding decision trees are the concepts of depth and breadth. The average number of layers (levels) from the root node to the terminal nodes is referred to as the average depth of the tree. The average number of internal nodes in each level of the tree is referred to as the average breadth of the tree. Both depth and breadth are indicators of tree complexity, that is, the higher their values are, the more complex the corresponding decision tree is.

A major issue in top-down induction of decision trees is which attribute(s) to choose for the split in each non-leaf node. The problem is to choose the attribute (for univariate decision trees) or a good combination of attributes (for multivariate decision trees) with good discriminatory power. This choice is based on an information gain heuristic [1], where the selected attribute reduces entropy (i.e. the unpredictability of the classification) and consequently the number of splits required to reach a leaf node.
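
To make the heuristic concrete, here is a minimal sketch (plain NumPy, assuming the classic 14-row play-tennis data as an example) that computes the entropy of the class labels and the information gain obtained by splitting on one attribute:

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts.astype(float) / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    # entropy reduction obtained by splitting on the values of one attribute
    gain = entropy(labels)
    for value in np.unique(attribute):
        mask = attribute == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# classic play-tennis example: Outlook attribute vs. Play label
outlook = np.array(['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                    'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain'])
play = np.array(['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'])
print(information_gain(outlook, play))  # about 0.25 bits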

Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

Generally speaking, ensembles are combinations of multiple base models whose final classification depends on the combined outputs of the individual models. Classifier ensembles have been shown to produce better results than single models, provided the classifiers in the ensemble are accurate and diverse. There are different methods to create these ensembles. One popular method is to present different problem representations (subsets of the training set) to a classifier algorithm. We can train many classifiers in this way; they differ because they are trained on different problem representations. Their decisions are then combined to obtain the final decision, which is generally better than that of a single classifier.

In this post we consider ensembles of decision tree classifiers, which are very popular and effective. An ensemble of decision tree classifiers is also called a decision forest.
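
As an illustrative sketch (not XGBoost yet), the following code builds a tiny decision forest by training scikit-learn decision trees on different bootstrap samples of the training set and combining their decisions by majority vote; it assumes binary 0/1 labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10, seed=3):
    # each tree sees a different bootstrap sample (a different "problem representation")
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), len(X))
        trees.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # majority vote over the individual tree decisions (assumes 0/1 labels)
    votes = np.array([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)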

Gradient Boosting Machine (GBM) is one of several methods that have been proposed to create decision forests. At a high level, GBM works by starting with a rough prediction and then building a series of decision trees, with each tree in the series trying to correct the prediction error of the trees built so far. This technique is called boosting because we expect the ensemble to work much better than a single estimator. Gradient boosting builds an ensemble of trees one by one, and the predictions of the individual trees are summed. For example, if an ensemble has 3 trees, the prediction D(\bar{x}) of that ensemble is:

D(\bar{x}) = d_{tree_1}(\bar{x})+d_{tree_2}(\bar{x})+d_{tree_3}(\bar{x})

The next tree (tree_4) in the ensemble should complement the existing trees well and minimize the training error of the ensemble. In the ideal case we would be happy to have:

D(\bar{x}) + d_{tree_4}(\bar{x}) = f(\bar{x})

where f(\cdot) is the target function. Obviously the previous equation depicts an ideal condition; in practice we will only get an approximation.

The contribution of each tree to the sum can be weighted to slow down the learning of the algorithm. This weighting is called shrinkage or learning rate (usually denoted by the small Greek letter eta, \eta). Empirical evidence suggests that small values of the learning rate favor better test error. The effect is that learning is slowed down, which in turn requires more trees to be added to the model and makes training take longer, so there is a configuration trade-off between the number of trees and the learning rate. The learning rate corresponds to how quickly the error is corrected from each tree to the next and is a simple multiplier 0 < \eta \leq 1.
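
The following toy sketch shows the additive, stage-wise nature of gradient boosting for a regression problem with squared loss: each shallow tree is fit to the residual errors of the ensemble built so far, and its contribution is shrunk by the learning rate \eta. This is only an illustration of the idea, not how XGBoost is implemented internally:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, eta=0.1, max_depth=3):
    # start from a rough prediction: the mean of the targets
    base = y.mean()
    residual = y - base
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)              # fit the current residual errors
        residual -= eta * tree.predict(X)  # update residuals with the shrunk contribution
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, eta=0.1):
    # D(x) = base + eta * (d_tree1(x) + d_tree2(x) + ...)
    return base + eta * np.sum([tree.predict(X) for tree in trees], axis=0)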

XGBoost installation

Assuming you have a working SciPy environment, XGBoost can be installed easily using pip.

sudo pip install xgboost

You can learn more about how to install XGBoost for different platforms on the XGBoost Installation Guide. For up-to-date instructions for installing XGBoost for Python see the XGBoost Python Package. For reference, you can review the XGBoost Python API reference.

In this tutorial we are going to use the commonly used Pima Indians onset of diabetes dataset. This dataset comprises 8 input variables that describe medical details of patients and one output variable that indicates whether the patient will have an onset of diabetes within 5 years. You can learn more about this dataset on the UCI Machine Learning Repository website. This is a good dataset for a first XGBoost model because all of the input variables are numeric and the problem is a simple binary classification problem. It is not necessarily a good showcase for the XGBoost algorithm because it is a relatively small dataset and an easy problem to model. Download this dataset and place it into your current working directory with the file name "diabetes.csv".

Data preparation and loading using scikit-learn

In this section we will load the data from file and prepare it for use for training and evaluating an XGBoost model.

We will start off by importing the classes and functions we intend to use in this tutorial.

import numpy
import xgboost
from sklearn import cross_validation
from sklearn.metrics import accuracy_score

Next, we can load the CSV file as a NumPy array using the NumPy function loadtxt().

dataset = numpy.loadtxt('diabetes.csv', delimiter = ",")

We must separate the columns (attributes or features) of the dataset into input patterns (X) and output patterns (Y). We can do this easily by specifying the column indices in the NumPy array format.

features = dataset[:,0:8]
labels = dataset[:,8]

Finally, we must split the X and Y data into a training and test dataset. The training set will be used to prepare the XGBoost model and the test set will be used to make new predictions, from which we can evaluate the performance of the model.

For this we will use the train_test_split() function from the scikit-learn library. We also specify a seed for the random number generator so that we always get the same split of data each time this example is executed.

test_size = 0.33
X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, labels, test_size = test_size, random_state = 3)

We are now ready to train our model.

XGBoost Model training

XGBoost provides a wrapper class to allow models to be treated like classifiers or regressors in the scikit-learn framework. This means we can use the full scikit-learn library with XGBoost models.

The XGBoost model for classification is called XGBClassifier. We can create and fit it to our training dataset. Models are fit using the scikit-learn API and the model.fit() function.

Parameters for training the model can be passed to the model in the constructor. Here, we use the sensible defaults.

model = xgboost.XGBClassifier()
model.fit(X_train, y_train)

You can see the parameters used in a trained model by printing the model, for example:

print(model)

or using the function get_xgb_params():

print(model.get_xgb_params())

You can learn more about the defaults for the XGBClassifier and XGBRegressor classes in the XGBoost Python scikit-learn API.

We are now ready to use the trained model to make predictions.

XGBoost Model prediction assessment

We can make predictions using the fit model on the test dataset. To make predictions we use the scikit-learn function model.predict().

The underlying booster for the binary:logistic objective outputs probabilities, i.e. the probability that the input pattern belongs to the positive class, while the scikit-learn wrapper's predict() generally converts these into class labels. Rounding the predicted values to 0 or 1 ensures we end up with binary class values in either case.

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
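
If you prefer to work with explicit probabilities, the scikit-learn wrapper also exposes predict_proba(); in this sketch the second column is the probability of class 1 (onset of diabetes), and thresholding it at 0.5 reproduces the rounded predictions above:

# explicit class-1 probabilities for each test instance
y_proba = model.predict_proba(X_test)[:, 1]
predictions_from_proba = [1 if p > 0.5 else 0 for p in y_proba]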

Now that we have used the fit model to make predictions on new data, we can evaluate the performance of the predictions by comparing them to the expected values. For this we will use the built in accuracy_score() function in scikit-learn.

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

In summary

Putting it all together, the following is the full code, commented for easier modification and tweaking.

# first XGBoost model
import numpy
import xgboost
from sklearn import cross_validation
from sklearn.metrics import accuracy_score

# set the seed of random number generator, which is useful for creating simulations 
# or random objects that can be reproduced.
import random
SEED = 3
random.seed(SEED)
numpy.random.seed(SEED)

# load data
dataset = numpy.loadtxt('diabetes.csv', delimiter = ",")

# split data into features and labels
features = dataset[:,0:8]
labels = dataset[:,8]

# split data into train and test sets
test_size = 0.33
X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, labels, test_size = test_size, random_state = SEED)

# fit model
model = xgboost.XGBClassifier(seed = SEED)
model.fit(X_train, y_train)

# printing the model for visualization
print(model)

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Running this example produces the following output.

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=3, silent=True, subsample=1)
Accuracy: 78.35%

The accuracy may differ on your machine.

Hyperparameter tuning

Algorithm tuning is a final step in the process of applied machine learning before presenting results. It is sometimes called hyperparameter optimization: the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters or weights. The word optimization hints at the search nature of the problem.

In the previous code block we printed out the model and got useful details about parameters.

{
'reg_alpha': 0, 
'colsample_bytree': 1, 
'silent': 1, 
'colsample_bylevel': 1, 
'scale_pos_weight': 1, 
'learning_rate': 0.1, 
'missing': None, 
'max_delta_step': 0, 
'base_score': 0.5, 
'n_estimators': 100, 
'subsample': 1, 
'reg_lambda': 1, 
'seed': 3, 
'min_child_weight': 1, 
'objective': 'binary:logistic', 
'max_depth': 3, 
'gamma': 0
}

We start with the shrinkage hyperparameter learning_rate. Its default value is 0.1 and its range is [0,1]. Generally a learning rate of 0.1 works, but values between 0.05 and 0.3 can suit different problems. In practice one typically chooses the learning_rate beforehand and varies the number of trees n_estimators with respect to the chosen shrinkage. The most frequently used approach to deal with this trade-off relies on the cross-validation procedure.

Cross-validation provides a means of testing the model on withheld portions of data, while still using all of the data during fitting. First, the shrinkage parameter, the maximum number of estimators and the cross-validation parameter nfold, corresponding to the number of validation folds, are specified. The data is then partitioned into nfold disjoint subsets. Afterwards, for each of the nfold subsets of the data, one of them is set aside as the validation set and the others are used for fitting the model. The fitted model is then tested on the validation set to produce held-out estimates of the predictive performance. Finally, the validation performance is aggregated across the folds, for example by averaging the validation set performances. This aggregated measure serves as an estimate of the model's generalization ability.
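
As a minimal sketch of the procedure just described (plain NumPy; any estimator following the scikit-learn fit/score interface would do), the sample indices are partitioned into nfold disjoint validation sets, each fold is held out in turn, and the fold scores are finally averaged:

import numpy as np

def kfold_scores(estimator, X, y, nfold=5, seed=3):
    # partition the sample indices into nfold disjoint validation sets
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), nfold)
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        estimator.fit(X[train_idx], y[train_idx])
        scores.append(estimator.score(X[val_idx], y[val_idx]))
    # aggregated estimate of generalization performance
    return np.mean(scores), np.std(scores)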

XGBoost has a very useful function called cv which performs cross-validation at each boosting iteration and thus returns the optimal number of trees required. Briefly, we have:

import numpy
import xgboost
from xgboost.sklearn import XGBClassifier

# set the seed of random number generator, which is useful for creating simulations 
# or random objects that can be reproduced.
import random
SEED=3
random.seed(SEED)
numpy.random.seed(SEED)

# load data
dataset = numpy.loadtxt('diabetes.csv', delimiter = ",")

# split data into features and labels
features = dataset[:,0:8]
labels = dataset[:,8]

# 5 fold cross validation is more accurate than using a single validation set
cv_folds = 5

early_stopping_rounds = 50

model=XGBClassifier(seed = SEED)

xgb_param = model.get_xgb_params()
xgtrain = xgboost.DMatrix(features, label=labels)
cvresult = xgboost.cv(xgb_param, xgtrain, num_boost_round = 1000, nfold = cv_folds, metrics = 'auc', early_stopping_rounds = early_stopping_rounds, seed = SEED)

print(cvresult)

print("Optimal number of estimators is %i" % cvresult.shape[0])

The output is:

    test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0        0.802169      0.038329        0.837978       0.005065
....
79       0.826352      0.019820        0.952760       0.003617
Optimal number of estimators is 80

With early_stopping_rounds = 50 and num_boost_round = 1000, the model trains until the validation score stops improving: the maximum number of boosting rounds is 1000, and the validation score needs to improve at least once every 50 rounds for training to continue. This procedure is also called validation-based early stopping [2]. The metric watched during cross-validation for early stopping is the AUC (Area Under the Curve), metrics = 'auc'.
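
The same early-stopping mechanism is also available when fitting the scikit-learn wrapper directly against a held-out evaluation set. A sketch, assuming the X_train/X_test split from the earlier listing is still in scope (es_model is just a throwaway name; note that in recent XGBoost releases early_stopping_rounds is passed to the constructor rather than to fit()):

# watch AUC on a held-out set and stop when it has not improved for 50 rounds
es_model = XGBClassifier(n_estimators = 1000, seed = SEED)
es_model.fit(X_train, y_train,
             eval_set = [(X_test, y_test)],
             eval_metric = 'auc',
             early_stopping_rounds = 50,
             verbose = False)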

With learning_rate equal to 0.1, we found that the optimal number of estimators is 80. If required, we can increase the learning rate to decrease the number of estimators.

The next step concerns tree-specific parameters (parameters influencing the tree shape); a typical starting configuration is sketched after the list:

  • max_depth: maximum depth of a tree; increasing this value makes the model more complex and more likely to overfit. It should typically be between 3 and 10; 5 is a good starting point. Default: 6 (the scikit-learn wrapper used here defaults to 3, as shown in the printed parameters), range: [1, \infty )
  • min_child_weight: minimum number of instances (more precisely, the minimum sum of instance weights) needed in each leaf. This parameter influences the partitioning process: the larger it is, the shallower the partitioning will go (leaf nodes will sit less deep in the tree) and the more conservative the algorithm will be [3]. Set it larger to prevent overfitting. Default: 1; smaller values are required with highly imbalanced class problems. Range: [0, \infty )
  • gamma: in a decision tree a node is split only when the resulting split gives a positive reduction in the loss function [4]. gamma specifies the minimum loss reduction required to make a split and makes the algorithm more conservative. Default: 0. A small value like 0.1-0.2 can also be chosen as a starting point. Suitable values vary depending on the loss function. Range: [0, \infty )
  • subsample: the fraction of observations randomly sampled for each tree. 0.8 is a commonly used starting value; typical values range between 0.5 and 0.9. Default: 1, range: (0,1]
  • colsample_bytree: the fraction of attributes (columns) randomly sampled when constructing each tree, i.e. the attributes considered while searching for the best splits. As a rule of thumb, the square root of the total number of attributes works well, but checking up to 30-40% of the total number of features is worthwhile. Higher values can lead to overfitting, although this depends on the case. Typical values range between 0.5 and 0.9. Default: 1, range: (0,1]
  • scale_pos_weight: for unbalanced classes it helps to introduce a rebalancing; in particular, a typical value to consider is sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion. Also see the Higgs Kaggle competition demo for examples: R, py1, py2, py3. Default: 1
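
Here is a hypothetical starting configuration assembling the tree-specific parameters above; the values are simply the starting points suggested in the list (plus the number of estimators found earlier), not tuned results:

# illustrative starting values for the tree-specific parameters
start_model = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 80,        # from the cross-validation above
    max_depth = 5,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.8,
    colsample_bytree = 0.8,
    scale_pos_weight = 1,
    seed = SEED)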

Tuning can be concretely executed using the sklearn GridSearchCV. Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

The code block below evaluates different max_depth and min_child_weight values on the diabetes dataset. This is a two-dimensional grid search.

from sklearn.grid_search import GridSearchCV

# using features, labels, cv_folds, model from previous example

model.set_params(n_estimators = 80)

param_test1 = {
'max_depth': [2,3,4],
'min_child_weight': [1,2,3]
}

gsearch1 = GridSearchCV(estimator = model, param_grid = param_test1, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = cv_folds, verbose = 2)

gsearch1.fit(features,labels)

#print(gsearch1.grid_scores_)
print(gsearch1.best_params_)
print(gsearch1.best_score_)

The output is:

{'max_depth': 2, 'min_child_weight': 1}
0.83236338225

Here we have evaluated 9 (3 x 3) combinations. The suggested values are 2 for max_depth and 1 for min_child_weight.

Now let's tune the gamma value using the parameters already tuned above. We will check 5 candidate values.

# using features, labels, cv_folds, model from previous example

model.set_params(n_estimators = 80)
model.set_params(max_depth = 2)
model.set_params(min_child_weight = 1)

param_test2 = {
'gamma':[i/10.0 for i in range(0,5)]
}

gsearch2 = GridSearchCV(estimator = model, param_grid = param_test2, scoring = 'roc_auc',n_jobs = 4, iid = False, cv = cv_folds, verbose = 2)

gsearch2.fit(features,labels)

print(gsearch2.best_params_)
print(gsearch2.best_score_)

The output is:

{'gamma': 0.0}
0.83236338225

This shows that the default value of gamma (0.0) is already the optimal one.

Now, it is a good idea to re-calculate the optimal number of estimators by using the new settings. This task is identical to that already described and the resulting value is 108 for n_estimators.
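
A minimal sketch of that re-calculation, reusing model, xgtrain, cv_folds, early_stopping_rounds and SEED from the cross-validation code above (the exact number of rounds returned may vary slightly; the post obtains 108 here):

# re-estimate the optimal number of trees with the tuned tree parameters
model.set_params(max_depth = 2, min_child_weight = 1, gamma = 0)
xgb_param = model.get_xgb_params()
cvresult = xgboost.cv(xgb_param, xgtrain, num_boost_round = 1000, nfold = cv_folds,
                      metrics = 'auc', early_stopping_rounds = early_stopping_rounds, seed = SEED)
print("Optimal number of estimators is %i" % cvresult.shape[0])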

Now we complete the tuning of tree-specific parameters by checking different subsample and colsample_bytree value combinations.

# using features, labels, cv_folds, model, cvresult from previous example

model.set_params(n_estimators = 108)
model.set_params(max_depth = 2)
model.set_params(min_child_weight = 1)

param_test3 = {
'subsample' : [i/10.0 for i in range(6,11)],
'colsample_bytree' : [i/10.0 for i in range(6,11)]
}

gsearch3 = GridSearchCV(estimator = model, param_grid = param_test3, scoring = 'roc_auc',n_jobs = 4, iid = False, cv = cv_folds, verbose = 2)

gsearch3.fit(features,labels)

print(gsearch3.best_params_)
print(gsearch3.best_score_)

The output is:

{'subsample': 0.9, 'colsample_bytree': 0.6}
0.842484975542

We found 0.9 as the optimum value for subsample and 0.6 for colsample_bytree.

Finally, we conclude this tuning session by applying regularization to reduce overfitting.

Much of machine learning can be framed as an optimization problem: there is some loss/cost function which we want to minimize. Typically we are trying to find a set of parameters/weights \theta_i for our model which minimizes a non-negative cost function like:

J(\bar{\theta}) = \sum_{i=1}^{m}l(h_{\bar{\theta}}(x_i)-y_i)

where l(\cdot) is the loss function, \bar{\theta} is the vector of weights \theta_i and h_{\bar{\theta}}(\cdot) is the hypothesis function that approximates the target function \bar{y}=f(\bar{x}) [5]. The target function is unknown to us.

Overfitting is a common problem that interferes with the attempt to develop accurate predictions. When overfitting occurs, h_{\bar{\theta}}(\cdot) explains not only the underlying pattern but also the random noise in the training set. This degrades the generalization ability of our model.

A common way to reduce overfitting is to add a regularization term to the previous equation:

J(\bar{\theta})=\sum_{i=1}^{m}l(h_{\bar{\theta}}(x_i)-y_i) + \lambda\cdot R(\bar{\theta})

R(\bar{\theta}) is a regularizer that penalizes certain values of \bar{\theta} and \lambda is the regularization parameter. The regularization term imposes a special penalty on the loss function. R(\cdot) is either the L_1 norm (Lasso [6]), the L_2 norm (Ridge [7]) or any other norm. In particular, the L_2 regularization R(\bar{\theta}) = \sum_i \theta_i^2 (the sum of squares of the weights) penalizes models with large weights, whereas the L_1 regularization R(\bar{\theta}) = \sum_i |\theta_i| (the sum of absolute values of the weights) penalizes models with many non-zero weights. If we are training a decision tree, R(\cdot) can be the tree depth. It is also possible to combine Ridge and Lasso linearly to get Elastic Net regression (both squared and absolute value components are included in the cost function). It is also a good idea to scale the attributes appropriately, so that weights are penalized based on their predictive power and not their scale.

A value of \lambda that is too low might not do anything, and one that is too high might actually cause us to underfit the model and lose valuable information. It’s up to the user to find the sweet spot.
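
A tiny numeric example of the two penalties for a toy weight vector, to make the role of \lambda concrete (the numbers are arbitrary):

import numpy as np

theta = np.array([0.5, -2.0, 0.0, 3.0])   # toy weight vector
l1 = np.sum(np.abs(theta))                # 5.5   -> Lasso penalty R(theta)
l2 = np.sum(theta ** 2)                   # 13.25 -> Ridge penalty R(theta)

for lam in (0.01, 0.1, 1.0):
    # the term lambda * R(theta) added to the data loss J(theta)
    print(lam, lam * l1, lam * l2)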

With XGBoost, reg_alpha (default: 0) is the regularization parameter for L1 regularization term and reg_lambda (default: 1) is the regularization parameter for L2 term.

We'll tune reg_alpha by checking different values.

# using features, labels, cv_folds, model from previous example

model.set_params(n_estimators = 108)
model.set_params(max_depth = 2)
model.set_params(min_child_weight = 1)
model.set_params(subsample=0.9)
model.set_params(colsample_bytree=0.6)

param_test4 = {
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}

gsearch4 = GridSearchCV(estimator = model, param_grid = param_test4, scoring = 'roc_auc',n_jobs = 4, iid = False, cv = cv_folds, verbose = 2)

gsearch4.fit(features,labels)

print(gsearch4.best_params_)
print(gsearch4.best_score_)

The output is:

{'reg_alpha': 0.1}
0.843246680643

We obtained a slightly better score with reg_alpha = 0.1.

In summary:

  • n_estimators=108
  • max_depth=2
  • min_child_weight=1
  • subsample=0.9
  • colsample_bytree=0.6
  • reg_alpha=0.1

Now, using the initial train/test split, we can apply this parametrization to the model and look at the impact on accuracy. The new accuracy is 78.74%. To evaluate how good this result is, we can check this comparison page.
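
A sketch of that final check, assuming the variables from the first full listing (X_train, X_test, y_train, y_test, SEED) are still in scope, and plugging in the tuned values from the summary above:

# evaluate the tuned configuration on the held-out test set
tuned_model = xgboost.XGBClassifier(
    n_estimators = 108,
    max_depth = 2,
    min_child_weight = 1,
    subsample = 0.9,
    colsample_bytree = 0.6,
    reg_alpha = 0.1,
    seed = SEED)
tuned_model.fit(X_train, y_train)

tuned_predictions = [round(value) for value in tuned_model.predict(X_test)]
accuracy = accuracy_score(y_test, tuned_predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))   # the post reports 78.74% for this configuration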

Starting from the previous parametrization we can try to recalculate the optimal number of estimators once more. The new value is 75, which in turn can change the optimal subsample, which in turn affects reg_alpha: it is important to keep this chain reaction of changes under careful control. With reference to the initial test set, after applying a grid search on the learning_rate parameter, we got an accuracy of 79.92% just by updating learning_rate to 0.325.

You can learn more about the meaning of each parameter and how to configure them on the XGBoost parameters page.

Conclusions

In this post you learnt how to develop and tune your first XGBoost model in Python.

The following is a list of the resources which I found most useful and would suggest for further reading.

  1. Heuristics are a way to use available data in a problem to provide guidance in finding a solution when an exhaustive search is not practical.
  2. Another very important purpose of cross validation is to avoid overfitting. It randomly divides the training set into N folds, sets each fold aside in turn for testing and trains on the remaining data. Thus it evaluates your parameters N times and returns the mean performance on the training and testing sets along with the respective standard deviations.
  3. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
  4. The job of loss/cost function L(\cdot,\cdot) is to tell us how “bad” a system’s prediction is in comparison to the truth. In particular, if \bar{y} is the truth and \hat{y} is the system’s prediction, then L(\hat{y},\bar{y}) is a measure of error.
  5. The set of all possible hypotheses is known as the hypothesis set H.
  6. https://en.wikipedia.org/wiki/Lasso_(statistics)
  7. https://en.wikipedia.org/wiki/Tikhonov_regularization
