XGBoost is a supervised machine learning algorithm that takes in training data and constructs a model to predict the outcome of new data instances.

XGBoost has gained a lot of popularity in the machine learning community due to its ability to train versatile models quickly and with strong predictive performance. It is an implementation of **gradient boosted decision trees** engineered for speed and performance.

For an in-depth introduction and more technical details, visit this link.

Here we give a gentle introduction to the XGBoost model using the Python scikit-learn package.

By the end of this tutorial you will have learnt:

- XGBoost installation for python use
- XGBoost data preparation and model training
- XGBoost model prediction
- XGBoost hyperparameter tuning

A **decision tree** is a non-parametric supervised learning method that predicts the value of a target variable by learning simple decision rules inferred from the data features. It can be used for **classification** (categorical target variables) or **regression** (continuous target variables) applications.

The following figure gives a decision tree for the well known toy problem of predicting whether a tennis game would be played or not, depending on the weather conditions.

A decision tree can be seen as a graph consisting of a finite, non-empty set of nodes (vertices) and a set of edges. There can be no cycles within the graph and there is exactly one node, called the **root**, which no edges enter. In the previous example, the root node has the label “Outlook”; the nodes with “No”/”Yes” labels are **leaf** (or terminal) **nodes**. Root and other internal nodes hold a test (**split-test**) over a given data set attribute (or a set of attributes), and the edges correspond to the possible outcomes of the test. Leaf nodes can either hold class labels (classification), continuous values (regression), (non-) linear models (regression), or even models produced by other machine learning algorithms.

Starting from the root, one has to follow the edges according to the results of the tests over the attributes. When reaching a leaf node, the information it contains is responsible for the prediction outcome.
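This traversal can be sketched in a few lines of Python. The nested-dict tree encoding, the `predict` helper, and the attribute values below are invented for this illustration of the tennis example:

```python
# A minimal sketch of decision-tree prediction for the tennis example.
# Internal nodes hold a split-test on an attribute; leaves hold class labels.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def predict(node, instance):
    # Follow the edges according to the attribute tests until a leaf is reached.
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node  # the leaf holds the prediction

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```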

Two important definitions regarding decision trees are the concepts of **depth** and **breadth**. The average number of layers (levels) from the root node to the terminal nodes is referred to as the average depth of the tree. The average number of internal nodes in each level of the tree is referred to as the average breadth of the tree. Both depth and breadth are indicators of tree complexity, that is, the higher their values are, the more complex the corresponding decision tree is.
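As a sketch, depth and average breadth can be computed over a toy tree encoded as nested dicts (the encoding and the helper names are invented for illustration):

```python
# Sketch: measuring the complexity of a tree encoded as nested dicts
# (internal nodes carry a "branches" mapping; leaves are plain labels).
def max_depth(node):
    if not isinstance(node, dict):  # a leaf adds no further levels
        return 0
    return 1 + max(max_depth(child) for child in node["branches"].values())

def internal_nodes_per_level(node, level=0, counts=None):
    counts = {} if counts is None else counts
    if isinstance(node, dict):  # count only internal (non-leaf) nodes
        counts[level] = counts.get(level, 0) + 1
        for child in node["branches"].values():
            internal_nodes_per_level(child, level + 1, counts)
    return counts

toy_tree = {"attribute": "Outlook",
            "branches": {"Sunny": {"attribute": "Humidity",
                                   "branches": {"High": "No", "Normal": "Yes"}},
                         "Overcast": "Yes",
                         "Rain": {"attribute": "Wind",
                                  "branches": {"Strong": "No", "Weak": "Yes"}}}}

print(max_depth(toy_tree))                 # 2
counts = internal_nodes_per_level(toy_tree)
print(sum(counts.values()) / len(counts))  # average breadth: 1.5
```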

A major issue in top-down induction of decision trees is which attribute(s) to choose to execute the split in each non-leaf node. The problem is to choose the attribute (for *univariate* decision trees) or a good combination of attributes (for *multivariate* decision trees) with good discriminatory power. This choice is based on an **information gain heuristic**^{1}, where the selected attribute reduces entropy (i.e. unpredictability of classification) and consequently the number of splits required to reach a leaf node.

Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an **ensemble**.

Generally speaking, ensembles are combinations of multiple base models whose outputs are combined to produce the final classification. Classifier ensembles have been shown to produce better results than single models, provided the classifiers in the ensemble are accurate and diverse. There are different methods to create these ensembles. One popular method is to present different problem representations (subsets of the training set) to a classifier algorithm. We can train many classifiers using this process; because they are trained on different problem representations, their decisions differ, and their combined decision is generally better than that of a single classifier.
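A minimal sketch of the combination step, with toy stand-in classifiers (the functions below are invented for illustration, not trained models):

```python
from collections import Counter

# Sketch: combining the decisions of several classifiers by majority vote.
# Each "classifier" here is just a function returning a class label.
def majority_vote(classifiers, instance):
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy classifiers that would have been trained on different subsets;
# they sometimes disagree, and the ensemble decision follows the majority.
clfs = [lambda x: "yes" if x > 2 else "no",
        lambda x: "yes" if x > 4 else "no",
        lambda x: "yes" if x > 3 else "no"]
print(majority_vote(clfs, 3.5))  # yes (two of the three vote yes)
```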

In this post we consider **ensembles of decision tree** classifiers, which are very popular and effective. An ensemble of decision tree classifiers is also called a **decision forest**.

**Gradient Boosting Machine (GBM)** is one of several methods that have been proposed to create decision forests. At a high level, GBM works by starting with a rough prediction and then building a series of decision trees, each tree in the series trying to correct the prediction error of the one before it. This technique is called **boosting** because we expect the ensemble to work much better than a single estimator. Gradient boosting builds the ensemble of trees **one-by-one**, and the predictions of the individual trees are summed. For example, if an ensemble has 3 trees, the prediction of that ensemble is:

$$D(\mathbf{x}) = d_{\text{tree}_1}(\mathbf{x}) + d_{\text{tree}_2}(\mathbf{x}) + d_{\text{tree}_3}(\mathbf{x})$$

The next tree ($d_{\text{tree}_4}$) in the ensemble should complement the existing trees well and minimize the training error of the ensemble. In the ideal case we would be happy to have:

$$D(\mathbf{x}) + d_{\text{tree}_4}(\mathbf{x}) = f(\mathbf{x})$$

where $f(\mathbf{x})$ is the target function. Obviously the previous equation depicts an ideal condition, and in reality we will have to settle for approximations.

The contribution of each tree to the sum can be weighted to slow down the learning of the algorithm. This weighting is called **shrinkage** or the **learning rate** (usually represented by the small Greek letter $\eta$). Empirical evidence suggests that small values of the learning rate favor better test error. The effect is that *learning is slowed down*, in turn requiring *more trees to be added to the model*, in turn taking longer to train: a configuration trade-off between the number of trees and the learning rate. The learning rate corresponds to how quickly the error is corrected from one tree to the next and is a simple multiplier: each tree's prediction is scaled by $\eta$ before being added to the ensemble.
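To make the sum-of-shrunk-trees idea concrete, here is a toy sketch (not XGBoost's actual algorithm): gradient boosting for squared error on a single feature, where each "tree" is a decision stump fitted to the current residuals and added with shrinkage $\eta$. The helper names and data are invented for illustration:

```python
import numpy as np

# Toy sketch of gradient boosting for squared error on 1-D data: each weak
# learner is a stump (threshold -> two constant predictions) fitted to the
# residuals of the current ensemble, then added with a shrinkage factor eta.

def fit_stump(x, residual):
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((residual - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x_new: np.where(x_new <= t, lv, rv)

def boost(x, y, n_trees=50, eta=0.1):
    pred = np.zeros_like(y, dtype=float)
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, y - pred)  # fit the current residuals
        pred += eta * tree(x)          # add the shrunk contribution
        trees.append(tree)
    # ensemble prediction = shrunk sum of the individual trees
    return lambda x_new: sum(eta * t(x_new) for t in trees)

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1., 1., 1., 3., 3., 3.])
model = boost(x, y)
print(model(x))  # close to y after enough shrunk trees
```

With a smaller `eta`, more trees are needed before the residuals shrink away, which is exactly the trade-off described above.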

### XGBoost installation

Assuming you have a working SciPy environment, XGBoost can be installed easily using pip.

```shell
sudo pip install xgboost
```

You can learn more about how to install XGBoost for different platforms on the XGBoost Installation Guide. For up-to-date instructions for installing XGBoost for Python see the XGBoost Python Package. For reference, you can review the XGBoost Python API reference.

In this tutorial we are going to use the commonly used Pima Indians onset of diabetes dataset. This dataset comprises 8 input variables that describe medical details of patients and one output variable indicating whether the patient will have an onset of diabetes within 5 years. You can learn more about this dataset on the UCI Machine Learning Repository website. This is a good dataset for a first XGBoost model because all of the input variables are numeric and the problem is a simple binary classification problem. It is not necessarily a good benchmark for the XGBoost algorithm because it is a relatively small dataset and an easy problem to model. Download this dataset and place it in your current working directory with the file name `diabetes.csv`.

### Data preparation and loading using scikit-learn

In this section we will load the data from file and prepare it for use for training and evaluating an XGBoost model.

We will start off by importing the classes and functions we intend to use in this tutorial.

```python
import numpy
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

Next, we can load the CSV file as a NumPy array using the NumPy function `loadtxt()`.

```python
dataset = numpy.loadtxt('diabetes.csv', delimiter=",")
```

We must separate the columns (attributes or features) of the dataset into input patterns (X) and output patterns (Y). We can do this easily by specifying the column indices in NumPy slice notation.

```python
features = dataset[:, 0:8]
labels = dataset[:, 8]
```

Finally, we must split the X and Y data into a training and test dataset. The training set will be used to fit the XGBoost model and the test set will be used to make new predictions, from which we can evaluate the performance of the model.

For this we will use the `train_test_split()` function from the scikit-learn library. We also specify a seed for the random number generator so that we always get the same split of data each time this example is executed.

```python
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=test_size, random_state=3)
```

We are now ready to train our model.

### XGBoost model training

XGBoost provides a wrapper class to allow models to be treated like classifiers or regressors in the scikit-learn framework. This means we can use the full scikit-learn library with XGBoost models.

The XGBoost model for classification is called `XGBClassifier`. We can create it and fit it to our training dataset. Models are fit using the scikit-learn API and the `model.fit()` function.

Parameters for training the model can be passed to the model in the constructor. Here, we use the sensible defaults.

```python
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
```

You can see the parameters used in a trained model by printing the model, for example:

```python
print(model)
```

or using the function `get_xgb_params()`:

```python
print(model.get_xgb_params())
```

You can learn more about the defaults for the `XGBClassifier` and `XGBRegressor` classes in the XGBoost Python scikit-learn API.

We are now ready to use the trained model to make predictions.

### XGBoost model prediction assessment

We can make predictions using the fit model on the test dataset. To make predictions we use the scikit-learn `model.predict()` function.

By default, the predictions made by XGBoost are probabilities. Because this is a binary classification problem, each prediction is the probability of the input pattern belonging to the first class. We can easily convert them to binary class values by rounding them to 0 or 1.

```python
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
```

Now that we have used the fit model to make predictions on new data, we can evaluate the performance of the predictions by comparing them to the expected values. For this we will use the built-in `accuracy_score()` function in scikit-learn.

```python
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

### In summary

Putting it all together, the following is the full code, commented for easier modification and tweaking.

```python
# first XGBoost model
import random

import numpy
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# set the seed of the random number generators so that results can be reproduced
SEED = 3
random.seed(SEED)
numpy.random.seed(SEED)

# load data
dataset = numpy.loadtxt('diabetes.csv', delimiter=",")

# split data into features and labels
features = dataset[:, 0:8]
labels = dataset[:, 8]

# split data into train and test sets
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=test_size, random_state=SEED)

# fit model
model = xgboost.XGBClassifier(seed=SEED)
model.fit(X_train, y_train)

# print the model parameters
print(model)

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

Running this example produces the following output.

```
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1,
       missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=3, silent=True, subsample=1)
Accuracy: 78.35%
```

The accuracy may differ slightly on your machine.

### Hyperparameter tuning

Algorithm tuning is a final step in the process of applied machine learning before presenting results. It is sometimes called *hyperparameter optimization* where the algorithm parameters are referred to as **hyperparameters** whereas the coefficients found by the machine learning algorithm itself are referred to as **parameters** or **weights**. Optimization suggests the search-nature of the problem.

In the previous code block we printed out the model and got useful details about parameters.

```
{'reg_alpha': 0, 'colsample_bytree': 1, 'silent': 1, 'colsample_bylevel': 1,
 'scale_pos_weight': 1, 'learning_rate': 0.1, 'missing': None,
 'max_delta_step': 0, 'base_score': 0.5, 'n_estimators': 100, 'subsample': 1,
 'reg_lambda': 1, 'seed': 3, 'min_child_weight': 1,
 'objective': 'binary:logistic', 'max_depth': 3, 'gamma': 0}
```

We start with the shrinkage hyperparameter `learning_rate`. Its default value is 0.1 and its range is (0, 1]. Generally a learning rate of 0.1 works, but values somewhere between 0.05 and 0.3 suit different problems. In practice one typically chooses the `learning_rate` beforehand and varies the number of trees `n_estimators` with respect to the chosen shrinkage. The most frequently used approach to deal with this trade-off relies on the cross-validation procedure.

**Cross-validation** provides a means for testing the model on withheld portions of the data, while still using all of the data at the processing stages. First, the shrinkage parameter, the maximum number of estimators, and the cross-validation parameter `nfold`, corresponding to the number of validation folds, are specified. The data is then partitioned into `nfold` disjoint, non-overlapping subsets. Afterwards, for each of the `nfold` subsets of the data, one of them is set aside as the validation set and the others are used for fitting the model. The fitted model is then tested on the validation set to produce held-out estimates of the predictive performance. At last, the validation performance is aggregated across the folds, for example by averaging the validation-set performances. This aggregated measure serves as an estimate of model generalization. XGBoost has a very useful function called `cv` which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required. Briefly, we have:

```python
import random

import numpy
import xgboost
from xgboost.sklearn import XGBClassifier

# set the seed of the random number generators so that results can be reproduced
SEED = 3
random.seed(SEED)
numpy.random.seed(SEED)

# load data
dataset = numpy.loadtxt('diabetes.csv', delimiter=",")

# split data into features and labels
features = dataset[:, 0:8]
labels = dataset[:, 8]

# 5-fold cross-validation is more accurate than using a single validation set
cv_folds = 5
early_stopping_rounds = 50

model = XGBClassifier(seed=SEED)
xgb_param = model.get_xgb_params()
xgtrain = xgboost.DMatrix(features, label=labels)
cvresult = xgboost.cv(xgb_param, xgtrain, num_boost_round=1000,
                      nfold=cv_folds, metrics='auc',
                      early_stopping_rounds=early_stopping_rounds, seed=SEED)
print(cvresult)
print("Optimal number of trees (estimators) is %i" % cvresult.shape[0])
```

The output is:

```
    test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0        0.802169      0.038329        0.837978       0.005065
....
79       0.826352      0.019820        0.952760       0.003617
Optimal number of trees (estimators) is 80
```

With `early_stopping_rounds = 50` and `num_boost_round = 1000`, the model will train until the validation score stops improving. The maximum number of training rounds is `1000`, and the validation error needs to decrease at least once every `50` rounds for training to continue. This procedure is also named **validation-based early stopping**^{2}. The metric watched in cross-validation for early stopping is AUC (area under the curve), `metrics = 'auc'`.

Assuming the `learning_rate` equals `0.1`, we found that the optimal number of estimators is `80`. When required, by increasing the learning rate we can decrease the number of estimators.

Next step is about **tree-specific parameters** (parameters influencing the tree shape):

- `max_depth`: maximum depth of a tree; increasing this value makes the model more complex and more likely to overfit. This should be between 3 and 10; 5 is a good starting point. Default: 6, range: [0, ∞).
- `min_child_weight`: minimum number of instances needed in each leaf. This parameter influences the partitioning process: the larger it is, the shallower the partitioning (leaf nodes will not be deeply located) and the more conservative the algorithm will be^{3}. Set it larger to prevent overfitting. The default value is 1; smaller values are required for highly imbalanced class problems. Range: [0, ∞).
- `gamma`: in a decision tree, a node is split only when the resulting split gives a positive reduction in the loss function^{4}. Gamma specifies the minimum loss reduction required to make a split and makes the algorithm more conservative. Default is 0; a small value like 0.1-0.2 can also be chosen as a starting point. The appropriate values can vary depending on the loss function. Range: [0, ∞).
- `subsample`: for each tree, a fraction of the observations is randomly selected; this parameter defines that fraction. 0.8 is a commonly used starting value; typical values range between 0.5 and 0.9. Default: 1, range: (0, 1].
- `colsample_bytree`: the fraction of attributes (columns) to consider while searching for the best split, selected at random. As a rule of thumb, the square root of the total number of attributes works well, but we should check up to 30-40% of the total number of features. Higher values can lead to overfitting, depending on the case. Typical values range between 0.5 and 0.9. Default: 1, range: (0, 1].
- `scale_pos_weight`: for unbalanced classes it helps to introduce a re-balancing; in particular, a typical value to consider is `sum(negative cases) / sum(positive cases)`. See Parameters Tuning for more discussion. Also see the Higgs Kaggle competition demo for examples: R, py1, py2, py3. Default: 1.
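Pulling the rules of thumb above together, a plausible untuned starting configuration could look like the following sketch. The values are the starting points suggested above, not tuned results, and whether they suit a given dataset is an assumption to verify:

```python
# Rule-of-thumb starting values for the tree-specific parameters above;
# these are illustrative defaults to tune from, not tuned results.
tree_params = {
    "max_depth": 5,            # between the suggested 3 and 10
    "min_child_weight": 1,     # default; smaller for imbalanced classes
    "gamma": 0.1,              # small positive starting value
    "subsample": 0.8,          # fraction of rows sampled per tree
    "colsample_bytree": 0.8,   # fraction of columns sampled per tree
    "scale_pos_weight": 1,     # sum(negatives) / sum(positives) if unbalanced
}

# These can be passed straight to the scikit-learn wrapper:
# model = xgboost.XGBClassifier(**tree_params)
print(sorted(tree_params))
```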

Tuning can be concretely executed using scikit-learn's `GridSearchCV`. **Grid search** is an approach to parameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.

The code block below evaluates different `max_depth` and `min_child_weight` values on the diabetes dataset. This is a two-dimensional grid search.

```python
from sklearn.model_selection import GridSearchCV

# using features, labels, cv_folds, model from the previous example
model.set_params(n_estimators=80)
param_test1 = {'max_depth': [2, 3, 4], 'min_child_weight': [1, 2, 3]}
gsearch1 = GridSearchCV(estimator=model, param_grid=param_test1,
                        scoring='roc_auc', n_jobs=4, cv=cv_folds, verbose=2)
gsearch1.fit(features, labels)
# print(gsearch1.cv_results_)
print(gsearch1.best_params_)
print(gsearch1.best_score_)
```

The output is:

```
{'max_depth': 2, 'min_child_weight': 1}
0.83236338225
```

Here we have run 9 (3 × 3) combinations. The suggested values are `2` for `max_depth` and `1` for `min_child_weight`.

Now let's tune the `gamma` value using the parameters already tuned above. We will check it for 5 values.

```python
# using features, labels, cv_folds, model from the previous example
model.set_params(n_estimators=80)
model.set_params(max_depth=2)
model.set_params(min_child_weight=1)
param_test2 = {'gamma': [i / 10.0 for i in range(0, 5)]}
gsearch2 = GridSearchCV(estimator=model, param_grid=param_test2,
                        scoring='roc_auc', n_jobs=4, cv=cv_folds, verbose=2)
gsearch2.fit(features, labels)
print(gsearch2.best_params_)
print(gsearch2.best_score_)
```

The output is:

```
{'gamma': 0.0}
0.83236338225
```

This shows that our original gamma value of `0.0` is the optimal one.

Now it is a good idea to re-calculate the optimal number of estimators using the new settings. The task is identical to the one already described, and the resulting value is `108` for `n_estimators`.

Now we complete the tuning of tree-specific parameters by checking different `subsample` and `colsample_bytree` value combinations.

```python
# using features, labels, cv_folds, model from the previous example
model.set_params(n_estimators=108)
model.set_params(max_depth=2)
model.set_params(min_child_weight=1)
param_test3 = {'subsample': [i / 10.0 for i in range(6, 11)],
               'colsample_bytree': [i / 10.0 for i in range(6, 11)]}
gsearch3 = GridSearchCV(estimator=model, param_grid=param_test3,
                        scoring='roc_auc', n_jobs=4, cv=cv_folds, verbose=2)
gsearch3.fit(features, labels)
print(gsearch3.best_params_)
print(gsearch3.best_score_)
```

The output is:

```
{'subsample': 0.9, 'colsample_bytree': 0.6}
0.842484975542
```

We found `0.9` as the optimum value for `subsample` and `0.6` for `colsample_bytree`.

And finally we conclude this tuning session by applying **regularization** to reduce **overfitting**.

Fitting a model can be framed as minimizing a loss over the training set:

$$\min_{w} \sum_{i} L\big(f(x_i), h_w(x_i)\big)$$

where $L$ is the loss function, $w$ is the vector of weights and $h_w$ is the **hypothesis function** that approximates the target function $f$^{5}. The target function is unknown to us.

Overfitting is a common problem that interferes with the attempt to develop accurate predictions. When overfitting occurs, $h_w$ explains not only the pattern but also the random noise in the training set. This effect degrades the generalization ability of our model.

A common way to reduce overfitting is to add a **regularization term** to the previous equation:

$$\min_{w} \sum_{i} L\big(f(x_i), h_w(x_i)\big) + \lambda R(w)$$

Here $R(w)$ is a **regularizer** that penalizes certain values of $w$, and $\lambda$ is the **regularization parameter**; the regularization term imposes a special penalty on the loss function. $R$ is typically the $L_1$ norm (Lasso^{6}), the $L_2$ norm (Ridge^{7}), or any other norm: the penalty falls on models with large weights ($L_2$ regularization) or with many non-zero weights ($L_1$ regularization). If we are training a decision tree, $R$ can be the tree depth. It is possible to combine Ridge and Lasso linearly to get **Elastic Net regression** (both squared and absolute-value components are included in the cost function). It is also a good idea to appropriately scale attributes, so that weights are penalized based on their predictive power and not their scale.

A value of $\lambda$ that is too low might not do anything, and one that is too high might cause us to underfit the model and lose valuable information. It's up to the user to find the sweet spot.
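As a small sketch of how the penalty enters the objective (the function name and signature below are invented for illustration):

```python
import numpy as np

# Sketch: a squared-error loss with an added L1 (Lasso) or L2 (Ridge)
# penalty; `lam` plays the role of the regularization parameter lambda.
def regularized_loss(y_true, y_pred, w, lam, norm="l2"):
    loss = ((y_true - y_pred) ** 2).mean()
    if norm == "l1":
        penalty = np.abs(w).sum()  # large when many weights are non-zero
    else:
        penalty = (w ** 2).sum()   # large when individual weights are big
    return loss + lam * penalty

w = np.array([1.0, -2.0, 0.0])
y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 2.0])
print(regularized_loss(y_true, y_pred, w, lam=1.0, norm="l1"))  # 0.0 + 1 * 3.0
```

A larger `lam` makes the penalty dominate the data-fit term, which is what pushes the model toward underfitting.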

With XGBoost, `reg_alpha` (default: 0) is the regularization parameter for the L1 term and `reg_lambda` (default: 1) is the regularization parameter for the L2 term.

We'll tune the `reg_alpha` value by checking several candidate values.

```python
# using features, labels, cv_folds, model from the previous example
model.set_params(n_estimators=108)
model.set_params(max_depth=2)
model.set_params(min_child_weight=1)
model.set_params(subsample=0.9)
model.set_params(colsample_bytree=0.6)
param_test4 = {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}
gsearch4 = GridSearchCV(estimator=model, param_grid=param_test4,
                        scoring='roc_auc', n_jobs=4, cv=cv_folds, verbose=2)
gsearch4.fit(features, labels)
print(gsearch4.best_params_)
print(gsearch4.best_score_)
```

The output is:

```
{'reg_alpha': 0.1}
0.843246680643
```

We got a better `reg_alpha`: `0.1`.

In summary:

- n_estimators=108
- max_depth=2
- min_child_weight=1
- subsample=0.9
- colsample_bytree=0.6
- reg_alpha=0.1

Now, assuming the initial test set, we can apply this parametrization to the model and look at the impact on accuracy. The new accuracy is `78.74%`. To evaluate how good this result is, we can check this comparison page.

Note that these hyperparameters interact: changing the `learning_rate`, for example, changes the optimal number of estimators (e.g. to `75`), and it will affect `subsample`, which will impact `reg_alpha`. It's very important to carefully control this chain reaction of changes. With reference to the initial test set, after applying a grid search on the `learning_rate` parameter, we got an accuracy of 79.92% just by updating `learning_rate` to `0.325`. You can learn more about the meaning of each parameter and how to configure them on the XGBoost parameters page.

### Conclusions

In this post you learnt how to develop and tune your first XGBoost model in Python.

The following is a list of the resources which I found most useful and would suggest for further reading.

- Data Preparation for Gradient Boosting with XGBoost in Python
- Kaggle Competition: BNP Paribas Cardif Claims Management
- A Guide to Gradient Boosted Trees with XGBoost in Python
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Notes on eXtreme Gradient Boosting
- Complete Guide to Parameter Tuning in XGBoost (with codes in Python)
- Kaggle Titanic Survival Prediction
- How to Tune Algorithm Parameters with Scikit-Learn
- Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
- How to Visualize Gradient Boosting Decision Trees With XGBoost in Python

- Heuristics are a way to use available data in a problem to provide guidance in finding a solution when an exhaustive search is not practical.
- Another very important purpose of cross-validation is to avoid overfitting. It randomly divides the training set into N folds, sets each fold aside in turn for testing, and trains on the remaining data. Thus it runs your parameters N times and returns the mean performance on the training and testing sets, along with the respective standard deviations.
- Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- The job of a loss/cost function is to tell us how “bad” a system’s prediction is in comparison to the truth. In particular, if $y$ is the truth and $\hat{y}$ is the system’s prediction, then $L(y, \hat{y})$ is a measure of error.
- The set of all possible hypotheses is known as the **hypothesis set**.
- https://en.wikipedia.org/wiki/Lasso_(statistics)
- https://en.wikipedia.org/wiki/Tikhonov_regularization
