Gradient Boosting Classification Explained Through Python

Ensemble Methods

Generally speaking, you would rather make use of all your good predictors than agonise over picking a single one just because it has a 0.0001 gain in accuracy. This is where ensemble learning enters the picture. By training several predictors on the data instead of relying on one, ensemble learning typically yields better results than any individual model. A Random Forest, for example, is simply an ensemble of Decision Trees, each trained on a different random subset of the data.

Ensemble methods are similar to an orchestra: several musicians play different instruments rather than just one, and by combining all of the musical parts, the result sounds better than anything a single player could produce alone.

More precisely, Gradient Boosting is a Boosting technique, which is a particular kind of ensemble learning. So what is Boosting?

Boosting

Boosting is an ensemble learning strategy that builds a strong learner (a model with high accuracy) by combining several weak learners (predictors that perform only slightly better than chance). For this to work, each model learns from the errors made by its predecessor.

The two most widely used boosting techniques are:

  • Adaptive Boosting (covered in my article here)
  • Gradient Boosting

We're going to talk about gradient boosting.

Gradient Boosting

In Gradient Boosting, each predictor tries to improve on its predecessor by reducing the errors. The key idea, however, is that instead of fitting a new predictor to the data at each iteration, it fits the new predictor to the residual errors made by the previous predictor. Let's go through a step-by-step example of how Gradient Boosting Classification works:

  1. To make an initial prediction on the data, the algorithm computes the log of the odds of the target feature. This is the log of the number of True values (target = 1) divided by the number of False values (target = 0).
    For example, if we had a breast cancer dataset of six instances, with four instances of people who have breast cancer (four target values of 1) and two instances of people who do not (two target values of 0), then log(odds) = log(4/2) ≈ 0.7. This is our base estimate.
  2. To turn this into a prediction, we convert the log(odds) value into a probability with a logistic function. For our log(odds) value of about 0.7, the logistic function gives a probability of roughly 0.7 (more precisely, about 0.67).
  3. Since this probability is greater than 0.5, the algorithm's initial prediction for every instance is class 1. The formula for converting a log(odds) into a probability is: Probability = e^log(odds) / (1 + e^log(odds)).
  4. It then computes the residual for each instance in the training set, in other words the observed value minus the predicted probability.
  5. Finally, it builds a new Decision Tree that tries to predict the residuals computed in the previous step (a worked sketch of these steps follows this list). This is where things get a little more involved than in Gradient Boosting Regression.
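To make these steps concrete, here is a minimal sketch of the toy six-instance example in plain Python; the numbers are only illustrative and this is not how scikit-learn implements it internally:

import math

# Toy dataset: 4 people with breast cancer (1), 2 without (0)
y = [1, 1, 1, 1, 0, 0]

# Step 1: base estimate as the log of the odds
log_odds = math.log(sum(y) / (len(y) - sum(y)))        # log(4/2) ~ 0.7

# Step 2: convert the log(odds) into a probability with the logistic function
prob = math.exp(log_odds) / (1 + math.exp(log_odds))   # ~0.67

# Step 3: since prob > 0.5, the initial prediction for every instance is class 1

# Step 4: residuals = observed value minus predicted probability
residuals = [label - prob for label in y]              # ~0.33 for a 1, ~-0.67 for a 0

# Step 5: a small Decision Tree would now be fitted to these residuals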

When building this Decision Tree, there is a maximum number of leaves it is allowed to have. The user can specify this as a parameter, and it typically ranges from 8 to 32. This leads to two possible outcomes:

  • several instances end up sharing the same leaf, or
  • an instance gets a leaf of its own.

In Gradient Boosting for Regression, we could simply average the residuals in each leaf to obtain its output value and leave a single-instance leaf as it is. For classification, however, we must transform these values using the following formula:

Output value = Σ(Residuals in the leaf) / Σ(PreviousProb × (1 − PreviousProb))

The symbol Σ means "sum of," and PreviousProb denotes the probability we previously computed, which in this case was 0.7. This transformation is applied to every leaf in the tree. Why do we do this? Recall that our base estimator is a log(odds), while the tree was built on probabilities; the two live on different scales, so we cannot simply add them together. The transformation puts the leaf values back on the log(odds) scale.
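As a small illustration, here is the transformation for one hypothetical leaf; the residual and probability values are made up for the sketch:

# Hypothetical leaf that ended up with two instances
leaf_residuals = [0.33, -0.67]      # residuals of the instances in this leaf
previous_probs = [0.67, 0.67]       # previously predicted probability for each instance

# Transformed output value, now on the log(odds) scale
output_value = sum(leaf_residuals) / sum(p * (1 - p) for p in previous_probs)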

Making Predictions

Now, making new predictions takes two steps:

  1. Get a log(odds) prediction for each instance in the training set.
  2. Convert that prediction into a probability.

The new log(odds) prediction for each instance in the training set is the base log(odds) plus the learning_rate multiplied by the output value of the leaf that instance falls into.

learning_rate is a hyperparameter used to scale the contribution of each tree; a smaller value reduces variance (overfitting) at the cost of some additional bias. We multiply each tree's predicted value by this number to prevent the model from overfitting the data.

Once we have computed the new log(odds) prediction, we convert it into a probability using the same procedure as before for turning log(odds) values into probabilities.
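A minimal sketch of these two steps for a single instance; the leaf output value below is a made-up number for illustration:

import math

learning_rate = 0.1
base_log_odds = math.log(4 / 2)   # our base estimate, ~0.7
leaf_output = -1.5                # hypothetical output value of the leaf this instance falls into

# Step 1: new log(odds) prediction
new_log_odds = base_log_odds + learning_rate * leaf_output

# Step 2: convert the log(odds) into a probability
new_prob = math.exp(new_log_odds) / (1 + math.exp(new_log_odds))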

Repeating the process and predicting on unseen data

Following this procedure, we compute the new residuals of the tree and build another tree to fit these updated residuals. The process is then repeated until the residuals are very small or a stopping threshold (such as a maximum number of trees) is reached.

The pseudo-code for making a new prediction on an unseen instance, using the six trees trained on our training set, would be:
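The original pseudo-code is not reproduced here; a sketch in Python, assuming each trained tree exposes a predict method that returns the (transformed) leaf output value for an instance x, could look like this:

import math

def predict(x, trees, base_log_odds, learning_rate=0.1):
    # Start from the base estimate and add every tree's scaled contribution
    log_odds = base_log_odds
    for tree in trees:                            # e.g. our six trained trees
        log_odds += learning_rate * tree.predict(x)
    # Convert the final log(odds) into a probability and threshold at 0.5
    probability = math.exp(log_odds) / (1 + math.exp(log_odds))
    return 1 if probability > 0.5 else 0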

Now that you have a basic understanding of the principles behind Gradient Boosting for Classification, let's get started with some code to solidify our understanding!

Using Scikit-Learn for Gradient Boosting Classification

The sample data we will use is scikit-learn's built-in breast cancer dataset. Let's get a few imports out of the way first:
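The original import cell is not reproduced here; based on the explanation below, it would look roughly like this:

import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report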

Explanation:

The code imports the libraries and modules required for the analysis. Pandas and NumPy are used for data manipulation, while scikit-learn provides the dataset, the model, cross-validation, and the evaluation metric: load_breast_cancer loads the breast cancer dataset, GradientBoostingClassifier builds the classification model, KFold performs cross-validation, and classification_report produces a detailed report of classification performance.

In short, we are just importing pandas, NumPy, our model, and a metric to assess its performance.
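Next, we load the data into a DataFrame. The original snippet is not reproduced here; continuing from the imports above and following the explanation below, it would look roughly like this:

# Load the breast cancer dataset and build a DataFrame from it
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target labels (the breast cancer diagnosis) as a new column "y"
df['y'] = data.target

# Display the first five rows
df.head()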

Explanation:

This code loads the breast cancer dataset from the sklearn library and builds a DataFrame with the feature names as columns. A new column called "y" containing the target labels (the breast cancer diagnosis) is added, and the first five rows of the DataFrame are displayed.

Since working with a DataFrame is simpler, we will convert the data to that format for convenience. You are welcome to omit this step.

Here, we define our features and labels and use 5-fold cross-validation to split the data into training and validation sets.
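The original snippet is not reproduced here; based on the explanation below, it would look roughly like this (the random_state value of 42 and the variable name gradient_booster are my assumptions, since the explanation does not give them):

# Separate features and target
X = df.drop('y', axis=1)
y = df['y']

# 5-fold cross-validation with shuffling for reproducibility
kf = KFold(n_splits=5, random_state=42, shuffle=True)

# Split the data into training and validation sets for each fold
for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

# Build the model with a learning rate of 0.1 and inspect its parameters
gradient_booster = GradientBoostingClassifier(learning_rate=0.1)
gradient_booster.get_params()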

Explanation:

This code defines X and y by separating the feature set (X) from the target variable (y) in the DataFrame df. A 5-fold cross-validation object, kf, is created with shuffling and a fixed random state for reproducibility. The for loop iterates over the splits produced by kf, assigning the training and validation indices to train_index and val_index; these indices are then used to split X and y into training and validation sets for each fold. Finally, a GradientBoostingClassifier is created with a learning rate of 0.1, and its parameters are retrieved with get_params().

Output:

 
{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'presort': 'deprecated',
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}   

There are quite a few parameters to consider, so I will focus on the most important ones here:

  • criterion: the function used to measure the quality of a split, i.e. to find the best feature and threshold for splitting the data
  • learning_rate: this parameter scales the contribution of each tree
  • max_depth: the maximum depth of each tree
  • n_estimators: the number of trees to be constructed
  • init: the initial estimator. By default, it is the log(odds) converted into a probability (as we explained earlier).

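The training snippet is not reproduced here; based on the explanation below, and reusing the fold variables and the gradient_booster name assumed in the earlier sketch, it would look roughly like this:

# Train the model on the training fold and evaluate it on the validation fold
gradient_booster.fit(X_train, y_train)
print(classification_report(y_val, gradient_booster.predict(X_val)))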
Explanation:

The code trains the gradient boosting model on the training data (X_train, y_train) and then assesses its performance on the validation data (X_val, y_val) by printing a classification report. For every class, the report provides metrics such as precision, recall, and F1-score.

Output:

 
              precision    recall  f1-score   support

           0       0.98      0.93      0.96        46
           1       0.96      0.99      0.97        67

    accuracy                           0.96       113
   macro avg       0.97      0.96      0.96       113
weighted avg       0.96      0.96      0.96       113

All right, 96% accuracy!