
How to Use LightGBM in Python

Introduction

The field of AI has seen tremendous advances, prompting the development of various algorithms to handle complex tasks. One such algorithm is LightGBM, short for Light Gradient Boosting Machine. LightGBM has been gaining popularity because of its efficiency, speed, and ability to handle large-scale datasets. In this article, we will look at what LightGBM is, how it works, and how to use it in Python to improve your ML models.

To understand LightGBM, we first need to understand the concept of gradient boosting:

Before we dive into LightGBM, it's essential to grasp the concept of gradient boosting. Gradient boosting is an ensemble learning method that combines multiple weak learners, often decision trees, to create a stronger predictive model. The key idea behind gradient boosting is to sequentially add new weak learners to the model, with each subsequent learner correcting the errors made by its predecessors. This iterative approach leads to an ensemble model that is more accurate and robust than individual base models.
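To make this loop concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss, using shallow scikit-learn decision trees as the weak learners (the data is synthetic and only for demonstration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction              # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                  # each new learner fits the errors of the ensemble so far
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))

Each new tree nudges the ensemble's predictions toward the targets, which is exactly the error-correcting behavior described above.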

The Birth of LightGBM: What is It?

LightGBM was developed by Guolin Ke, et al., at Microsoft in 2016. The motivation behind creating LightGBM was to address the limitations of traditional gradient boosting frameworks in terms of efficiency and scalability. Traditional gradient boosting methods build trees in a level-wise manner, which can be computationally expensive, especially when dealing with large datasets. LightGBM aimed to overcome these challenges and provide a faster, more memory-efficient solution for building gradient boosting models.

LightGBM is a gradient boosting framework developed by Microsoft. It belongs to the family of ensemble learning methods, which combine the predictions of several weak learners (often decision trees) to create a strong model. The term "gradient boosting" refers to the iterative process of adding new weak learners sequentially, where each new learner corrects the errors made by its predecessors.

Benefits of LightGBM

Speed and Efficiency: LightGBM is designed to be fast and memory-efficient. It uses a histogram-based approach to bin continuous feature values, which significantly reduces the memory footprint and speeds up the training process.
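As a small illustration, the number of histogram bins is controlled by the max_bin parameter, which is set when the dataset is constructed; lowering it trades a little split precision for speed and memory (a sketch with synthetic data):

import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# fewer bins means smaller histograms, less memory, and faster training
train_data = lgb.Dataset(X, label=y, params={"max_bin": 63})
booster = lgb.train({"objective": "binary", "verbose": -1},
                    train_data, num_boost_round=50)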

Handling Large Datasets: Because of its efficient design, LightGBM can handle large-scale datasets that may not fit into memory with other algorithms.

Feature Importance: LightGBM provides a straightforward way to compute feature importances, allowing you to gain insight into the most influential features in your dataset.
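For example, the scikit-learn wrapper exposes importances through the feature_importances_ attribute after fitting (a sketch on random data):

import numpy as np
import pandas as pd
import lightgbm as lgb

X = pd.DataFrame(np.random.rand(500, 3), columns=["a", "b", "c"])
y = (X["a"] > 0.5).astype(int)

model = lgb.LGBMClassifier(n_estimators=50, verbose=-1)
model.fit(X, y)

# importance per feature, counted by the number of splits that use it
for name, score in zip(X.columns, model.feature_importances_):
    print(name, score)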

Categorical Feature Support: Unlike some traditional gradient boosting libraries, LightGBM natively supports categorical features, eliminating the need for one-hot encoding.
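For instance, columns with the pandas "category" dtype are picked up automatically, with no one-hot encoding step (a small sketch):

import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({
    "city": pd.Categorical(["london", "paris", "tokyo", "paris"] * 50),
    "age": list(range(200)),
})
y = (df["age"] % 3 == 0).astype(int)

# 'category' dtype columns are handled natively as categorical features
model = lgb.LGBMClassifier(n_estimators=20, verbose=-1)
model.fit(df, y)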

Accurate Predictions: Despite its speed and efficiency, LightGBM does not compromise on model accuracy. It consistently delivers competitive performance compared to other gradient boosting frameworks.

Working Principle of LightGBM:

LightGBM achieves its efficiency through several key techniques:

Leaf-Wise Tree Growth: Unlike traditional level-wise (depth-wise) tree growth, LightGBM grows trees in a leaf-wise fashion. It selects the leaf node with the maximum delta loss to split next. For the same number of leaves, this reduces the loss more than level-wise growth, which contributes to improved model performance, although the resulting deeper, asymmetric trees can overfit small datasets if left unconstrained.
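In practice this means num_leaves, rather than tree depth, is the main complexity knob, with max_depth available as a safety cap (a configuration sketch):

import lightgbm as lgb

model = lgb.LGBMClassifier(
    num_leaves=31,          # main complexity knob for leaf-wise trees (default 31)
    max_depth=-1,           # -1 = no depth cap; set a positive value to curb overfitting
    min_child_samples=20,   # minimum samples per leaf, another overfitting guard
)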

Gradient-Based One-Sided Sampling (GOSS): During training, LightGBM uses GOSS, a technique that focuses on selecting important data instances while discarding less informative ones. This process reduces the number of samples used for each iteration, effectively speeding up the training process without sacrificing accuracy.
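GOSS is switched on through LightGBM's parameters; note that the parameter name has changed across versions (boosting_type="goss" in older releases, data_sample_strategy="goss" from version 4.0), so check the documentation for your installed version. A sketch:

params = {
    "objective": "binary",
    "data_sample_strategy": "goss",  # LightGBM >= 4.0; older versions: boosting_type="goss"
    "top_rate": 0.2,    # fraction of large-gradient instances kept each iteration
    "other_rate": 0.1,  # fraction sampled from the remaining small-gradient instances
}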

Exclusive Feature Bundling (EFB): LightGBM employs EFB to combine exclusive features, reducing the number of split points to be considered. This technique significantly accelerates the tree-building process, especially for datasets with numerous features.
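EFB is enabled by default and, per the LightGBM parameter documentation, can be toggled with enable_bundle (a sketch):

params = {
    "objective": "binary",
    "enable_bundle": True,  # default; set False to disable Exclusive Feature Bundling
}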

Features of LightGBM:

To tackle the efficiency problems of conventional gradient boosting, the LGBM or Light Gradient Boosting Machine uses two techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS excludes a significant portion of the data instances that have small gradients and uses only the remaining data to estimate the overall information gain. Since instances with large gradients play a larger role in the computation of information gain, GOSS can obtain accurate estimates of the information gain while using a much smaller dataset than other models.

With EFB, LightGBM bundles mutually exclusive features, i.e., features that rarely take non-zero values at the same time, to reduce the total number of features. This yields effective feature reduction without compromising the accuracy of split-point determination.

By combining these two techniques, LightGBM can speed up the training time of the algorithm many times over. LGBM can therefore be thought of as gradient boosting trees combined with EFB and GOSS. You can find more details in the official LightGBM documentation.

The main features of the LGBM model are as follows:

  • Higher accuracy and faster training speed.
  • Low memory usage.
  • Comparatively better accuracy than other boosting algorithms, and it handles overfitting much better when working with smaller datasets.
  • Support for parallel learning.
  • Works well with both small and large datasets.

With the features and benefits mentioned above, LGBM has become a default algorithm choice in ML competitions when working with tabular data, for both regression and classification problems.

Maths Behind LGBM

Decision trees are used to approximate a function from the input space X to the gradient space G. A training set with instances x1, x2, ..., xn is assumed, where each xi is a vector with s dimensions in the space X. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted g1, g2, ..., gn. The decision tree splits each node at the most informative feature, the one that yields the largest information gain. In this kind of model, the information gain is measured by the variance after splitting.
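For reference, the original LightGBM paper (Ke et al., 2017) writes this variance gain for splitting feature j at point d on a node with instance set O as follows, where n_O is the number of instances in O and the denominators count the instances falling to the left and right of the split:

$$
V_{j|O}(d) = \frac{1}{n_O} \left(
\frac{\big( \sum_{\{x_i \in O:\, x_{ij} \le d\}} g_i \big)^2}{n^j_{l|O}(d)}
+ \frac{\big( \sum_{\{x_i \in O:\, x_{ij} > d\}} g_i \big)^2}{n^j_{r|O}(d)}
\right)
$$

The tree chooses the feature and split point that maximize this gain; GOSS estimates the same quantity from the sampled subset of instances.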

Code for LGBM in Python:

To code a LightGBM (LGBM) model in Python, you'll first need to install the required library and then proceed with the code. LightGBM is a gradient boosting framework that provides fast and efficient implementations for ML tasks. You can install it using the following command:
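pip install lightgbm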

We will use the Titanic passengers dataset.

To use the Titanic dataset with the LightGBM model, you first need to load the dataset, preprocess it, and then train the model. Assuming you have the Titanic dataset in a CSV file named "titanic.csv" with the standard Kaggle columns, here is one way to build and train the LightGBM model (the column choices and imputation steps below are illustrative; adjust them to your copy of the dataset):
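import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset (adjust the path if your file lives elsewhere)
df = pd.read_csv("titanic.csv")

# Basic preprocessing: pick a few standard columns and handle missing values
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

X = df[features]
y = df["Survived"]

# Hold out 20% of the passengers for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a LightGBM classifier with mostly default settings
model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, verbose=-1)
model.fit(X_train, y_train)

# Evaluate on the held-out set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))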

Please ensure that the Titanic dataset CSV file is in the correct location or update the file path accordingly. Additionally, consider performing further preprocessing and feature engineering based on the specific characteristics of your dataset to improve the model's performance.

Output:

Accuracy: 0.8235294117647058

In this example, the accuracy is approximately 0.82, which means the LightGBM model correctly predicted around 82% of the passengers' survival status in the testing set.

NOTE: Your actual output will vary based on the specific data split, model training, and random initialization. The accuracy of the model depends on various factors, such as data preprocessing, feature engineering, hyperparameter tuning, and the quality of the dataset. It's important to keep in mind that model evaluation should not be based on a single run but rather on the average performance over multiple runs or through cross-validation.
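Continuing from the Titanic example above (reusing its X and y), a minimal cross-validation sketch looks like this:

import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds gives a more stable estimate than one split
model = lgb.LGBMClassifier(n_estimators=100, verbose=-1)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))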






