Online Payment Fraud Detection Using Machine Learning in Python

Online payments are becoming increasingly popular. They are convenient for customers: they save time and remove the need to carry cash. However, this convenience comes with risks.

Any payment application can be exploited to commit fraud, which makes online payments risky. Detecting fraudulent transactions is therefore crucial.

Python Machine Learning for the Detection of Online Payment Fraud

Here, we'll use Python machine learning to address this problem.

These columns are part of the dataset that we'll be using:

S. No.	Feature	Description
1	step	Time step at which the transaction occurred
2	type	Kind of transaction carried out
3	amount	Total amount of the transaction
4	nameOrig	Account from which the transaction originates
5	oldbalanceOrg	Sender's account balance before the transaction
6	newbalanceOrig	Sender's account balance after the transaction
7	nameDest	Account that receives the transaction
8	oldbalanceDest	Recipient's account balance before the transaction
9	newbalanceDest	Recipient's account balance after the transaction
10	isFraud	Target label to predict: 1 for fraud, 0 otherwise

Bringing in Datasets and Libraries

The following libraries are utilized:

Pandas: Loads the data into a 2D DataFrame and provides methods that complete many analysis tasks in a single call.
Matplotlib/Seaborn: Used for data visualization.
Numpy: Numpy arrays make large numerical computations fast and efficient.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic that renders plots inline; remove it when running as a plain script
%matplotlib inline
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# roc_auc_score is aliased as ras for brevity
from sklearn.metrics import roc_auc_score as ras
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score, f1_score
from sklearn.utils import resample

Explanation:

This snippet only imports what the rest of the tutorial needs; it does not load data or train anything yet. It brings in NumPy and Pandas for data handling, Matplotlib and Seaborn for plotting, train_test_split for splitting the data, the classifiers XGBClassifier, LogisticRegression, SVC, and RandomForestClassifier, and the metrics roc_auc_score (aliased as ras), confusion_matrix, recall_score, and f1_score. The %matplotlib inline magic command renders plots inline in Jupyter notebooks, and resample is a utility that can be used to rebalance unbalanced datasets.

Features such as payment type, old balance, amount paid, destination name, etc., are included in the dataset.

data = pd.read_csv('file_path')
data.head()

Explanation:

The pd.read_csv function reads a CSV file into a Pandas DataFrame. Here 'file_path' is only a placeholder: replace it with the actual path of your CSV file, or the call will fail because no such file exists. data.head() then shows the first five rows of the DataFrame, giving a quick preview of the loaded data.

Output:


To check the column names, data types, and row count of the data, we use the info() method:
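The call in question is data.info(). A minimal sketch, assuming the DataFrame is named data as in the loading step; the three rows here are made-up stand-ins so the snippet runs without the CSV file. On the full dataset the same call yields the summary shown under Output.

```python
import pandas as pd

# Hypothetical stand-in rows; in the tutorial, `data` comes from pd.read_csv.
data = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.00, 229133.94],
    "isFraud": [0, 1, 0],
})

# Prints the index range, the column names, and the dtype of each column.
data.info()
```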

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)

Let's examine summary statistics such as the count, mean, minimum, and maximum of each numeric column.
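In Pandas these statistics come from the describe() method. A minimal sketch on a made-up two-column frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
data = pd.DataFrame({
    "amount": [100.0, 250.0, 400.0],
    "isFraud": [0, 0, 1],
})

# describe() returns count, mean, std, min, quartiles, and max per numeric column.
summary = data.describe()
print(summary)
```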

Output:

Data Visualization

In this section we explore the individual columns and how they relate to one another.

Let's first count how many columns hold each data type: categorical, integer, and float.

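# Columns stored as strings (dtype 'object') are treated as categorical
obj = (data.dtypes == 'object')
object_cols = list(obj[obj].index)
print("Categorical variables:", len(object_cols))

# Use the exact dtype names reported by data.info();
# a bare 'int' can resolve to a different width on some platforms
int_ = (data.dtypes == 'int64')
num_cols = list(int_[int_].index)
print("Integer variables:", len(num_cols))

fl = (data.dtypes == 'float64')
fl_cols = list(fl[fl].index)
print("Float variables:", len(fl_cols))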

Explanation:

In the first section, comparing data.dtypes with 'object' produces a boolean Series named obj that is True for every column whose data type is 'object'. Indexing obj with itself keeps only the True entries, and .index returns their column names. The length of that list is the number of categorical variables.

The second section identifies integer columns the same way: it builds the boolean Series int_, extracts the matching column names, and prints how many there are.

The third section repeats the approach for float columns and prints their count.

Output:

Categorical variables: 3
Integer variables: 2
Float variables: 5

Using the Seaborn library, let's examine the count plot of the Payment type column.
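The plotting call matching the explanation that follows is sns.countplot(x='type', data=data). A minimal sketch on a few made-up rows, assuming Seaborn is installed:

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows; the real `data` has millions of transactions.
data = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "PAYMENT", "CASH_OUT"],
})

# One bar per transaction type; bar height is the number of rows of that type.
ax = sns.countplot(x="type", data=data)
ax.set_title("Transactions per type")
# In a notebook the figure renders automatically; in a script, call plt.show().
```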

Explanation:

sns: The alias under which the Seaborn library, a popular statistical visualization tool, was imported.

countplot: A Seaborn function that counts the occurrences of each category in a categorical column and draws one bar per category.

x='type': Places the categorical variable 'type' on the x-axis, so the plot shows how many rows fall into each transaction type.
data=data: Specifies the DataFrame (data) from which the plot takes its values.

Output:

A bar plot can be used to analyze the type and amount columns together.
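The matching call is sns.barplot(x='type', y='amount', data=data). A minimal sketch on made-up rows; note that by default barplot draws the mean of the y variable for each category:

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows with two transactions per type.
data = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER"],
    "amount": [100.0, 300.0, 800.0, 1200.0],
})

# Bar height is the mean amount per transaction type (200 and 1000 here).
ax = sns.barplot(x="type", y="amount", data=data)
# In a notebook the figure renders automatically; in a script, call plt.show().
```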

Explanation:

sns: The alias under which the Seaborn library was imported.
barplot: A Seaborn function that uses bars to show the relationship between a numeric variable and a categorical variable. By default, each bar's height is the mean of the numeric variable within that category.
x='type': Places the categorical variable 'type' on the x-axis.
y='amount': Uses the numeric variable 'amount' for the bar heights on the y-axis.
data=data: Specifies the DataFrame from which the plot takes its values.

Output:

Let us examine how the rows are distributed between the two target classes.
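The call described in the explanation is data['isFraud'].value_counts(). A minimal sketch on a made-up column:

```python
import pandas as pd

# Hypothetical stand-in: four legitimate rows and one fraudulent row.
data = pd.DataFrame({"isFraud": [0, 0, 0, 0, 1]})

# Returns a Series indexed by class label, sorted by count in descending order.
counts = data["isFraud"].value_counts()
print(counts)
```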

Explanation:

The code applies the value_counts() method to the 'isFraud' column of the DataFrame data. Called on a column, this method returns a Series with the count of each unique value. Here the column holds binary values (0 for non-fraud and 1 for fraud), so the result has one count per class.

These counts show how fraud and non-fraud cases are distributed across the dataset, which tells us how balanced the classes are.

Output:

isFraud
0    6354407
1       8213

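The classes are heavily imbalanced: only 8,213 of the 6,362,620 transactions (about 0.13%) are fraudulent. This tutorial trains on the data as-is, without resampling, so the models are scored with ROC-AUC, which is more informative than plain accuracy when one class dominates.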
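If the class counts call for rebalancing, the resample utility imported earlier can downsample the majority class. A minimal sketch on made-up rows; whether to rebalance at all is a modeling choice, and the training code later in this article uses the data as-is:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in: six legitimate rows and two fraudulent rows.
data = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "isFraud": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = data[data["isFraud"] == 0]
minority = data[data["isFraud"] == 1]

# Draw, without replacement, as many majority rows as there are minority rows.
majority_down = resample(majority,
                         replace=False,
                         n_samples=len(minority),
                         random_state=42)

balanced = pd.concat([majority_down, minority])
print(balanced["isFraud"].value_counts())
```

In practice, rebalancing is normally applied to the training split only, so that the test set keeps the real class ratio.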

Let's now use a histogram with a density curve to view the distribution of the step column.

plt.figure(figsize=(15, 6))
# sns.distplot is deprecated in recent Seaborn releases;
# histplot with kde=True draws a similar histogram-plus-density view
sns.histplot(data['step'], bins=50, kde=True)

Explanation:

plt.figure(figsize=(15, 6)) sets the size of the figure: 15 inches wide and 6 inches high.

sns.histplot(data['step'], bins=50, kde=True) plots the distribution of the step column. The bins=50 parameter sets the number of bins (intervals) in the histogram, and kde=True overlays a kernel density estimate on top of it.

Output:

Let's now use a heatmap to examine the correlation between the numeric features.

numeric_data = data.select_dtypes(include=['number'])
plt.figure(figsize=(12, 6))
sns.heatmap(numeric_data.corr(),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)

Output:

Data Preprocessing

The following are included in this step:

Type column encoding
Removing unnecessary columns such as nameOrig and nameDest
Data Splitting

type_new = pd.get_dummies(data['type'], drop_first=True)
data_new = pd.concat([data, type_new], axis=1)
data_new.head()

Explanation:

The function pd.get_dummies(data['type'], drop_first=True) generates dummy variables by transforming categorical values into binary columns for the 'type' column. To prevent multicollinearity, the first level is dropped when drop_first=True is used.

pd.concat([data, type_new], axis=1): Concatenates the newly constructed dummy variable columns ('type_new') along the columns (axis=1) with the original DataFrame 'data'.

Output:
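To make the effect of drop_first=True concrete, here is a minimal sketch on a made-up 'type' column with three categories. get_dummies orders the categories alphabetically, so the dropped level is CASH_IN:

```python
import pandas as pd

# Hypothetical stand-in with three transaction types.
data = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "TRANSFER", "CASH_OUT"],
})

# One indicator column per category, minus the first one (CASH_IN).
type_new = pd.get_dummies(data["type"], drop_first=True)

# Append the indicator columns to the original frame, side by side.
data_new = pd.concat([data, type_new], axis=1)

print(list(type_new.columns))  # ['CASH_OUT', 'TRANSFER']
```

A row where every indicator column is False therefore belongs to the dropped category.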

Now that the encoding is complete, we can remove the columns that are no longer needed and separate the features from the target:

X = data_new.drop(['isFraud', 'type', 'nameOrig', 'nameDest'], axis=1)
y = data_new['isFraud']

Let's examine the shape of the extracted features and target.
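The shapes can be read from the shape attribute of X and y. A minimal sketch on a made-up data_new frame with three rows (the column values are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for the encoded frame built in the previous step.
data_new = pd.DataFrame({
    "amount": [100.0, 250.0, 400.0],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT"],
    "nameOrig": ["C1", "C2", "C3"],
    "nameDest": ["M1", "C9", "M2"],
    "isFraud": [0, 1, 0],
    "TRANSFER": [False, True, False],
})

X = data_new.drop(["isFraud", "type", "nameOrig", "nameDest"], axis=1)
y = data_new["isFraud"]

# (rows, columns) for the features and (rows,) for the target.
shapes = (X.shape, y.shape)
print(shapes)  # ((3, 2), (3,))
```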

Explanation:

X.shape and y.shape return the dimensions of the feature matrix and the target vector. X reports (rows, columns), while y reports a single dimension because it is a Series.

Output:

((6362620, 10), (6362620,))

Let's now split the data into training and testing sets, holding out 30% of the rows for testing.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
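With very few positive rows, a purely random split can leave the two sets with slightly different fraud rates. A variant sketch, not the split used above, passes stratify=y so that both sets keep the same class ratio. The data here is made up (10 fraud rows out of 100):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 100 rows, 10% of them labeled as fraud.
X = pd.DataFrame({"amount": np.arange(100, dtype=float)})
y = pd.Series([1] * 10 + [0] * 90, name="isFraud")

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_train.mean(), y_test.mean())
```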

Model Training

Since this is a classification problem, the following models are considered:

Logistic Regression: Estimates the probability that a given sample belongs to a particular class.
XGBClassifier: Gradient-boosted decision trees. The trees are built sequentially, with each new tree fitted to correct the errors of the trees before it.
SVC: Searches an N-dimensional space for the hyperplane that separates the classes with the widest margin. It is imported above but is not in the list of models trained below.
RandomForestClassifier: Trains a collection of decision trees, each on a randomly drawn sample of the training data, and combines their votes to determine the outcome.

models = [LogisticRegression(), XGBClassifier(),
          RandomForestClassifier(n_estimators=7,
                                 criterion='entropy',
                                 random_state=7)]

for model in models:
    model.fit(X_train, y_train)
    print(f'{model} : ')

    # ras is roc_auc_score, so these scores are ROC-AUC values, not accuracy
    train_preds = model.predict_proba(X_train)[:, 1]
    print('Training ROC-AUC: ', ras(y_train, train_preds))

    y_preds = model.predict_proba(X_test)[:, 1]
    print('Validation ROC-AUC: ', ras(y_test, y_preds))
    print()

Output:

LogisticRegression() : 
Training ROC-AUC:  0.8873981954950323
Validation ROC-AUC:  0.8849953734622176
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...) : 
Training ROC-AUC:  0.9999817672497334
Validation ROC-AUC:  0.9994777637994892
RandomForestClassifier(criterion='entropy', n_estimators=7, random_state=7) : 
Training ROC-AUC:  0.9999992716004644
Validation ROC-AUC:  0.9650098729693373

Model Evaluation

XGBClassifier has the highest validation score of the three models, so let's plot its confusion matrix.

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
from sklearn.metrics import ConfusionMatrixDisplay

# models[1] is the XGBClassifier trained above
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(models[1], X_test, y_test,
                                      cmap="Blues", values_format="d", ax=ax)
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('True Labels')
ax.set_title('Confusion Matrix')
plt.show()

Output:
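The recall_score and f1_score functions imported at the top summarize the same information as the confusion matrix in a single number each. A minimal sketch on made-up labels (not the model's actual predictions) showing how they relate to the matrix cells:

```python
from sklearn.metrics import confusion_matrix, recall_score, f1_score

# Hypothetical true labels and predictions for eight transactions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = TP / (TP + FN): the share of actual fraud cases that were caught.
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)

print(tn, fp, fn, tp, recall, f1)
```

For fraud detection, recall is often the number to watch, since a missed fraud case (a false negative) is usually costlier than a false alarm.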