Online Payment Fraud Detection Using Machine Learning in Python

Paying online has become increasingly popular in modern times. Online payment is particularly convenient for the customer: it removes the hassle of carrying cash and handling change, and it saves time. However, good things can come with unpleasant ones: any payment application can be abused to commit fraud, which is what makes online payment methods risky and why online payment fraud detection is so important.

Detection of Online Payment Fraud with Machine Learning in Python

Here, we'll use machine learning in Python to address this problem. The dataset we'll be using contains the following columns:

step: a unit of time in the recorded data
type: the type of transaction
amount: the amount of the transaction
nameOrig: the customer who initiated the transaction
oldbalanceOrg, newbalanceOrig: the sender's balance before and after the transaction
nameDest: the recipient of the transaction
oldbalanceDest, newbalanceDest: the recipient's balance before and after the transaction
isFraud: whether the transaction is fraudulent (1) or not (0)
isFlaggedFraud: whether the transaction was automatically flagged as suspicious
Importing the Libraries and Dataset

The following libraries are utilized: pandas for data handling, Matplotlib and Seaborn for visualization, scikit-learn for splitting, modelling and evaluation, and XGBoost for the XGBClassifier.
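The original import listing is not reproduced in the text, so below is a minimal sketch of the setup the explanation refers to (pandas, Matplotlib/Seaborn, the scikit-learn classifiers and metrics, XGBoost, resample, and the data-loading call); the exact names and ordering in the original may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation and resampling utilities
from sklearn.metrics import roc_auc_score, confusion_matrix, recall_score, f1_score
from sklearn.utils import resample

# Notebook-only magic for inline plots (works in Jupyter, not in plain .py files)
%matplotlib inline

# Load the transactions dataset (replace 'file_path' with the actual CSV path)
data = pd.read_csv('file_path')
data.head()
```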
Explanation: This part of the code imports the libraries needed for data analysis and machine learning, sets up the visualization environment, and brings in the specific classifiers: XGBClassifier, LogisticRegression, RandomForestClassifier, and SVC. Later, the code splits the data into training and testing sets, trains the models, and assesses the results with metrics such as ROC-AUC, the confusion matrix, recall, and F1 score. It also includes the %matplotlib inline magic command for inline charting in Jupyter notebooks and the resample utility for handling imbalanced datasets. The dataset includes features such as the payment type, the old balance, the amount paid, the destination name, and so on.

Explanation: The dataset is read into a Pandas DataFrame with pd.read_csv('file_path'); replace 'file_path' with the actual path to your CSV file so the code runs properly. data.head() then displays the first few rows, giving a preview of the loaded data.

Output: (first few rows of the DataFrame)

To check basic information about the data, we will use the info() method as shown below.

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype
---  ------          -----
 0   step            int64
 1   type            object
 2   amount          float64
 3   nameOrig        object
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64
 10  isFlaggedFraud  int64
dtypes: float64(5), int64(3), object(3)

Let's also examine the data's count, mean, minimum, and maximum values with describe().

Output: (summary statistics of the numeric columns)

Data Visualization

In this section we will try to understand and compare all the columns. Let's count the columns containing the different data types: float, integer, and categorical.

Explanation: In the first part, categorical variables are identified by checking data.dtypes against the 'object' data type. This produces a boolean Series, obj, with True for object columns; its index (the column names where the condition is True) is extracted, and the length of that list gives the number of categorical variables. The second part identifies integer variables in the same way: the boolean Series int_ is built for integer data types, its index is extracted, and the number of integer variables is printed. The third part applies the same approach to float variables and prints their count.

Output:

Categorical variables: 3
Integer variables: 2
Float variables: 5

Using the Seaborn library, let's examine the count plot of the payment 'type' column.
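The plotting snippet itself is not reproduced in the text; a minimal sketch of the count plot, assuming the DataFrame is named data:

```python
# Count how many transactions fall into each payment type
sns.countplot(x='type', data=data)
plt.show()
```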
Explanation: Seaborn's countplot function was created specifically to count the occurrences of each category in categorical data. It draws one bar per category, showing the number of observations that fall into it.
Output: (count plot of the transaction types)

A bar plot can also be used to analyze the type and amount columns at the same time, as in the sketch below.
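The corresponding snippet is not shown in the text; a minimal sketch, assuming the columns used are 'type' and 'amount':

```python
# Average transaction amount per payment type
sns.barplot(x='type', y='amount', data=data)
plt.show()
```

By default, Seaborn's barplot aggregates the y values with the mean, so each bar shows the average amount for the corresponding transaction type.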
Output: (bar plot of amount by transaction type)

Let us now examine how the data is distributed between the two prediction classes.

Explanation: The code applies the value_counts() method to the 'isFraud' column of the DataFrame data. For a categorical column, this method returns a Series containing the count of each unique value. Here the 'isFraud' column holds binary values (0 for non-fraud, 1 for fraud), so the output shows how many transactions fall into each class and gives an idea of the class balance across the dataset.

Output:

isFraud
0    6354407
1       8213

The two classes are clearly far from equal: fraudulent transactions make up only a tiny fraction of the records, so plain accuracy on its own will not be a reliable measure of model quality.

Let's now view the distribution of the step column with a distribution plot.

Explanation: plt.figure(figsize=(15, 6)) sets the size of the figure to be created: 15 units wide and 6 units tall. sns.distplot(data['step'], bins=50) draws a distribution plot of the step column; the bins=50 argument sets the number of histogram bins. The resulting figure combines a histogram of the data with a kernel density estimate. (In newer Seaborn versions, histplot or displot replaces the deprecated distplot.)

Output: (distribution of the step column)

Let's now use a heatmap to examine the correlation between the different features.

Output: (correlation heatmap of the features)

Data Preprocessing

The following are included in this step:

Encoding of the categorical 'type' column (a code sketch follows this list)
Dropping the columns that are not needed for modelling
Splitting the data into training and testing sets
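The encoding snippet is not reproduced in the text; a minimal sketch consistent with the explanation that follows (the result name data_new is an assumption):

```python
# One-hot encode the 'type' column, dropping the first level to avoid multicollinearity
type_new = pd.get_dummies(data['type'], drop_first=True)
data_new = pd.concat([data, type_new], axis=1)
data_new.head()
```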
Explanation: pd.get_dummies(data['type'], drop_first=True) generates dummy variables, turning the categorical 'type' column into binary columns; drop_first=True drops the first level to prevent multicollinearity. pd.concat([data, type_new], axis=1) then concatenates the newly created dummy columns (type_new) with the original DataFrame along the columns (axis=1).

Output: (DataFrame with the new dummy columns)

Now that the encoding is complete, we can remove the columns that are no longer needed and examine the shape of the extracted data.

Explanation: The categorical and identifier columns that have already been encoded or that carry no predictive value (such as 'type' and the name columns) are dropped, and the remaining columns are separated into the feature matrix X and the target vector y (the 'isFraud' column). Printing their shapes confirms the number of rows and features.

Output:

((6362620, 10), (6362620,))

Let's now divide the data into two parts: training and testing; a sketch of the split follows.
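The split itself is not shown in the text; a minimal sketch using scikit-learn's train_test_split (the test size and random seed are assumptions, not values from the original):

```python
# Hold out a portion of the data for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```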
Model Training

Since the prediction task is a classification problem, the following models will be applied: LogisticRegression, XGBClassifier, and RandomForestClassifier.

RandomForestClassifier: The Random Forest classifier builds a collection of decision trees, each trained on a randomly chosen portion of the training data, and then combines the votes of the individual trees to determine the final prediction.

Output:

LogisticRegression() :
Training Accuracy :  0.8873981954950323
Validation Accuracy :  0.8849953734622176

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...) :
Training Accuracy :  0.9999817672497334
Validation Accuracy :  0.9994777637994892

RandomForestClassifier(criterion='entropy', n_estimators=7, random_state=7) :
Training Accuracy :  0.9999992716004644
Validation Accuracy :  0.9650098729693373

Model Evaluation

The best-performing model is the XGBClassifier. For it, let's plot the confusion matrix; a sketch follows.

Output: (confusion matrix of the XGBClassifier predictions)
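Neither the training loop nor the confusion-matrix code is reproduced in the text, so here is a consolidated sketch of both steps. The Random Forest hyperparameters come from the printed output above; the use of roc_auc_score for the "Training/Validation Accuracy" figures and of ConfusionMatrixDisplay for the plot are assumptions.

```python
from sklearn.metrics import ConfusionMatrixDisplay

# The three classifiers reported in the output above
models = [
    LogisticRegression(),
    XGBClassifier(),
    RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=7),
]

for model in models:
    model.fit(X_train, y_train)
    # Score with ROC-AUC on the predicted fraud probabilities (assumed metric)
    train_score = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    valid_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f'{model} :')
    print('Training Accuracy : ', train_score)
    print('Validation Accuracy : ', valid_score)

# Confusion matrix for the best-performing model (XGBClassifier)
ConfusionMatrixDisplay.from_estimator(models[1], X_test, y_test)
plt.show()
```

Given how imbalanced the classes are, ROC-AUC, recall, and the confusion matrix are more informative here than raw accuracy.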