Credit Score Prediction using Machine Learning

Credit scores are central to how lending institutions assess creditworthiness, and they affect everything from getting a mortgage to renting an apartment. With the growth of large datasets and machine learning, lenders can build scoring models that analyze far more data than traditional credit scoring models and, given good data, often predict default more accurately. This article explores credit score prediction using machine learning, including its benefits and challenges, and then walks through a Python implementation.

Credit Score and its Importance

A credit score is a numerical representation of a person's creditworthiness based on their credit history, income, and other financial factors. It is a critical input for lenders and credit card companies when deciding whether to approve a loan or extend credit. In the widely used FICO model, scores range from 300 to 850, with higher scores indicating better creditworthiness. As a rough guide, a score above 700 is typically considered good, while a score below 600 is considered poor.

Benefits of Machine Learning in Credit Scoring

Machine learning models are trained on large amounts of historical data, which lets them identify patterns that traditional credit scoring models miss and, in many cases, predict creditworthiness more accurately. They can also take into account a broader range of data, including non-traditional data sources such as social media.

Machine learning can also help reduce some forms of bias. Traditional credit scoring models and manual underwriting can carry biases related to factors such as race or gender. A machine learning model does not start with preconceived opinions, but it learns from historical data, so it can reproduce any bias present in that data. With careful feature selection and fairness testing, it can support fairer credit-scoring decisions.

Machine learning algorithms are also more efficient than traditional credit scoring models. They can analyze vast amounts of data in a matter of seconds, providing near-instantaneous credit-scoring decisions. This makes the lending process faster and more efficient for both borrowers and lenders.

Challenges of Machine Learning in Credit Scoring

While machine learning has many benefits for credit scoring, there are also challenges to consider.

One of the main challenges is the complexity of machine learning models. Machine learning algorithms are often black boxes, making it difficult for lenders to understand how the algorithm arrived at its credit scoring decision. This can make it difficult for borrowers to understand why they were denied credit or how they can improve their credit scores.

Another challenge is the need for large amounts of high-quality data. Machine learning algorithms rely on large amounts of data to make accurate predictions. However, if the data is of poor quality or limited in scope, the algorithm may not be able to make accurate predictions.

Privacy is also a concern when using machine learning algorithms for credit scoring. Machine learning models require access to personal and financial data, which can be a concern for borrowers. Lenders must take steps to ensure that borrower data is protected and secure.

Python Implementation

Now we will implement a credit scoring model in Python.

Objective

Based on a client's monthly customer profile, the goal is to estimate the likelihood that they won't pay off their credit card bill in the future. The binary target variable is derived by tracking performance over the 18 months following the most recent credit card statement, and a default event is deemed to have occurred if the consumer does not make the required payment within 120 days of the statement date.

About Data

Each customer's aggregated profile characteristics at each statement date are contained in the dataset. Features fall into the following broad groups after being anonymized and normalized:

D_* = Delinquency variables
S_* = Spend variables
P_* = Payment variables
B_* = Balance variables
R_* = Risk variables

The following features are categorical:

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

Importing Libraries

import os
import sys
import glob

import numpy as np
import pandas as pd

import matplotlib.pylab as plt
from matplotlib_venn import venn2
import seaborn as sns

from tqdm import tqdm
from itertools import cycle

from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import feature_selection

import lightgbm as lgb
import xgboost as xgb
import catboost as cat

import optbinning

pd.set_option("display.max_columns", None)

plt.style.use("ggplot")
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])

Loading Data

%%time
train_dataframe = pd.read_feather('train.feather')
test_dataframe = pd.read_feather('test.feather')
train_labels = pd.read_csv("train_labels.csv")
train_dataframe.shape, test_dataframe.shape

Output:

We have 5531451 rows and 190 columns in the Training dataset.

Target Distribution

fig, ax = plt.subplots(figsize=(10,5))
sns.countplot(x=train_labels.target)
plt.show()

Output:

Here, 0 means non-default and 1 means default.

EDA

# checking train test customers

fig, ax = plt.subplots(figsize=(10,5))
set1 = set(train_dataframe.customer_ID.unique())
set2 = set(test_dataframe.customer_ID.unique())

venn2([set1, set2], ('train', 'test'))
plt.show()

Output:

We can see that no customers appear in both the train and test data.

# S_2 date feature
train_dataframe['S_2'] = pd.to_datetime(train_dataframe['S_2'])
test_dataframe['S_2'] = pd.to_datetime(test_dataframe['S_2'])

train_dataframe['S_2'].min(), train_dataframe['S_2'].max()

Output:

The train and test timelines do not overlap either.

# number of customer profiles per statement date (timeline)

fig, ax = plt.subplots(figsize=(20,5))
train_dataframe.groupby("S_2")['customer_ID'].count().plot()
plt.title("train profiles vs timeline")
plt.show()

fig, ax = plt.subplots(figsize=(20,5))
test_dataframe.groupby("S_2")['customer_ID'].count().plot()
plt.title("test profiles vs timeline")
plt.show()

Output:

We can see that test profiles increased from October to April while train profiles remained consistent.

# check each customer profile length

fig, ax = plt.subplots(figsize=(20,5))
sns.countplot(x=train_dataframe.groupby("customer_ID")['customer_ID'].count().values)
plt.title("train customer profile length")
plt.show()

fig, ax = plt.subplots(figsize=(20,5))
sns.countplot(x=test_dataframe.groupby("customer_ID")['customer_ID'].count().values)
plt.title("test customer profile length")
plt.show()

Output:

We can see that the distributions of the train and test profile lengths are similar.

Feature Selection

# Training data preparation
# keep the last row per customer, i.e. the latest profile
# (this assumes rows are ordered by S_2 within each customer)

train_dataframe = train_dataframe.groupby("customer_ID").tail(1).reset_index(drop=True)
test_dataframe = test_dataframe.groupby("customer_ID").tail(1).reset_index(drop=True)

# Merge with targets
train_dataframe = train_dataframe.merge(train_labels, on='customer_ID', how='left')

target_col = 'target'
drop_cols = ['customer_ID', 'S_2', target_col]
cat_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
train_cols = [col for col in train_dataframe.columns if col not in drop_cols]
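
The `groupby(...).tail(1)` call above keeps the last row of each customer in the existing row order, so it returns the latest statement only if the rows are already sorted by date. As a minimal sketch of the same idea that does not rely on row order, the plain-Python example below picks the most recent record per customer explicitly. The toy records and the helper name `latest_per_customer` are hypothetical and only serve as illustration.

```python
from datetime import date

# Hypothetical toy statements: (customer_ID, statement date, P_2 value).
rows = [
    ("c1", date(2017, 3, 1), 0.91),
    ("c2", date(2018, 1, 15), 0.40),
    ("c1", date(2018, 3, 1), 0.88),
    ("c2", date(2017, 11, 15), 0.52),
]

def latest_per_customer(records):
    """Keep, for each customer, the record with the most recent date."""
    latest = {}
    for cust, stmt_date, value in records:
        # Replace the stored record only when a newer statement is seen.
        if cust not in latest or stmt_date > latest[cust][1]:
            latest[cust] = (cust, stmt_date, value)
    return latest

latest = latest_per_customer(rows)
print(latest["c1"])
print(latest["c2"])
```

With pandas, sorting by `['customer_ID', 'S_2']` before `groupby("customer_ID").tail(1)` would give the same guarantee.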

Information Value

Information value is one of the most useful techniques for selecting important variables in a predictive model. It helps to rank variables on the basis of their importance.

If the IV statistic is:

Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads)
0.02 to 0.1, then the predictor has only a weak relationship to the Goods/Bads odds ratio
0.1 to 0.3, then the predictor has a medium-strength relationship to the Goods/Bads odds ratio
0.3 to 0.5, then the predictor has a strong relationship to the Goods/Bads odds ratio.
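
The rule of thumb listed above can be written as a small lookup. This is only a sketch: the function name `iv_strength` is hypothetical, and how the exact boundary values (0.02, 0.1, 0.3, 0.5) are assigned to a band is an assumption, since the list does not say which side they belong to.

```python
def iv_strength(iv):
    """Map an Information Value to the rule-of-thumb label listed above."""
    if iv < 0.02:
        return "not useful"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    # Values above 0.5 are outside the ranges given in the list.
    return "above 0.5 (outside the listed ranges)"

for value in (0.01, 0.05, 0.2, 0.4, 0.9):
    print(value, "->", iv_strength(value))
```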

Now, for selecting features, we are calculating IV values for each feature.

iv_score_dict = {}
for col in tqdm(train_cols):
    # bin categorical and numerical features with the matching dtype
    dtype = 'categorical' if col in cat_cols else 'numerical'
    optb = optbinning.OptimalBinning(dtype=dtype)
    optb.fit(train_dataframe[col], train_dataframe['target'])
    binning_table = optb.binning_table
    binning_table.build()
    iv_score_dict[col] = binning_table.iv

iv_score_dataframe = pd.Series(iv_score_dict)
iv_score_dataframe.sort_values(ascending=False, inplace=True)

Output:

# top 10 features by IV
iv_score_dataframe.head(10)

Output:

# iv score vs. features
fig, ax = plt.subplots(figsize=(20,5))
iv_score_dataframe.reset_index(drop=True).plot()
plt.show()

Output:

We can observe that the top 75 features have an IV value above 0.5, so we treat these 75 features as strong predictors.

Weight of Evidence

The weight of evidence (WOE) indicates how well an independent variable predicts the dependent variable. Because it originated in credit scoring, it is usually described as a measure of how well a variable separates good customers from bad customers. "Bad customers" are those who miss a loan payment, and "good customers" are those who repay their loans.

WOE = ln(Distribution of Goods / Distribution of Bads)

Distribution of Goods - % of Good Customers in a particular group
Distribution of Bads - % of Bad Customers in a particular group
ln - Natural Log

Steps of Calculating WOE

For a continuous variable, split the data into ten parts (or fewer, depending on the distribution).
Calculate the number of events and non-events in each group (bin).
Calculate the % of events and % of non-events in each group.
Calculate WOE by taking the natural log of the % of non-events divided by the % of events.
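
The steps above can be sketched by hand on a toy example. The bin labels and counts below are hypothetical. The Information Value is computed with its standard definition, the sum over bins of (% of non-events - % of events) * WOE, which the text above does not spell out. A bin with zero events or zero non-events would need smoothing before taking the log; this sketch does not handle that case.

```python
import math

# Hypothetical bins: (label, number of non-events, number of events).
bins = [("low", 100, 60), ("mid", 300, 30), ("high", 600, 10)]

def woe_iv(bins):
    """Return per-bin (label, WOE, IV contribution) and the total IV."""
    total_non_events = sum(ne for _, ne, _ in bins)
    total_events = sum(e for _, _, e in bins)
    table, iv = [], 0.0
    for label, ne, e in bins:
        pct_non_events = ne / total_non_events  # share of all non-events
        pct_events = e / total_events           # share of all events
        woe = math.log(pct_non_events / pct_events)
        contribution = (pct_non_events - pct_events) * woe
        iv += contribution
        table.append((label, woe, contribution))
    return table, iv

table, iv = woe_iv(bins)
for label, woe, contribution in table:
    print(f"{label:>4}  WOE={woe:+.4f}  IV part={contribution:.4f}")
print("IV =", round(iv, 4))
```

In this toy data the "low" bin holds a larger share of events than of non-events, so its WOE is negative, while the "high" bin has a positive WOE.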

For one feature, we'll look at the WOE values and the WOE plot.

col = 'P_2'
optb = optbinning.OptimalBinning(dtype='numerical')
optb.fit(train_dataframe[col], train_dataframe['target'])
binning_table = optb.binning_table
display(binning_table.build())

Output:

P_2 is a continuous feature, so it is split into 15 bins.
Each bin has non-event and event counts and rates.
Each bin has WOE and IV values.
Missing values are placed in a separate 16th bin.

Output:

From this WOE plot, we can observe that the event rate decreases as the bin index increases.
The black dotted line rises across the bins; assuming it traces WOE (the log of the non-event share over the event share), that is the expected mirror image of a falling event rate.

# WOE plots for top 10 features
top10_features = iv_score_dataframe[:10].index.values

for col in top10_features:
    print("-"*100)
    print("="*100)
    print("################ Feature Name : ", col)
    print("\n\n")

    if col in cat_cols:
        optb = optbinning.OptimalBinning(dtype='categorical')
        optb.fit(train_dataframe[col], train_dataframe['target'])
    else:
        optb = optbinning.OptimalBinning(dtype='numerical')
        optb.fit(train_dataframe[col], train_dataframe['target'])

    binning_table = optb.binning_table
    display(binning_table.build())
    display(binning_table.plot(metric="woe"))

Output:

Selecting features that have IV values > 0.5.

selected_features = iv_score_dataframe[iv_score_dataframe > 0.5].index.values
cat_cols = [col for col in cat_cols if col in selected_features]
train_cols = [col for col in train_dataframe.columns if col in selected_features]

Correlation Heatmap

top_cols = [col for col in selected_features[:20] if col in train_cols]
corr_df = train_dataframe[top_cols].corr()
plt.figure(figsize=(25, 9))
sns.heatmap(corr_df,annot=True ,cmap=sns.color_palette("BrBG",2));
plt.show()

Output:

def drop_feature_selection(row, col, corr, row_iv, col_iv):
    if row_iv >= col_iv:
        return col
    else:
        return row

cor_matrix = train_dataframe[train_cols].corr().abs()
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(np.bool_))
corr_df = upper_tri.stack().reset_index()
corr_df.columns = ['row', 'col', 'corr']
corr_df = corr_df.drop_duplicates()
corr_df = corr_df.sort_values('corr', ascending=False)
corr_df = corr_df.query("corr >= 0.8")
corr_df['row_iv'] = corr_df['row'].map(iv_score_dict)
corr_df['col_iv'] = corr_df['col'].map(iv_score_dict)

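corr_df['drop_feature'] = corr_df.apply(lambda x: drop_feature_selection(x['row'], x['col'], x['corr'], x['row_iv'], x['col_iv']), axis=1)

# collect the features to drop; this list is used in the Modeling section
corr_drop_features = corr_df['drop_feature'].unique().tolist()
corr_df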

Output:
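
The pruning rule in the code above (for every pair with correlation of at least 0.8, drop the feature with the lower IV) can be illustrated on a toy example. The feature names, IV scores and correlations below are hypothetical, and `features_to_drop` is an illustrative helper, not part of the article's code.

```python
# Hypothetical IV scores and pairwise absolute correlations.
iv = {"f1": 1.2, "f2": 0.9, "f3": 0.7, "f4": 0.6}
pair_corr = {
    ("f1", "f2"): 0.95,
    ("f1", "f3"): 0.30,
    ("f3", "f4"): 0.85,
    ("f2", "f4"): 0.10,
}

def features_to_drop(pair_corr, iv, threshold=0.8):
    """For each highly correlated pair, mark the lower-IV feature for removal."""
    drop = set()
    for (row, col), corr in pair_corr.items():
        if corr >= threshold:
            # Same rule as drop_feature_selection: keep the higher-IV feature.
            drop.add(col if iv[row] >= iv[col] else row)
    return drop

corr_drop = features_to_drop(pair_corr, iv)
kept = [name for name in iv if name not in corr_drop]
print("drop:", sorted(corr_drop))
print("keep:", kept)
```

The resulting set plays the role of the `corr_drop_features` list that the Modeling section filters on.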

Modeling

# train valid split
train_data, valid_data = model_selection.train_test_split(train_dataframe, test_size=0.3, random_state=42, shuffle=True, stratify=train_dataframe['target'])

train_data.shape, valid_data.shape

Output:

selected_features = [col for col in selected_features if col not in corr_drop_features]
cat_cols = [col for col in cat_cols if col in selected_features]
train_cols = [col for col in train_dataframe.columns if col in selected_features]

X_train = train_data[train_cols].copy()
y_train = train_data[target_col].copy()

X_valid = valid_data[train_cols].copy()
y_valid = valid_data[target_col].copy()

X_test = test_dataframe[train_cols].copy()

# binning process
binning_process = optbinning.BinningProcess(
    variable_names=train_cols,
    categorical_variables=cat_cols
)

# estimator
estimator = linear_model.LogisticRegression()

# scorecard
scorecard = optbinning.Scorecard(
    binning_process=binning_process,
    estimator=estimator,
    scaling_method="min_max",
    scaling_method_params={"min": 300, "max": 850},
)

# model fitting
scorecard.fit(X_train, y_train)

Output:
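
The scorecard above is configured with scaling_method="min_max" and a range of 300 to 850. As a rough illustration of what such a rescaling does, the sketch below applies a generic linear min-max map. It shows the general idea only and is not taken from optbinning's internals; the function name and the raw score range are hypothetical.

```python
def min_max_scale(raw, raw_min, raw_max, new_min=300.0, new_max=850.0):
    """Linearly map raw from [raw_min, raw_max] onto [new_min, new_max]."""
    if raw_max == raw_min:
        raise ValueError("raw_min and raw_max must differ")
    fraction = (raw - raw_min) / (raw_max - raw_min)
    return new_min + fraction * (new_max - new_min)

# Hypothetical raw scores spanning the range -4.0 to 6.0.
scaled = [min_max_scale(r, raw_min=-4.0, raw_max=6.0) for r in (-4.0, 1.0, 6.0)]
print(scaled)
```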

# scorecard table
scorecard_df = scorecard.table(style="detailed")

Output:

# look at the scorecard rows for a single feature

scorecard_df.query("Variable == 'P_2'")

Output:

This shows the points (score contribution) assigned to each bin of the P_2 feature.
The points increase with the bin index.
For example, a customer with a P_2 value of 0.73 falls in the 7th bin, whose score is 22.45.
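
To make the lookup described above concrete, the sketch below scores a customer by finding the bin each feature value falls into and summing the points of those bins. The bin edges and points are hypothetical and much coarser than the real scorecard table, and `total_score` is an illustrative helper rather than the library's API.

```python
import bisect

# Hypothetical per-feature scorecards: upper bin edges and points per bin.
# The last bin is open-ended, so there is one more points entry than edges.
scorecard_points = {
    "P_2": {"edges": [0.3, 0.6, 0.8], "points": [5.0, 12.0, 22.0, 30.0]},
    "B_1": {"edges": [0.1, 0.5], "points": [25.0, 15.0, 4.0]},
}

def total_score(customer, scorecard_points):
    """Sum the points of the bin that each feature value falls into."""
    score = 0.0
    for feature, spec in scorecard_points.items():
        # Bins are half-open [lower, upper): a value equal to an edge
        # belongs to the bin above that edge.
        idx = bisect.bisect_right(spec["edges"], customer[feature])
        score += spec["points"][idx]
    return score

example_score = total_score({"P_2": 0.73, "B_1": 0.05}, scorecard_points)
print(example_score)
```

Because every feature contributes a fixed number of points per bin, a score can be explained bin by bin, which is the main appeal of a scorecard over a black-box model.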

Metrics

def amex_metric(y_true, y_pred, return_components=False) -> float:
    """Amex metric for ndarrays"""
    def top_four_percent_captured(df) -> float:
        """Corresponds to the recall for a threshold of 4 %"""
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
       
    def weighted_gini(df) -> float:
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(df) -> float:
        """Corresponds to 2 * AUC - 1"""
        df2 = pd.DataFrame({'target': df.target, 'prediction': df.target})
        df2.sort_values('prediction', ascending=False, inplace=True)
        return weighted_gini(df) / weighted_gini(df2)

    # np.asarray lets the function accept both NumPy arrays and pandas Series
    df = pd.DataFrame({'target': np.asarray(y_true).ravel(), 'prediction': np.asarray(y_pred).ravel()})
    df.sort_values('prediction', ascending=False, inplace=True)
    g = normalized_weighted_gini(df)
    d = top_four_percent_captured(df)

    if return_components: return g, d, 0.5 * (g + d)
    return 0.5 * (g + d)

train_data['predict_proba'] = scorecard.predict_proba(X_train)[:, 1]
valid_data['predict_proba'] = scorecard.predict_proba(X_valid)[:, 1]

train_score = amex_metric(train_data['target'], train_data['predict_proba'])
valid_score = amex_metric(valid_data['target'], valid_data['predict_proba'])

print("Train Score :", train_score)
print("Valid Score :", valid_score)

Output:

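We get a score of about 75 percent. Note that this is the competition metric computed by amex_metric (the mean of the normalized Gini coefficient and the default capture rate in the top 4%), not classification accuracy.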
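
The score printed above comes from amex_metric, which averages a normalized weighted Gini and the share of defaults captured in the top 4% of predictions, with each non-default row given a weight of 20. The sketch below re-implements only the capture-rate component in plain Python on hypothetical toy data, to show how sensitive it is to the ordering at the very top of the ranking.

```python
def top_four_percent_captured(y_true, y_pred):
    """Share of all defaults found in the top 4% (by weight) of predictions."""
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    weights = [20 if target == 0 else 1 for _, target in ranked]
    cutoff = int(0.04 * sum(weights))
    captured, cum_weight = 0, 0
    for (_, target), weight in zip(ranked, weights):
        cum_weight += weight
        if cum_weight > cutoff:  # past the 4% weighted cutoff
            break
        captured += target
    return captured / sum(y_true)

# Hypothetical data: 10 defaults followed by 10 non-defaults.
y_true = [1] * 10 + [0] * 10
good_pred = [0.99 - 0.01 * i for i in range(10)] + [0.50 - 0.01 * i for i in range(10)]
# Same predictions, but one non-default is wrongly ranked first.
bad_pred = good_pred[:-1] + [1.0]

good_rate = top_four_percent_captured(y_true, good_pred)
bad_rate = top_four_percent_captured(y_true, bad_pred)
print(good_rate, bad_rate)
```

A single heavily weighted non-default at the top of the ranking uses up the whole 4% budget in this toy case, so the capture rate drops to zero.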

false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_train, train_data['predict_proba'])
optimal_idx = np.argmax(true_positive_rate - false_positive_rate)
optimal_threshold = thresholds[optimal_idx]
auc_score = metrics.auc(false_positive_rate, true_positive_rate)
print("Train Threshold value is:", optimal_threshold)

false_positive_rate1, true_positive_rate1, thresholds = metrics.roc_curve(y_valid, valid_data['predict_proba'])
optimal_idx = np.argmax(true_positive_rate1 - false_positive_rate1)
optimal_threshold1 = thresholds[optimal_idx]
auc_score1 = metrics.auc(false_positive_rate1, true_positive_rate1)
print("Valid Threshold value is:", optimal_threshold1)

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='Binning+LR: Train AUC = {0:.4f}'.format(auc_score))
plt.plot(false_positive_rate1, true_positive_rate1, 'r', label='Binning+LR: Valid AUC = {0:.4f}'.format(auc_score1))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Output:
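
The threshold in the code above is chosen where the difference between the true positive rate and the false positive rate is largest (often called Youden's J statistic). The sketch below reproduces that selection in plain Python on hypothetical labels and scores; `youden_threshold` is an illustrative helper, not a scikit-learn function.

```python
def youden_threshold(y_true, y_score):
    """Return the threshold that maximizes TPR - FPR, and that maximum."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_j, best_threshold = -1.0, None
    # Try every distinct score as a threshold, from highest to lowest.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the higher threshold
            best_j, best_threshold = j, threshold
    return best_threshold, best_j

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

best_threshold, best_j = youden_threshold(y_true, y_score)
print(best_threshold, best_j)
```

One detail to keep in mind: this sketch treats scores greater than or equal to the threshold as positive, while the article's code applies the threshold with a strict `>` comparison, so a row whose probability equals the threshold exactly is labelled differently.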

train_data['predict'] = (train_data['predict_proba'] > optimal_threshold).astype(int)
valid_data['predict'] = (valid_data['predict_proba'] > optimal_threshold).astype(int)

conf_mat = metrics.confusion_matrix(train_data['target'], train_data['predict'])
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))

sns.heatmap(conf_mat, square=True, annot=True, cmap='Blues', fmt='d', cbar=False, ax=ax1)
sns.heatmap(conf_mat/np.sum(conf_mat), annot=True, fmt='.2%', cmap='Blues', ax=ax2)
plt.show()

Output:

conf_mat = metrics.confusion_matrix(valid_data['target'], valid_data['predict'])
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))

sns.heatmap(conf_mat, square=True, annot=True, cmap='Blues', fmt='d', cbar=False, ax=ax1)
sns.heatmap(conf_mat/np.sum(conf_mat), annot=True, fmt='.2%', cmap='Blues', ax=ax2)
plt.show()

Output:

print(metrics.classification_report(train_data['target'], train_data['predict'], labels=[0, 1]))

Output:

The classification report shows an accuracy of 87% on the training data.

print(metrics.classification_report(valid_data['target'], valid_data['predict'], labels=[0, 1]))

Output:
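
Because non-default customers heavily outnumber default customers, accuracy alone can look good even when many defaults are missed. The sketch below derives accuracy, precision and recall for the default class from the four confusion-matrix counts. The labels are hypothetical and `binary_report` is an illustrative helper, not the scikit-learn report.

```python
def binary_report(y_true, y_pred):
    """Compute confusion-matrix counts and basic metrics for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels: 3 defaults among 10 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

report = binary_report(y_true, y_pred)
print(report)
```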

Scores

train_data['score'] = scorecard.score(X_train)
valid_data['score'] = scorecard.score(X_valid)

y_test = valid_data['target']
score = valid_data['score']

mask = y_test == 0

fig, ax = plt.subplots(figsize=(20,10))
plt.hist(score[mask], label="non-default", color="b", alpha=0.35)
plt.hist(score[~mask], label="default", color="r", alpha=0.35)
plt.xlabel("score")
plt.legend()
plt.show()

Output:

This plot compares the score distributions of default and non-default customers.
There is some overlap between 550 and 650.
Overall, the two groups are well separated.
There are more non-default customers than default customers.

# Plot Distribution of Scores
plt.figure(figsize=(20,10))

plt.hist(score,
         bins=100,
         edgecolor='white',
         color = '#317DC2',
         linewidth=1.2)

plt.title('Scorecard Distribution', fontweight="bold", fontsize=14)

plt.xlabel('Score')
plt.ylabel('Count');

Output:

We can see that many of the scores fall in the range of 650-750.

# Plot Scores Against Probabilities
plt.figure(figsize=(20,10))

plt.scatter(x=score,
            y=valid_data['predict_proba'],
            #data=scorecard,
            color='#317DC2')

plt.title('Scores by Probability', fontweight="bold", fontsize=14)
plt.xlabel('Score')
# predict_proba[:, 1] is the probability of the positive class (target = 1, default)
plt.ylabel('Probability (Default)')
plt.show()

Output:

Conclusion

Machine learning algorithms have revolutionized the credit scoring process by providing more accurate and efficient credit scoring decisions. They are able to analyze vast amounts of data and identify patterns to make more accurate predictions. However, there are challenges to consider, such as the complexity of machine learning models and the need for large amounts of high-quality data. Privacy is also a concern, and lenders must take steps to ensure that borrower data is protected and secure. Despite these challenges, the benefits of machine learning in credit scoring are clear, and it is likely that machine learning will continue to play an increasingly important role in credit scoring in the future.
