Telco Customer Churn Rate Analysis

In this tutorial, we go over how we developed simple yet practical models to explain the churn rate using the Kaggle Telco Customer Churn dataset.

The procedure covers the following steps:

Background and Problem
Data Summary and Exploratory Analysis
Data Analyses
Strategy Recommendations
Limitations
Future Research

Background

Given the significant rise in the number of consumers utilising phone services, a telecom company's marketing department aims to keep existing customers from terminating their contracts while bringing in more new ones. The telecom company's growth rate must outpace its attrition rate to develop its customer base. Better pricing offerings, quicker internet connections, and a safer online experience from other businesses are a few of the reasons why current clients have left their telecom firms.

A high turnover rate will hurt a business's bottom line and stymie expansion. The telecom business would be able to determine how effectively it is keeping its current customers and identify the fundamental causes of existing consumers terminating their contracts with the help of our churn forecast.

With the help of our study, the telecom business may determine whether its offering is more attractive than its rivals'. The business may use the churn rate study to provide discounts, exclusive deals, and better products to retain current customers, because it is far less expensive to keep its present clientele than to acquire new customers.

The Dataset

The telecom firm's data set, which comes from the IBM sample data set collection, is available on Kaggle. The firm serves 7043 customers in California with internet and residential services. Our challenges are to help the business anticipate customer behaviour so that it can retain those customers, and to analyse all pertinent customer data to design targeted retention campaigns.

The following details are included in the dataset that was provided:

Customer demographic data, such as gender, senior-citizen status, and whether the customer has a partner or dependents
Details about the customer's account, such as the number of months they have been with the business (tenure), contract type, paperless billing, payment method, monthly charges, and total charges
How customers use the service, such as whether they stream TV or films
The services each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, and tech support
Customer churn, that is, whether the customer left within the previous month

Research Objectives

Among the factors that lead to the high retention rate, which is the most significant?
Which analytics model can most accurately predict whether a customer will churn?
What are the benefits and drawbacks of employing various analytical models?
What targeted retention initiatives may the telecom firm create using the information we provide?

Rationale for the Study

Our churn research is crucial for the telecom firm to comprehend why the consumer has ceased utilising its product or service. It is difficult for the telecom business to enhance its product and service unless it knows how much income is lost overall due to customer cancellations, which customers are cancelling, and why.

Because churn analysis is a common classification problem within supervised learning, we will analyse customer churn behaviour using Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression.

Our study will assist the business in offering guidance on how to lower customer attrition by focusing on the demographic data, account details, use patterns, and services that customers have signed up for.

Exploratory Analysis and Data Summary

The secondary data we examined is accessible on the free-to-use data aggregation platform Kaggle.

The output below shows part of the data.

<bound method NDFrame.describe of       customerID  gender  SeniorCitizen Partner Dependents  tenure  \
   7590-VHVEG  Female              0     Yes         No       1   
   5575-GNVDE    Male              0      No         No      34   
   3668-QPYBK    Male              0      No         No       2   
   7795-CFOCW    Male              0      No         No      45   
   9237-HQITU  Female              0      No         No       2   
...          ...     ...            ...     ...        ...     ...   
6840-RESVB    Male              0     Yes        Yes      24   
2234-XADUH  Female              0     Yes        Yes      72   
4801-JZAZL  Female              0     Yes        Yes      11   
8361-LTMKD    Male              1     Yes         No       4   
3186-AJIEK    Male              0      No         No      66   

     PhoneService     MultipleLines InternetService OnlineSecurity  ...  \
            No  No phone service             DSL             No  ...   
           Yes                No             DSL            Yes  ...   
           Yes                No             DSL            Yes  ...   
            No  No phone service             DSL            Yes  ...   
           Yes                No     Fiber optic             No  ...   
...           ...               ...             ...            ...  ...   
        Yes               Yes             DSL            Yes  ...   
        Yes               Yes     Fiber optic             No  ...   
         No  No phone service             DSL            Yes  ...   
        Yes               Yes     Fiber optic             No  ...   
        Yes                No     Fiber optic            Yes  ...   
...
  No  
 Yes  
  No  

[7043 rows x 21 columns]>

Data Introduction:

After reading the data with pandas in Python, we found no missing values in the raw data set, and most of the features, including gender, phone service, and payment method, are categorical. Monthly Charges and Total Charges are numeric quantities, although TotalCharges is loaded as text and has to be converted to float before use.

Correlation:

Following the conversion of all the categorical data using an encoder and label encoding, we performed a pairwise correlation for each feature:

The heatmap showed us a strong link between the characteristics "Contract" and "Tenure." It makes sense because these features gauge a customer's level of commitment.

There is a strong association between "MultipleLines," "StreamingTV," "StreamingMovies," and "MonthlyCharges." This, we believe, is because customers who stream films are also more inclined to stream television. Because they consume so much data when viewing TV episodes or films, their monthly charges tend to be higher. Customers with several lines on their account will also probably pay more than those with just one line.
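The encoding and correlation step described above can be sketched as follows. This is a minimal illustration on a hypothetical six-row sample that reuses three of the dataset's column names; the values are invented and it is not the original notebook code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample reusing three column names from the dataset (values are made up)
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year", "One year"],
    "tenure": [1, 24, 70, 3, 60, 30],
    "MonthlyCharges": [70.0, 55.5, 90.1, 29.9, 80.0, 45.0],
})

encoded = sample.copy()
# LabelEncoder assigns integer codes following the sorted order of the labels
encoded["Contract"] = LabelEncoder().fit_transform(encoded["Contract"])

# Pairwise Pearson correlation between every pair of columns
corr = encoded.corr()
print(corr.round(2))
```

Label encoding happens to give "Contract" a sensible order here. For nominal features such as the payment method, the integer codes impose an arbitrary order, so correlations involving them should be read with caution.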

Data Analyses and Key Findings

Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression are the four techniques we have selected for our data.

Overview of the Model

Let's begin with the simple linear regression model, which was our first choice. A linear regression model predicts the target as the weighted sum of the feature inputs. Linear regression serves as our accuracy baseline and point of comparison; its simplicity and ease of use account for most of its benefits and drawbacks.
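To make the "weighted sum" concrete, here is a small sketch on hypothetical toy data (eight invented customers described by tenure and monthly charge). It only illustrates how a linear model can be used as a baseline classifier by thresholding its score; the 0.5 threshold is an assumption, not something taken from the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: [tenure in months, monthly charge] for eight customers
X = np.array([[1, 70.0], [2, 80.0], [5, 95.0], [12, 60.0],
              [24, 50.0], [36, 85.0], [48, 40.0], [72, 30.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = churned, 0 = stayed

model = LinearRegression().fit(X, y)

# The prediction is the intercept plus the weighted sum of the features
manual = model.intercept_ + X @ model.coef_
assert np.allclose(manual, model.predict(X))

# The raw scores are not confined to [0, 1]; threshold them to get class labels
labels = (model.predict(X) >= 0.5).astype(int)
print(labels)
```

Because the raw scores can fall outside the range 0 to 1, logit and probit models are usually a better fit for a binary target, which is why linear regression is treated only as a reference point.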

Random forest, a widely used machine learning model, is our fourth and final model. A random forest is an ensemble of many individual decision trees that work together.

The advantages in our situation are as follows: (1) It typically offers excellent accuracy and strikes a good balance between bias and variance. (2) It yields feature importances that can be visualised. (3) Outliers have little to no impact on it. (4) It supports both linear and nonlinear relationships. The drawbacks are: (1) It is far harder to interpret than the earlier models. (2) Training takes much longer when the dataset is large.
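The feature-importance advantage can be illustrated with a short sketch. The data below is synthetic: the churn label is constructed to depend on tenure only, so the expected ranking is known in advance. The column named "noise" is a hypothetical filler feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the real data; "noise" is a hypothetical irrelevant feature
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "noise": rng.normal(size=n),
})
# By construction, only short-tenure customers churn
target = (demo["tenure"] < 12).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(demo, target)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real dataset the same `feature_importances_` attribute can be plotted as a bar chart to see which customer attributes the forest relies on most.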

Source code:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings("ignore")
#import lux
df=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()
df.shape
df.columns
df.describe
# null_counts was removed in recent pandas releases; show_counts is its replacement
df.info(verbose=True, show_counts=True, memory_usage=True)

s = df.shape
print(f'The dataset contains {s[0]} rows and {s[1]-1} independent columns and 1 target variable')
We need to convert SeniorCitizen to object and TotalCharges to float datatype
# Assuming df is your DataFrame
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['SeniorCitizen'] = df['SeniorCitizen'].astype(str)

df.dtypes
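Checking missing values
df.isnull().sum()
# The raw file reports no missing values, but after the pd.to_numeric(..., errors='coerce')
# call above any non-numeric TotalCharges entries become NaN, so inspect that column here.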
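The effect of `errors='coerce'` on blank entries can be seen in a small sketch. The three-row frame below is hypothetical, and filling the gap with 0 is only one possible choice (an assumption that a customer with no billing history has been charged nothing); dropping the rows is an equally reasonable alternative.

```python
import pandas as pd

# Hypothetical sample: the middle TotalCharges value is a blank string
sample = pd.DataFrame({
    "tenure": [12, 0, 5],
    "MonthlyCharges": [50.0, 20.0, 80.0],
    "TotalCharges": ["600.0", " ", "400.0"],
})

# Blank or non-numeric strings are turned into NaN instead of raising an error
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")
missing = sample.isnull().sum()
print(missing)

# One possible treatment (an assumption): treat the missing total as zero
sample["TotalCharges"] = sample["TotalCharges"].fillna(0.0)
```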
#TARGET VARIABLE
df['Churn'].value_counts()
df['Churn'].value_counts(normalize=True)

#The normalize=True parameter will return the relative frequencies of unique values, giving you a proportion instead of raw counts.

The proportion of churned customers is far smaller than that of retained customers: of all the customers in this dataset, about 26% have left the telecom service.
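This imbalance matters when reading accuracy figures later on. As a rough sketch, using synthetic labels with the same 26/74 split, a model that always predicts "no churn" already reaches 74% accuracy while finding no churners at all:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the class balance reported above (26% churn)
y = np.array([1] * 26 + [0] * 74)
X = np.zeros((100, 1))  # the features are irrelevant for this baseline

# Always predicts the most frequent class, i.e. "no churn"
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)   # share of correct predictions
rec = recall_score(y, pred)     # share of actual churners that were found
print(acc, rec)
```

For this reason the classification reports in the modelling section, which include recall and F1 for the churn class, are more informative than accuracy alone.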

Now, let's visualize each variable separately. Different types of variables are Categorical, ordinal and numerical.

Categorical Variables:

customerID (Assuming it's an identifier and not used as a feature)

gender
Partner
Dependents
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
StreamingMovies
Contract
PaperlessBilling
PaymentMethod

Ordinal Variables:

SeniorCitizen (Assuming it's a binary variable, but its ordinality might depend on the specific context)

Numerical Variables:

tenure
MonthlyCharges
TotalCharges

# drop customer because it's just the customer ID
df.drop(['customerID'], axis=1, inplace=True)

Data Visualization: Independent Variable (Categorical) - to check OUTLIERS

Source code:

import matplotlib.pyplot as plt
import seaborn as sns
object_columns = df.select_dtypes(include='object').columns
num_subplots = len(object_columns)
# Create subplots dynamically based on the number of object columns
fig, axes = plt.subplots((num_subplots + 2) // 3, 3, figsize=(20, 20))
axes = axes.flatten()
for i, col in enumerate(object_columns):
    ax = axes[i]
    ax.set_title(f'Subplot {i + 1}')
    # Using Seaborn countplot
    sns.countplot(x=col, data=df, ax=ax)
    # Annotating each bar with the percentage
    total = len(df[col])
    for p in ax.patches:
        height = p.get_height()
        percentage = f'{height / total:.2%}'
        ax.annotate(percentage,
                    xy=(p.get_x() + p.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

# Adjust layout and show the plot
plt.tight_layout()
plt.show()

Output:

Observation: No outliers were found in the categorical columns.

Detecting outliers in categorical columns is a bit different than in numerical columns. In categorical columns, you don't typically have the notion of "outliers" in the same way you do with numerical values. However, you can check for unusual or rare categories that might be considered as outliers based on their frequency.

Here are some approaches:

Value Counts: Check the distribution of each category in your categorical columns using value_counts(). If you see a category with significantly lower frequency than others, you might consider it unusual or rare.
Bar Plots: Visualize the distribution of categories using bar plots. This can help you quickly identify categories with low frequencies.
Rare Category Aggregation: If there are categories with very low frequencies, you might consider aggregating them into a single category to simplify your analysis.
Check for Missing Values: Sometimes, missing values in categorical columns can be considered a special category. Check if there are any unexpected missing values.

Remember that the definition of "outliers" in categorical columns is somewhat subjective and depends on the context of your data. The goal is to identify categories that are rare or have unusual patterns.
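The value-count and aggregation approaches above can be sketched in a few lines. The helper name `rare_categories`, the 5% threshold, and the "Voucher" category are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def rare_categories(series, threshold=0.05):
    """Return the categories whose relative frequency is below `threshold`."""
    freq = series.value_counts(normalize=True)
    return freq[freq < threshold].index.tolist()

# Hypothetical payment-method column; "Voucher" is an invented rare category
payment = pd.Series(["Electronic check"] * 60 + ["Mailed check"] * 37 + ["Voucher"] * 3)

rare = rare_categories(payment)

# Rare category aggregation: replace every rare label with a single "Other" label
grouped = payment.where(~payment.isin(rare), "Other")
print(rare)
print(grouped.value_counts())
```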

Independent Variable (Numerical) - to check OUTLIERS

num_cols=df.select_dtypes(["int", "float"]).columns
num_cols
categorical_cols=df.select_dtypes(["object", "bool"]).columns
fig, axes = plt.subplots(nrows=1, ncols =3, figsize=(25,20))
plt.subplots_adjust(hspace=0.5)
for i , feature in enumerate(num_cols):
    sns.boxplot(data=df, y=feature, ax = axes[i], orient='v')
    axes[i].set_title(f" Distribution for {feature}")
plt.tight_layout()
plt.show()
Insights:
No outliers were found in the numerical columns.
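A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles, and the same check can be done numerically. The sketch below uses a hypothetical helper, `iqr_outliers`, on made-up charge values in which 400.0 is deliberately extreme.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up monthly charges; the last value is an intentional outlier
charges = pd.Series([20.0, 35.5, 50.0, 64.8, 70.4, 89.9, 105.0, 400.0])

flagged = iqr_outliers(charges)
print(flagged.tolist())
```

Applying the same function to tenure, MonthlyCharges, and TotalCharges would confirm numerically what the box plots show.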
Categorical Independent Variable v/s Target Variable
fig, axes = plt.subplots(nrows=5, ncols =4)
plt.subplots_adjust(hspace=0.9)
for i , feature in enumerate(categorical_cols):
    row_index = i//4
    col_index = i%4
    plot=pd.crosstab(df[feature],df['Churn'])
    plot.div(plot.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(12,20), ax = axes[row_index, col_index])
    axes[row_index, col_index].set_title(f" Distribution for {feature}")
plt.tight_layout()
plt.show()
Numerical Independent Variable vs Target Variable
plt.figure(1)
plt.subplot(1, 2, 1)
a = df.groupby('Churn')['tenure'].median().plot.bar()
plt.bar_label(a.containers[0])
plt.figure(1)
# Use the second panel so this chart does not overwrite the tenure chart
plt.subplot(1, 2, 2)
a = df.groupby('Churn')['TotalCharges'].median().plot.bar()
plt.bar_label(a.containers[0])
#MonthlyCharges
# Encode the target variable Churn: "Yes" (churned) becomes 1 and "No" becomes 0
df['Churn'] = df['Churn'].map({"No": 0, "Yes": 1})

Now let's look at the correlation between all the numerical variables. We will use a heatmap, which visualizes data through variations in color. With the color map used here, darker cells indicate a higher correlation.

matrix = df[df.select_dtypes(["int","float"]).columns].corr()
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu")

Domain-Specific Analysis:

Depending on the domain and business context, investigate why there is a positive correlation. Are there specific business practices or reasons that explain this relationship? Understanding the context can provide valuable insights.

Predictive Modeling:

If your goal is to build a predictive model, consider whether having both tenure and TotalCharges as features is redundant due to their high correlation. In some cases, you might choose to keep only one of the features or apply dimensionality reduction techniques.
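One simple way to act on this is to scan the upper triangle of the correlation matrix and list the later feature of any highly correlated pair. The sketch below runs on synthetic data in which TotalCharges is constructed as tenure times MonthlyCharges (an assumed relationship for illustration); the 0.7 cut-off is also an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
tenure = rng.integers(1, 73, n)
monthly = rng.uniform(20, 120, n)

# Synthetic data: TotalCharges is built from the other two columns by assumption
demo = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})

corr = demo.corr().abs()
# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.7  # illustrative cut-off, not a recommendation
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Tree-based models are usually tolerant of correlated features, so dropping one matters mostly for the linear, logit, and probit models, where it affects how the coefficients can be interpreted.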

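## XGBOOST
from xgboost import XGBClassifier
# X and y are not defined earlier in this tutorial. The next four lines are an assumed preparation step:
# drop rows with missing values, label-encode the text columns, and make a 75/25 stratified split.
data = df.dropna()
X = data.drop('Churn', axis=1).apply(lambda c: c.astype('category').cat.codes if c.dtype == object else c)
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
xgb_model = XGBClassifier().fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))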

## XgBOOST tuning
xgb = XGBClassifier()
xgb_params = {"n_estimators": [50, 100, 300], "subsample":[0.5,0.8,1], "max_depth":[3,5,7], "learning_rate":[0.1,0.01,0.3]}
xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 3, n_jobs = -1, verbose = 2).fit(X_train, y_train)

xgb_cv_model.best_params_
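# Reuse the parameters selected by the grid search instead of hard-coding them
# (the original hard-coded n_estimators=450, a value that is not in the grid above).
xgb_tuned = XGBClassifier(**xgb_cv_model.best_params_).fit(X_train, y_train)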

y_pred = xgb_tuned.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## SVM Support vector classifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
svc_model_sc = SVC().fit(X_train_s, y_train)
y_pred = svc_model_sc.predict(X_test_s)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix(y_test, y_pred))
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
print(classification_report(y_test, y_pred))
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf', 'linear']}
from sklearn.model_selection import GridSearchCV
svc_tuned = GridSearchCV(SVC(),param_grid, verbose=3, refit=True)
svc_tuned.fit(X_train_s, y_train)
print(svc_tuned.best_params_)
print(svc_tuned.best_estimator_)
y_pred = svc_tuned.predict(X_test_s)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
print(classification_report(y_test,y_pred))
## Logistic Regression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
log_model_sc = LogisticRegression()
log_model_sc.fit(X_train_s, y_train)
y_pred = log_model_sc.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## KNN
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
knn_model_sc = KNeighborsClassifier(n_neighbors=1).fit(X_train_s, y_train)

y_pred = knn_model_sc.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
error_rate = []
for i in range(1, 40):
    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(X_train_s, y_train)
    y_pred_i = model.predict(X_test_s)
    error_rate.append(np.mean(y_pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
knn_model_sc_tuned = KNeighborsClassifier(n_neighbors=38).fit(X_train_s, y_train)
y_pred = knn_model_sc_tuned.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
import sweetviz
# Note: both frames are read from the same file, so the comparison shows identical distributions
train = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
test = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
my_report = sweetviz.compare([train, "Train"], [test, "Test"], "Churn")
my_report.show_html("Report.html") # Not providing a filename will default to SWEETVIZ_REPORT.html

Output (SVC grid search, truncated):

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   2.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.9s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.734 total time=   1.5s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.823 total time=   0.4s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.809 total time=   0.3s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.800 total time=   0.4s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.791 total time=   0.4s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.806 total time=   0.3s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.801 total time=   0.8s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.783 total time=   0.8s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.794 total time=   0.8s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.769 total time=   0.8s
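The "50 candidates, totalling 250 fits" in the first line of this log follows directly from the size of the parameter grid and the default five-fold cross-validation. The sketch below reproduces the arithmetic and, as a suggestion rather than part of the original code, shows a leaner grid: `gamma` has no effect on the linear kernel, so the linear candidates that differ only in `gamma` repeat the same model.

```python
from sklearn.model_selection import ParameterGrid

# The grid used for the SVC search above
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 5 * 2 combinations
n_folds = 5                                    # GridSearchCV's default cv
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)

# Leaner alternative (a suggestion): only the rbf kernel is searched over gamma
lean_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100, 1000],
     'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100, 1000]},
]
n_lean = len(ParameterGrid(lean_grid))
print(n_lean)
```

Passing `lean_grid` to `GridSearchCV` in place of `param_grid` would cut the number of fits from 250 to 150 without removing any distinct model from the search.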


Limitations

Our model, dataset, and research are subject to the following limitations.
While the number of observations is respectable, we might learn more if there were additional columns with attributes such as customer location, competitor data, and other pertinent details.
There are more powerful models outside our range, but we chose ours based not only on complexity and predictive ability but also, more importantly, on ease of interpretation. For instance, neural networks or strong gradient boosting models may perform far better and achieve higher accuracy.
Our dataset is cross-sectional, which means it has no time series component. Our objective is to forecast the churn rate so that we may choose between monthly, one-year, or two-year contracts. To improve our ability to forecast and judge the future market, it would be ideal to find a time series dataset with all the customer data going back up to two years.
