Telco Customer ChurnRate Analysis

In this tutorial, we go over how we developed simple yet practical models to explain the churn rate using the Kaggle Telco Customer dataset. The tutorial covers:

  1. Background and Problem
  2. Data Summary and Exploratory Analysis
  3. Data Analyses
  4. Strategy Recommendations

Limitations and future research are also covered.

Background

Given the significant rise in the number of consumers utilising phone services, a telecom company's marketing department aims to keep existing customers from terminating their contracts while bringing in more new ones. The telecom company's growth rate must outpace its attrition rate to develop its customer base. Better pricing offerings, quicker internet connections, and a safer online experience from other businesses are a few of the reasons why current clients have left their telecom firms.

A high turnover rate will hurt a business's bottom line and stymie expansion. The telecom business would be able to determine how effectively it is keeping its current customers and identify the fundamental causes of existing consumers terminating their contracts with the help of our churn forecast.

With the help of our study, the telecom business can determine whether its offering is more attractive than its rivals'. The business can use the churn rate study to provide discounts, exclusive deals, and better products to retain current customers, because it is far less expensive to retain existing customers than to acquire new ones.

The Dataset

The data set from the telecom firm, which is derived from the IBM sample set collection, is accessible on Kaggle. In California, the firm serves 7043 consumers with internet and residential services. Helping the business anticipate consumer behaviour to keep them as clients and analysing all pertinent customer data to create targeted customer retention campaigns are our challenges.

The following details are included in the dataset that was provided:

  1. Customer demographic data, such as gender, senior-citizen status, and whether the customer has a partner or dependents
  2. Details about the customer's account, such as the number of months they have been with the business, paperless billing, payment method, monthly charges, and total charges
  3. The way that customers use the service, such as whether they stream TV or films
  4. The services the customer signed up for, including phone, internet, multiple lines, online security, online backup, device protection, and tech support
  5. Customer churn, that is, whether the customer left within the previous month

Research Objectives

  1. Among the factors that lead to a high churn rate, which is the most significant?
  2. Which analytics model can most accurately forecast customer churn?
  3. What are the benefits and drawbacks of employing various analytical models?
  4. What targeted retention initiatives may the telecom firm create using the information we provide?

Rationale for the Study

Our churn research is crucial for the telecom firm to comprehend why the consumer has ceased utilising its product or service. It is difficult for the telecom business to enhance its product and service unless it knows how much income is lost overall due to customer cancellations, which customers are cancelling, and why.

Because churn analysis is a common classification problem within supervised learning, we will analyse customer churn behaviour using Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression.

Our study offers the business guidance on how to lower customer attrition by focusing on demographic data, account details, usage patterns, and the services that customers have signed up for.

Exploratory Analysis and Data Summary

The secondary data we examined is accessible on the free-to-use data aggregation platform Kaggle.

The code below loads a portion of the data.

Data Introduction:

After reading the data with pandas in Python, we found no missing values in the raw data set and saw that most of the features, including gender, phone service, and payment method, were categorical. Both Monthly Charges and Total Charges are numerical.
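The check described above can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the three rows are hypothetical stand-ins for the Kaggle CSV (only the column names follow the dataset), and the real file would be read from wherever it was downloaded.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the Kaggle CSV.
# With the real file, pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "customerID,gender,PhoneService,PaymentMethod,MonthlyCharges,TotalCharges,Churn\n"
    "A-001,Female,No,Electronic check,29.85,29.85,No\n"
    "A-002,Male,Yes,Mailed check,56.95,1889.50,No\n"
    "A-003,Male,Yes,Mailed check,53.85,108.15,Yes\n"
)
df = pd.read_csv(sample_csv)

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Split columns into numerical and non-numerical (categorical) groups.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

On the full dataset it is worth confirming that each charge column is actually parsed as a numeric dtype, since a column that looks numeric can be read as text if any entry is not a valid number.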

Correlation:

After converting all the categorical data to integers with label encoding, we computed the pairwise correlation between every pair of features:
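A minimal sketch of this step, assuming pandas category codes as the label encoding (the text does not say which encoder was used). The rows are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year",
                 "Month-to-month", "Two year", "One year"],
    "tenure": [1, 70, 24, 3, 60, 30],
    "MonthlyCharges": [70.0, 90.0, 55.0, 80.0, 100.0, 60.0],
    "Churn": ["Yes", "No", "No", "Yes", "No", "No"],
})

encoded = df.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Label encoding: each distinct label becomes an integer code
    # (codes follow the sorted order of the labels).
    encoded[col] = encoded[col].astype("category").cat.codes

# Pairwise Pearson correlation between all encoded features.
corr = encoded.corr()
print(corr.round(2))
```

The resulting `corr` matrix is what a heatmap function (for example seaborn's `heatmap`) would take as input. Note that integer codes impose an arbitrary order on unordered categories, so correlations involving such columns should be read with care.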

The heatmap showed us a strong link between the characteristics "Contract" and "Tenure." It makes sense because these features gauge a customer's level of commitment.

There is a strong association between "MultipleLines," "StreamingTV," "StreamingMovies," and "MonthlyCharges." We believe this is because customers who stream films are also more inclined to stream television, and because streaming TV episodes or films consumes so much data, their monthly charges tend to be higher. Customers with several lines on their account will likewise probably pay more than those with just one line.

Data Analyses and Key Findings

Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression are the four techniques we have selected for our data.

Overview of the Model

Let's begin by describing the basic linear regression model, which was our first pick. A linear regression model predicts the target as the weighted sum of the feature inputs. Linear regression serves as our accuracy baseline and point of comparison; its simplicity and ease of use account for most of its benefits and drawbacks.
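To make the "weighted sum" idea concrete, here is a small sketch that fits the weights by ordinary least squares with NumPy and thresholds the output at 0.5 to obtain a churn label. The data, the two chosen features, and the 0.5 cut-off are assumptions for illustration, not the setup used in the study.

```python
import numpy as np

# Hypothetical toy data: columns are tenure (months) and monthly charges.
X = np.array([[1, 70.0], [3, 80.0], [24, 55.0],
              [30, 60.0], [60, 100.0], [70, 90.0]])
y = np.array([1, 1, 0, 0, 0, 0])  # 1 = churned, 0 = stayed

# Prepend a column of ones so the first weight acts as the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: weights minimising the squared prediction error.
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The prediction is the weighted sum of the inputs ...
scores = X1 @ weights
# ... turned into a class label with an assumed 0.5 threshold.
predicted = (scores >= 0.5).astype(int)
accuracy = (predicted == y).mean()

print("weights:", weights)
print("accuracy on the toy data:", accuracy)
```

Because the raw scores are not constrained to lie between 0 and 1, this model is best treated as a baseline, which is one motivation for the logit and probit models.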

Random forest, a widely used machine learning model, is our fourth and final model. A random forest consists of many individual decision trees that work together as an ensemble.

The advantages in our situation are as follows: (1) It typically offers excellent accuracy and strikes a good balance between bias and variance. (2) It can be used to visualise feature importance. (3) Outliers have little to no impact on it. (4) It handles both linear and nonlinear relationships. The drawbacks are: (1) It is far more difficult to interpret than the earlier models. (2) If the dataset is large, it takes a lot longer to train.
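The feature-importance point can be illustrated with a short sketch. It assumes scikit-learn's `RandomForestClassifier`, since churn is a binary label; the text does not state which implementation was used. The data is synthetic, and the rule that short-tenure customers churn is invented so that one feature is clearly informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): churn depends only on tenure here.
rng = np.random.default_rng(0)
n = 200
tenure = rng.integers(0, 72, size=n)
monthly_charges = rng.uniform(20, 120, size=n)
noise = rng.normal(size=n)

X = np.column_stack([tenure, monthly_charges, noise])
y = (tenure < 12).astype(int)  # invented rule: short-tenure customers churn
feature_names = ["tenure", "MonthlyCharges", "noise"]

# Fit an ensemble of 100 decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Plotting these values as a bar chart gives the feature-relevance visualisation mentioned above.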

Source code:

Passing normalize=True to value_counts() returns the relative frequencies of the unique values, giving a proportion instead of raw counts.

The proportion of churned customers is far smaller than that of retained customers: of all the customers in this dataset, 26% have left the telecom service.
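As a small illustration of raw counts versus proportions, the sketch below uses a hypothetical four-row `Churn` column rather than the real data, so the printed share is not the dataset's figure.

```python
import pandas as pd

# Hypothetical toy column; the real column has 7043 entries.
churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

counts = churn.value_counts()                # raw counts per label
shares = churn.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```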

Now, let's visualize each variable separately. The variables fall into three types: categorical, ordinal, and numerical.

Categorical Variables:

  • customerID (assuming it's an identifier and not used as a feature)
  • gender
  • Partner
  • Dependents
  • PhoneService
  • MultipleLines
  • InternetService
  • OnlineSecurity
  • OnlineBackup
  • DeviceProtection
  • TechSupport
  • StreamingTV
  • StreamingMovies
  • Contract
  • PaperlessBilling
  • PaymentMethod

Ordinal Variables:

SeniorCitizen (Assuming it's a binary variable, but its ordinality might depend on the specific context)

Numerical Variables:

  1. tenure
  2. MonthlyCharges
  3. TotalCharges

Data Visualization: Independent Variable (Categorical) - to check OUTLIERS

Source code:

Output:

Observation: No outliers were found in the categorical columns.

Detecting outliers in categorical columns is a bit different than in numerical columns. In categorical columns, you don't typically have the notion of "outliers" in the same way you do with numerical values. However, you can check for unusual or rare categories that might be considered as outliers based on their frequency.

Here are some approaches:

  • Value Counts: Check the distribution of each category in your categorical columns using value_counts(). If you see a category with significantly lower frequency than others, you might consider it unusual or rare.
  • Bar Plots: Visualize the distribution of categories using bar plots. This can help you quickly identify categories with low frequencies.
  • Rare Category Aggregation: If there are categories with very low frequencies, you might consider aggregating them into a single category to simplify your analysis.
  • Check for Missing Values: Sometimes, missing values in categorical columns can be considered a special category. Check if there are any unexpected missing values.

Remember that the definition of "outliers" in categorical columns is somewhat subjective and depends on the context of your data. The goal is to identify categories that are rare or have unusual patterns.
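The value-count and rare-category approaches above can be sketched as follows. The labels, their frequencies, and the 5% threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical payment-method column with two rare labels.
payment = pd.Series(
    ["Electronic check"] * 50
    + ["Mailed check"] * 45
    + ["Bank transfer (automatic)"] * 4
    + ["Cheque"] * 1
)

# Relative frequency of each category.
freq = payment.value_counts(normalize=True)

# Treat categories below an assumed 5% share as rare.
threshold = 0.05
rare = freq[freq < threshold].index

# Aggregate the rare categories into a single "Other" bucket.
grouped = payment.where(~payment.isin(rare), "Other")

# Missing values can be counted as their own special case.
n_missing = payment.isnull().sum()

print(freq)
print(grouped.value_counts())
print("missing:", n_missing)
```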

Independent Variable (Numerical) - to check OUTLIERS

Now let's look at the correlation between all the numerical variables. We will use a heatmap to visualize the correlation. Heatmaps visualize data through variations in colour; here, a darker colour means a stronger correlation.

Domain-Specific Analysis:

Depending on the domain and business context, investigate why there is a positive correlation. Are there specific business practices or reasons that explain this relationship? Understanding the context can provide valuable insights.

Predictive Modeling:

If your goal is to build a predictive model, consider whether having both tenure and TotalCharges as features is redundant due to their high correlation. In some cases, you might choose to keep only one of the features or apply dimensionality reduction techniques.
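One simple way to act on this is to drop one feature from each highly correlated pair. The sketch below is an illustration under stated assumptions: the toy data builds TotalCharges as tenure times MonthlyCharges so that the two are strongly correlated, and the 0.9 threshold is a hypothetical choice.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "tenure": [1, 3, 24, 30, 60, 70],
    "MonthlyCharges": [70.0, 80.0, 55.0, 60.0, 100.0, 90.0],
})
# Assumption for the toy data only: total charges = tenure x monthly charges.
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

corr = df.corr().abs()
threshold = 0.9  # assumed cut-off for "highly correlated"

# For each pair above the threshold, keep the first feature and drop the second.
to_drop = set()
cols = list(corr.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first not in to_drop and corr.loc[first, second] > threshold:
            to_drop.add(second)

reduced = df.drop(columns=sorted(to_drop))
print("dropped:", sorted(to_drop))
print("kept:", list(reduced.columns))
```

Which feature of a pair to keep is a modelling decision; here the earlier column is kept purely for simplicity.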

Output:

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   2.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.9s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.734 total time=   1.5s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.823 total time=   0.4s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.809 total time=   0.3s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.800 total time=   0.4s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.791 total time=   0.4s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.806 total time=   0.3s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.801 total time=   0.8s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.783 total time=   0.8s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.794 total time=   0.8s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.769 total time=   0.8s
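A log of this shape is what scikit-learn's `GridSearchCV` prints with a high `verbose` setting when tuning a support vector classifier. The original code is not shown, so the following is only a sketch of how such a search might be set up: the data is synthetic, and the grid values are assumptions apart from C=0.1, gamma=1 and 0.1, and the rbf and linear kernels, which appear in the log. A grid of 5 x 5 x 2 values gives the 50 candidates reported there.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the encoded churn features and labels.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# Hypothetical grid: 5 x 5 x 2 = 50 candidates.
param_grid = {
    "C": [0.1, 0.5, 1, 5, 10],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear"],
}

# cv=5 gives 5 folds per candidate, so 250 fits in total;
# verbose=3 prints one line per fit, including its score and time.
search = GridSearchCV(SVC(), param_grid, cv=5, verbose=3)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("best parameters:", best_params)
print("best mean CV accuracy:", round(best_score, 3))
```

The linear kernel ignores gamma, so candidates that differ only in gamma score identically for that kernel.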

Limitations

The following limitations apply to our model, our dataset, and the research as a whole.

  1. While the number of observations is respectable, we might learn more from the outcome if there were additional columns with attributes like the location of the clients, competition data, and other pertinent details.
  2. There are more powerful models than the ones we used, but we picked ours based not just on complexity and predictive ability but also, and more importantly, on ease of interpretation. For instance, neural networks or gradient boosting may perform far better and produce higher accuracy.
  3. Our dataset has a cross-sectional structure, which means it has no time series component. Our objective is to forecast the churn rate so that we may choose between monthly, one-year, or two-year contracts. A time series dataset with all the client data going back up to two years would be ideal if we wanted to improve our ability to forecast and judge the future market.