Telco Customer Churn Rate Analysis

In this tutorial, we go over how we developed simple yet practical models to account for the churn rate using the Kaggle Telco Customer dataset.
The limitations and future research are also covered as part of the procedure.

Background

Given the significant rise in the number of consumers utilising phone services, a telecom company's marketing department aims to keep existing customers from terminating their contracts while bringing in new ones. To grow its customer base, the company's growth rate must outpace its attrition rate. Better pricing, quicker internet connections, and a safer online experience offered by competitors are a few of the reasons why customers leave their telecom providers. A high churn rate hurts a business's bottom line and stymies expansion. Our churn forecast helps the telecom business determine how effectively it is keeping its current customers and identify the fundamental reasons why existing customers terminate their contracts. The study also helps the business judge whether its offering is more attractive than its rivals'. Because it is far less expensive to retain existing customers than to acquire new ones, the business can use the churn rate study to target discounts, exclusive deals, and better products at current customers.

The Dataset

The telecom firm's data set, which is derived from the IBM sample data set collection, is accessible on Kaggle. The firm serves 7043 customers in California with home phone and internet services. Our challenge is to help the business anticipate customer behaviour so that it can keep customers, and to analyse all pertinent customer data to create targeted retention campaigns. The following details are included in the dataset that was provided:
Research Objectives
Rationale for the Study

Our churn research is crucial for the telecom firm to understand why customers have stopped using its product or service. It is difficult for the business to improve its product and service unless it knows how much income is lost overall due to cancellations, which customers are cancelling, and why. Churn analysis is a common classification problem within supervised learning, and we analyse customer churn behaviour using Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression. Our study will help the business decide how to lower customer attrition by focusing on demographic data, account details, usage patterns, and the services that customers have signed up for.

Exploratory Analysis and Data Summary

The secondary data we examined is accessible on Kaggle, a free-to-use data aggregation platform. The code in this tutorial works with part of that data.

Data Introduction: After using Pandas in Python to read the data, we found no missing values in the raw data set, and most of the features, including gender, phone service, and payment method, were categorical. Both Monthly Charges and Total Charges are numeric.

Correlation: After converting all the categorical data with label encoding, we computed the pairwise correlation between features. The heatmap showed a strong link between the features "Contract" and "Tenure." This makes sense because both gauge a customer's level of commitment. There is also a strong association between "Multiple Lines," "StreamingTV," "StreamingMovie," and "Monthly Charges." We believe this is because those who stream films are more inclined to stream television as well. Because they consume so much data when viewing TV episodes or films, their monthly costs often increase.
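The encoding and pairwise-correlation step described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the rows are made-up toy values, the column names are assumed to match the Kaggle file, and pandas category codes stand in for whichever label encoder was actually used.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Telco columns; not real data.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year",
                 "Month-to-month", "Two year", "One year"],
    "tenure": [1, 70, 30, 3, 65, 24],
    "StreamingTV": ["No", "Yes", "Yes", "No", "Yes", "No"],
    "MonthlyCharges": [29.85, 105.50, 89.10, 35.20, 99.65, 56.95],
})


def encode_categoricals(frame):
    """Return a copy in which every non-numeric column holds integer labels."""
    encoded = frame.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        # Category codes assign one integer per distinct value (label encoding).
        encoded[col] = encoded[col].astype("category").cat.codes
    return encoded


encoded = encode_categoricals(df)
corr = encoded.corr()  # pairwise Pearson correlation between all columns
print(corr.round(2))
```

The resulting corr table is what a heatmap function such as seaborn.heatmap would take as input. Note that label encoding imposes an arbitrary order on unordered categories, so correlations involving such columns should be read with care.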
Similarly, customers with several lines on their account will probably pay more than those with just one line, which fits the link between "Multiple Lines" and "Monthly Charges."

Data Analyses and Key Findings

Simple Linear Regression, Binomial Logit Regression, Binomial Probit Regression, and Random Forest Regression are the four techniques we selected for our data.

Overview of the Model

Let's begin with the basic linear regression model, which was our first pick. A linear regression model predicts the target as a weighted sum of the feature inputs. Linear regression serves as our accuracy baseline and point of comparison; most of its benefits and drawbacks stem from its simplicity and ease of use.

Random forest, a widely used machine learning model, is our fourth and final model. A random forest consists of many individual decision trees that work together. The advantages in our situation are as follows: (1) It typically offers excellent accuracy and strikes a good balance between bias and variance. (2) It provides feature importances that can be visualised. (3) Outliers have little to no impact on it. (4) It supports both linear and nonlinear relationships. Cons include the following: (1) It is far more difficult to interpret than the earlier models. (2) Training takes considerably longer on large datasets.

Setting normalize=True returns the relative frequencies of the unique values, giving proportions instead of raw counts. The proportion of churned customers is far smaller than the proportion of retained customers: of all the customers in this dataset, 26% have left the telecom service.

Now, let's visualize each variable separately. The variables fall into three types: categorical, ordinal, and numerical.

Categorical Variables: customerID (assuming it is an identifier and not used as a feature)
Ordinal Variables: SeniorCitizen (assuming it is a binary variable; whether it counts as ordinal depends on the specific context)
Numerical Variables:
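The churn proportion and the split of columns by type can be sketched as follows. This is only an illustration on made-up rows: the column names (including a target column called Churn) are assumptions about the Kaggle file, and the toy proportions do not reproduce the 26% figure.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has 7043 customers.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "SeniorCitizen": [0, 1, 0, 0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})

# normalize=True yields each class's share of rows instead of a raw count.
churn_share = df["Churn"].value_counts(normalize=True)
print(churn_share)

# Set the identifier and the target aside, then split features by dtype.
features = df.drop(columns=["customerID", "Churn"])
numeric_cols = features.select_dtypes(include="number").columns.tolist()
other_cols = features.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

A dtype-based split places SeniorCitizen among the numeric columns because it is stored as 0/1, so a binary flag like this has to be reassigned by hand if it should be treated as categorical or ordinal.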
Data Visualization

Independent Variable (Categorical) - to check OUTLIERS

Observation: No outliers were found in the categorical columns.

Detecting outliers in categorical columns is a bit different from detecting them in numerical columns. Categorical columns do not have a notion of "outliers" in the way numerical values do. However, you can check for unusual or rare categories that might be treated as outliers based on their frequency.
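A frequency-based check of this kind can be sketched as follows. The helper name rare_categories, the min_share threshold, and the sample values are all hypothetical choices made for illustration; a suitable threshold depends on the data.

```python
import pandas as pd


def rare_categories(series, min_share=0.05):
    """Return the categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    return shares[shares < min_share].index.tolist()


# Hypothetical column: 20 customers, one of whom uses an uncommon method.
payment = pd.Series(
    ["Electronic check"] * 12
    + ["Mailed check"] * 7
    + ["Bank transfer (automatic)"] * 1,
    name="PaymentMethod",
)

# Flag categories that cover less than 10% of the rows.
print(rare_categories(payment, min_share=0.10))
```

An empty result for every categorical column would correspond to the observation that no unusual categories were found.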
Remember that the definition of "outliers" in categorical columns is somewhat subjective and depends on the context of your data. The goal is to identify categories that are rare or have unusual patterns.

Independent Variable (Numerical) - to check OUTLIERS

Now let's look at the correlation between all the numerical variables. We use a heatmap to visualize the correlation. Heatmaps visualize data through variations in coloring; here, a darker color means a stronger correlation.

Domain-Specific Analysis: Depending on the domain and business context, investigate why there is a positive correlation. Are there specific business practices or reasons that explain this relationship? Understanding the context can provide valuable insights.

Predictive Modeling: If your goal is to build a predictive model, consider whether having both tenure and total_charges as features is redundant given their high correlation. In some cases, you might keep only one of the features or apply dimensionality reduction techniques.

Output:

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time= 2.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time= 1.8s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time= 1.8s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time= 1.9s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.734 total time= 1.5s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.823 total time= 0.4s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.809 total time= 0.3s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.800 total time= 0.4s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.791 total time= 0.4s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.806 total time= 0.3s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.801 total time= 0.8s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.783 total time= 0.8s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.794 total time= 0.8s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.769 total time= 0.8s

Limitations