Churn Prediction Project Report

Churn Prediction for a Telecom Company

Imagine that we are working at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offering them an incentive to stay.

We want to target them with promotional messages and give them a discount. We also would like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model’s predictions. We have collected a dataset where we’ve recorded some information about our customers: what type of services they used, how much they paid, and how long they stayed with us.

We also know who canceled their contracts and stopped using our services (churned). We will use this information as the target variable in the machine learning model and predict it using all other available information. 

Problem Definition and Algorithm 

Task Definition

The task is to predict which customers are likely to churn by analyzing the available customer data, so that focused retention programs can be developed for them.

Algorithm Definition

Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
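As a rough illustration of the idea, logistic regression computes a weighted sum of the input features and passes it through the logistic (sigmoid) function to obtain a probability. The sketch below uses made-up weights and two made-up features; none of these values come from the fitted model in this report.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class (churn) for one customer."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for two features (illustrative, not fitted values)
example_weights = [0.8, -0.05]
example_bias = -1.0

# One hypothetical customer: first feature = 1.0, second feature = 12.0
p = predict_proba([1.0, 12.0], example_weights, example_bias)
print(f"predicted churn probability: {p:.2f}")
```

A score of zero maps to a probability of exactly 0.5; negative scores map below it and positive scores above it.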

Experimental Evaluation 

Methodology

The dataset is split into a training set (80%) and a test set (20%). Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata. The raw data contains 7043 rows (customers) and 21 columns (features).

The “Churn” column is our target or output variable.
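One way to produce such an 80/20 split is scikit-learn's train_test_split. The sketch below runs on a tiny made-up table rather than the real data, and the lowercase column names and the random_state value are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the real customer table (all values are made up)
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "monthlycharges": [29.85, 56.95, 53.85, 42.30, 99.65,
                       89.10, 29.75, 104.80, 56.15, 49.95],
    "churn": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Hold out 20% of the rows for testing; the seed only makes the split repeatable
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# Separate the target variable from the input features
y_train = df_train["churn"].values
y_test = df_test["churn"].values
X_train = df_train.drop(columns="churn")
X_test = df_test.drop(columns="churn")

print(len(df_train), len(df_test))
```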

EDA

Exploratory data analysis (EDA) examines the following in relation to “Churn”:

  1. Churn Rate
  2. Demographic Information: Gender, Senior Citizen, Partner, Dependents
  3. Service Information: PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
  4. Account Information: tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges

ETL 

  • Datatype : Some features were converted from ‘str’ to int for later use
  • Null/Empty columns : NA values were filled with appropriate values
  • Extraction of the underlying values of categorical variables for one-hot encoding
    • contract :  month-to-month, one_year, two_year
    • deviceprotection : no, no_internet_service, yes
    • gender : male, female
    • internetservice : dsl, fiber_optic, no
    • multiplelines  : no, yes, no_phone_service
    • onlinebackup : no, yes, no_internet_service
    • onlinesecurity : no, yes, no_internet_service
    • paymentmethod : bank_transfer_(automatic), credit_card_(automatic), electronic_check, mailed_check
    • streamingmovies : no, yes, no_internet_service
    • streamingtv : no, yes, no_internet_service
    • techsupport : no, yes, no_internet_service

Note : Columns with only two values, such as ‘yes’ and ‘no’, were simply categorized into two columns.

  • Dict-to-vector : All columns ready for prediction were converted to dictionaries and then vectorized using sklearn.feature_extraction
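The dict-to-vector step can be sketched with DictVectorizer from sklearn.feature_extraction, which one-hot encodes string values and passes numeric values through unchanged. The three customer records below are made up, and the lowercase feature names are an assumption based on the list above.

```python
from sklearn.feature_extraction import DictVectorizer

# Each customer is one dictionary (made-up records for illustration)
customers = [
    {"contract": "month-to-month", "internetservice": "dsl", "tenure": 1},
    {"contract": "two_year", "internetservice": "no", "tenure": 45},
    {"contract": "one_year", "internetservice": "dsl", "tenure": 12},
]

# sparse=False returns a plain NumPy array instead of a sparse matrix
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(customers)

# String features become one column per value; numeric features stay as-is
for name in dv.feature_names_:
    print(name)
print(X)
```

With a pandas DataFrame, the list of dictionaries would typically come from df.to_dict(orient="records"). The fitted vectorizer should be reused with transform (not fit_transform) on the test set so that both sets share the same columns.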

Results

  1. The global churn rate is 27%.
  2. Churn rate decreases with tenure: for higher tenure values, the churn rate is smaller.
  3. Churn rate increases with monthly charges: for higher values of monthly charges, the churn rate is higher.
  4. Senior citizens are more likely to churn.
  5. Customers with tech support and good internet service churn less.
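Figures like these can be obtained by averaging a 0/1 churn column, overall and within each group. The sketch below uses a small made-up table, so its numbers do not match the results above; the column names are assumptions.

```python
import pandas as pd

# Made-up data: churn is encoded as 1 (churned) or 0 (stayed)
df = pd.DataFrame({
    "seniorcitizen": [0, 0, 1, 1, 0, 0, 1, 0],
    "churn":         [0, 1, 1, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of customers who churned
global_rate = df["churn"].mean()

# The same average computed separately for each group
rate_by_group = df.groupby("seniorcitizen")["churn"].mean()

print(f"global churn rate: {global_rate:.3f}")
print(rate_by_group)
```

Comparing each group's rate with the global rate gives a quick indication of which attribute values are associated with more or less churn.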

Model Accuracy: 80%

Discussion

The model outputs probabilities, not hard predictions. To binarize the output, we cut the predictions at a certain threshold. If the probability is greater than or equal to 0.5, we predict True (churn), and False (no churn) otherwise. This allows us to use the model for solving our problem: predicting customers who churn. 
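The thresholding step and the accuracy calculation can be sketched as follows. The probabilities and labels are made up for illustration; with a fitted scikit-learn model, the probabilities would typically come from predict_proba.

```python
import numpy as np

# Hypothetical predicted churn probabilities and true labels (1 = churned)
y_prob = np.array([0.10, 0.50, 0.82, 0.35, 0.67])
y_true = np.array([0, 1, 1, 1, 0])

# Predict churn when the probability is at least the threshold
threshold = 0.5
y_pred = y_prob >= threshold

# Accuracy is the fraction of customers whose prediction matches the label
accuracy = (y_pred == y_true.astype(bool)).mean()

print(y_pred)
print(f"accuracy: {accuracy:.2f}")
```

Lowering the threshold flags more customers as likely churners, and raising it flags fewer.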

The weights of the logistic regression model are easy to interpret and explain, especially when it comes to the categorical variables encoded using the one-hot encoding scheme. It helps us understand the behavior of the model better and explain to others what it’s doing and how it’s working.
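To illustrate how the weights can be read, the sketch below fits a model on a tiny made-up dataset with a single one-hot encoded feature and pairs each learned weight with its feature name. The data and the resulting weights are illustrative only and are not the values from this project.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data: most month-to-month customers churn (1),
# most two-year customers stay (0)
train = (
    [{"contract": "month-to-month"}] * 4
    + [{"contract": "two_year"}] * 4
)
y = [1, 1, 1, 0, 0, 0, 0, 1]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(train)

model = LogisticRegression()
model.fit(X, y)

# Pair every one-hot column with its learned weight
weights = {name: float(w) for name, w in zip(dv.feature_names_, model.coef_[0])}
for name, w in weights.items():
    print(f"{name}: {w:+.3f}")
```

A positive weight pushes the predicted probability toward churn and a negative weight pushes it away, so each one-hot column can be read as the effect of that category value.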

Related Work

Other work on the same dataset has been shared on Kaggle:

https://www.kaggle.com/datasets/blastchar/telco-customer-churn/code

Future Work for Churn Prediction

We can explore other relationships between the features and churn. Unsupervised learning algorithms could also be used to extract underlying patterns among churning customers.
