Churn Prediction Project Report

Churn Prediction for a Telecom Company

Imagine that we are working at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offering them an incentive to stay.

We want to target them with promotional messages and give them a discount. We also would like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model’s predictions. We have collected a dataset where we’ve recorded some information about our customers: what type of services they used, how much they paid, and how long they stayed with us.

We also know who canceled their contracts and stopped using our services (churned). We will use this information as the target variable in the machine learning model and predict it using all other available information. 

Problem Definition and Algorithm 

Task Definition

The task is to predict which customers are likely to churn by analyzing all relevant customer data, so that focused retention programs can be developed for them.

Algorithm Definition

Logistic regression is a statistical method for predicting a binary outcome, such as yes or no, based on prior observations in a dataset. A logistic regression model predicts a dependent variable by analyzing its relationship with one or more independent variables.
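
As a minimal sketch of the idea, the model computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability between 0 and 1. The feature choice, weights, and bias below are hypothetical values picked for illustration; they are not the parameters of the model trained in this project.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_churn_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical parameters for two features: [tenure in months, monthly charges].
weights = [-0.05, 0.02]
bias = -0.5

new_customer = predict_churn_probability([2, 90.0], weights, bias)
loyal_customer = predict_churn_probability([60, 30.0], weights, bias)
print(f"new customer:   {new_customer:.3f}")
print(f"loyal customer: {loyal_customer:.3f}")
```

With these made-up weights, a short tenure and a high monthly charge push the probability up, which is the kind of relationship the model learns from data.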

Experimental Evaluation 


The dataset is split into a training set (80%) and a test set (20%). Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata. The raw data contains 7043 rows (customers) and 21 columns (features).
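
An 80/20 split of this size can be sketched as follows. The use of `train_test_split` from scikit-learn and the `random_state` value are assumptions made for illustration; the report does not say how the split was actually performed. Row indices stand in for the real customer records.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 7043 customer rows; in practice this would be the dataset itself.
customer_rows = list(range(7043))

# Hold out 20% of the customers for testing. The seed is an arbitrary choice
# that only makes the shuffle reproducible.
train_rows, test_rows = train_test_split(
    customer_rows, test_size=0.2, random_state=1
)

print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split stable between runs, so that changes in accuracy come from the model and not from a different shuffle.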

The “Churn” column is our target or output variable.


Exploratory data analysis (EDA) was performed on how the following relate to “Churn”:

  1. Churn Rate
  2. Demographic Information: Gender, Senior Citizen, Partner, Dependents
  3. Service Information: PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
  4. Account Information: tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges


  • Datatype: Some features were converted from ‘str’ to int for later use
  • Null/empty values: NA values were filled with appropriate values
  • Categorical variables extracted for one-hot encoding:
    • contract :  month-to-month, one_year, two_year
    • deviceprotection : no, no_internet_service, yes
    • gender : male, female
    • internetservice : dsl, fiber_optics, no
    • multiplelines  : no, yes, no_phone_service
    • onlinebackup : no, yes, no_internet_service
    • onlinesecurity : no, yes, no_internet_service
    • paymentmethod : bank_transfer_(automatic), credit_card_(automatic), electronic_check, mailed_check
    • streamingmovies : no, yes, no_internet_service
    • streamingtv : no, yes, no_internet_service
    • techsupport : no, yes, no_internet_service

Note: Columns with only two values, such as ‘yes’ and ‘no’, were simply encoded as two columns.

  • Dict-to-vector: All columns ready for prediction were converted to dictionaries and then vectorized using sklearn.feature_extraction
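
The dict-to-vector step can be sketched as below. We assume here that the class used from sklearn.feature_extraction is `DictVectorizer`; the report names only the module. The two customer records are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up customers, already converted to dictionaries.
customers = [
    {"contract": "month-to-month", "gender": "female",
     "tenure": 1, "monthlycharges": 29.85},
    {"contract": "two_year", "gender": "male",
     "tenure": 34, "monthlycharges": 56.95},
]

# String values are one-hot encoded as "column=value" features;
# numeric values are passed through unchanged.
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(customers)

print(dv.feature_names_)
print(X)
```

Each categorical value becomes its own 0/1 column, which is why a two-valued column such as gender ends up as two columns in the feature matrix.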


  1. The overall churn rate is 27%.
  2. Churn falls with tenure: the longer a customer has stayed, the lower the churn rate.
  3. Churn rises with monthly charges: the higher the monthly charges, the higher the churn rate.
  4. Senior citizens are more likely to churn.
  5. Customers with tech support and good internet service churn less.
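
Findings of this kind typically come from comparing the churn rate within a group to the overall churn rate. The sketch below shows the calculation with pandas on a tiny invented table; the column names follow the lowercase convention used above, and the numbers are not taken from the real dataset.

```python
import pandas as pd

# Invented toy data: 1 means yes, 0 means no.
df = pd.DataFrame({
    "seniorcitizen": [0, 0, 0, 0, 1, 1, 1, 0],
    "churn":         [0, 0, 0, 1, 1, 1, 0, 0],
})

# Overall churn rate: the mean of a 0/1 column is the share of ones.
global_rate = df["churn"].mean()

# Churn rate within each group of the chosen feature.
rate_by_group = df.groupby("seniorcitizen")["churn"].mean()

print(f"overall churn rate: {global_rate:.3f}")
print(rate_by_group)
```

A group whose rate sits well above the overall rate, as senior citizens do in this toy table, is a candidate for targeted retention offers.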

Model Accuracy: 80%


The model outputs probabilities, not hard predictions. To binarize the output, we cut the predictions at a certain threshold. If the probability is greater than or equal to 0.5, we predict True (churn), and False (no churn) otherwise. This allows us to use the model for solving our problem: predicting customers who churn. 
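
The thresholding step, and the accuracy computed from it, can be sketched with NumPy. The probabilities and labels below are made up for illustration and are not outputs of the project’s model.

```python
import numpy as np

# Made-up predicted churn probabilities and the true outcomes (1 = churned).
churn_proba = np.array([0.08, 0.62, 0.50, 0.31, 0.91])
y_true = np.array([0, 1, 0, 0, 1])

# Cut at the threshold: probabilities >= 0.5 become True (churn).
threshold = 0.5
churn_pred = churn_proba >= threshold

# Accuracy is the share of customers whose prediction matches the outcome.
accuracy = (churn_pred == y_true.astype(bool)).mean()

print(churn_pred)
print(f"accuracy: {accuracy:.2f}")
```

Note that a probability of exactly 0.5 counts as churn under the greater-than-or-equal rule, which is why the third customer in this toy example is misclassified.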

The weights of the logistic regression model are easy to interpret and explain, especially for the categorical variables encoded using the one-hot encoding scheme. This interpretability helps us understand the behavior of the model better and explain to others what it’s doing and how it’s working.
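
One way to inspect the weights is to pair each one-hot feature name with its learned coefficient. The sketch below assumes scikit-learn’s `DictVectorizer` and `LogisticRegression` with default settings and uses a tiny invented dataset, so the resulting numbers only illustrate the mechanics.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: month-to-month customers churn, two-year customers stay.
customers = [
    {"contract": "month-to-month", "techsupport": "no"},
    {"contract": "month-to-month", "techsupport": "no"},
    {"contract": "month-to-month", "techsupport": "yes"},
    {"contract": "one_year", "techsupport": "no"},
    {"contract": "one_year", "techsupport": "yes"},
    {"contract": "two_year", "techsupport": "yes"},
    {"contract": "two_year", "techsupport": "no"},
    {"contract": "two_year", "techsupport": "yes"},
]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(customers)

model = LogisticRegression()
model.fit(X, churned)

# Pair each feature name with its weight and list them from lowest to highest.
weights = dict(zip(dv.feature_names_, model.coef_[0]))
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:26s} {weight:+.3f}")
```

A positive weight means the feature pushes the prediction toward churn and a negative weight pushes it away, so in this toy data the month-to-month contract gets a positive weight and the two-year contract a negative one.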

Related Work

Here is the work on the same dataset contributed on Kaggle:

Future Work for Churn Prediction

We can explore how other feature relationships affect churn. We can also apply unsupervised learning algorithms to uncover underlying patterns among churning customers.
