MBS Mohammed Baobaid
University project, Data Analytics. Created January 14, 2026 at 2:13 PM.

Telco Customer Churn Prediction

A business-facing churn model that turns telecom customer records into an interpretable retention workflow.

This STAT 482 case study treats customer churn as an operating decision rather than a classroom prediction exercise. I rebuilt the Kaggle Telco Customer Churn dataset into a clean modeling table, used logistic regression for explainable churn probabilities, validated the model with ROC/AUC and threshold checks, and translated the evidence into practical retention actions.

Tech stack: R, R Markdown, tidyverse, caret, pROC, ggplot2, logistic regression, Kaggle

  • Customer records: 7,043
  • Holdout customers: 1,760
  • AUC: 0.822
  • Accuracy: 77.6%

Role

Lead analytics author, model builder, and evidence designer with Hamed Al-Saedi and Majid Tayfour

Outcome

The final logistic model reached 77.6% test accuracy and an AUC of 0.822 on 1,760 holdout customers. More importantly, it exposed a clear retention story: short-tenure customers, month-to-month contracts, and fiber-optic service accounts deserved the closest attention.

The Challenge

The real decision was not simply whether a customer might churn. Management needed to know which accounts deserved retention attention, which drivers explained the risk, and how to avoid wasting effort on customers who were unlikely to leave. The case therefore needed a model that was accurate enough to be useful and transparent enough to defend.

The Approach

I treated the analysis as a compact production-style workflow. I cleaned the raw customer table, documented the assumptions behind missing values and recoding, inspected churn patterns before modeling, engineered lifecycle features, fitted an interpretable logistic regression, and evaluated threshold behavior so the model could support retention targeting rather than only report accuracy.

How it works

I started with the retention decision, not the algorithm

The case begins with a management problem: acquisition can look healthy while revenue stalls because existing customers quietly leave. I framed churn prediction as a resource-allocation problem for a retention team. The goal was to identify accounts worth acting on, explain why they were risky, and keep the model transparent enough that the recommendation could be defended in business language.

I turned the analysis into a reproducible workflow

I did not want the work to live only as a final report. The R Markdown workflow in the GitHub repository made the project rerunnable: load the raw customer table, clean the data, engineer features, split the sample, fit the model, evaluate performance, and save the prediction file. That structure made the case closer to a small analytics system than a one-off spreadsheet exercise.
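
The split step is where a holdout set such as the 1,760 customers comes from. The source does not say how the sample was divided, so the following is only a sketch in plain Python (not the project's R code) of one common choice: a seeded, stratified split that keeps the churn rate similar in both parts. The 25% holdout fraction, the seed, the toy labels, and the function name `stratified_split` are all assumptions for illustration; 1,760 of 7,043 is roughly a quarter.

```python
import random


def stratified_split(labels, holdout_fraction=0.25, seed=42):
    """Split row indices into train and holdout, class by class.

    Assumed approach for illustration; the source does not state the method.
    """
    rng = random.Random(seed)  # fixed seed so the split is rerunnable
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, holdout = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return sorted(train), sorted(holdout)


# Toy labels, not the real dataset: 8 churners and 24 retained customers.
labels = ["Yes"] * 8 + ["No"] * 24
train_idx, holdout_idx = stratified_split(labels)
print(len(train_idx), len(holdout_idx))
```

Fixing the seed is what lets a reviewer reproduce the same holdout rows on a rerun.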

I made the customer table model-ready without hiding assumptions

The source data contained 7,043 customer accounts with demographic, service, contract, billing, and churn fields. The important cleaning decision was TotalCharges: it arrived as text and produced 11 missing values after conversion. I imputed those with the median to preserve the sample size, recoded categorical fields into factors, converted SeniorCitizen into readable yes/no labels, and removed customerID so the model could focus on behavior rather than identifiers.
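
The TotalCharges step can be sketched as follows, in plain Python rather than the project's R code. It assumes the problem entries are blank or otherwise non-numeric strings; the function name `clean_total_charges` and the toy values are illustrative and are not rows from the dataset.

```python
from statistics import median


def clean_total_charges(raw_values):
    """Parse text charges to floats and fill unparseable entries with the median."""
    parsed = []
    for value in raw_values:
        try:
            parsed.append(float(value.strip()))
        except ValueError:
            parsed.append(None)  # blank or non-numeric text becomes missing

    observed = [v for v in parsed if v is not None]
    fill_value = median(observed)  # median of the values that did parse
    n_missing = sum(v is None for v in parsed)
    cleaned = [fill_value if v is None else v for v in parsed]
    return cleaned, n_missing, fill_value


# Toy values for illustration only.
raw = ["29.85", "1889.5", " ", "108.15", ""]
cleaned, n_missing, fill_value = clean_total_charges(raw)
print(n_missing, fill_value)
```

Returning the missing count alongside the cleaned column keeps the imputation visible, in the same spirit as documenting the 11 affected rows.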

I used EDA to locate the shape of churn risk

The exploratory layer showed that churn was not randomly scattered across the customer base. About 26.5% of customers churned, with risk concentrated among early-tenure customers, month-to-month contracts, and higher monthly charges. The monthly-charge density plot made that last signal visible: churners were more concentrated in the higher-charge range, while retained customers had a stronger low-charge cluster.

Figure: Density plot of monthly charges split by churn status. Customers who churned were more concentrated at higher monthly charges, especially compared with the low-charge retained group.

I used logistic regression because the drivers mattered

This case needed interpretation, not only prediction. Logistic regression let me describe how tenure, contract type, internet service, charges, and lifecycle stage changed churn odds. That mattered because management could act on drivers such as contract commitment and fiber-optic service risk. The odds-ratio view made the result easier to explain than a black-box score.
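
For readers less familiar with odds ratios, here is a small sketch in plain Python (not the project's R code) of how a logistic regression coefficient becomes an odds ratio and how that ratio moves a probability. The 0.155 ratio is the two-year-contract figure from the Results section; the 40% baseline churn probability and the function names are hypothetical.

```python
import math


def odds_ratio(coefficient):
    """A logistic coefficient is a change in log-odds; exp() turns it into an odds ratio."""
    return math.exp(coefficient)


def shift_probability(baseline_probability, ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * ratio
    return new_odds / (1 + new_odds)


coefficient = math.log(0.155)  # coefficient implied by the reported odds ratio
ratio = odds_ratio(coefficient)
baseline = 0.40                # hypothetical baseline churn probability
adjusted = shift_probability(baseline, ratio)
print(f"odds ratio: {ratio:.3f}")
print(f"churn probability: {baseline:.2f} -> {adjusted:.3f}")
```

An odds ratio multiplies odds, not probabilities, so the same 0.155 produces a different probability drop at a different baseline.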

I validated performance as a decision tradeoff

The holdout set contained 1,760 customers. At the default 0.50 threshold, the model achieved 77.6% accuracy and high specificity, meaning it was strong at recognizing customers who would stay. But sensitivity was lower, so many churners would be missed if the company used the default cutoff. That is why I treated threshold tuning as part of the business decision rather than a technical footnote.
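
The tradeoff can be made concrete with a small sketch in plain Python (not the project's R code). The labels and scores are toy values chosen for illustration, not model output, and `threshold_metrics` is a hypothetical helper.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """Compute accuracy, sensitivity, and specificity at a given cutoff.

    labels: 1 = churned, 0 = retained. A score >= threshold predicts churn.
    """
    tp = fp = tn = fn = 0
    for actual, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of churners caught
        "specificity": tn / (tn + fp),  # share of stayers correctly left alone
    }


# Toy data: 3 churners and 7 retained customers.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.80, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.15, 0.10, 0.05]

at_default = threshold_metrics(labels, scores, threshold=0.50)
at_lower = threshold_metrics(labels, scores, threshold=0.30)
print(at_default)
print(at_lower)
```

In this toy example accuracy is identical at both cutoffs, while sensitivity rises from one churner in three to all three. That is the kind of difference a single accuracy figure hides.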

Figure: ROC curve for the Telco churn logistic regression model. The curve shows good discriminatory performance, with an AUC of 0.822.
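
AUC also has a pairwise definition: the share of churner/non-churner pairs in which the churner receives the higher score, with ties counted as half. The sketch below computes it directly from that definition in plain Python, not the project's R code. The toy labels and scores and the function name `auc_pairwise` are illustrative only.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (churner, non-churner) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: 3 churners and 7 retained customers.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.80, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.15, 0.10, 0.05]
auc = auc_pairwise(labels, scores)
print(round(auc, 3))
```

The double loop is quadratic in the number of customers, which is fine for a sketch; library implementations reach the same value from sorted ranks.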

I converted scores into a retention operating model

The model becomes useful when it changes how a team works. I translated the evidence into four operating moves: move month-to-month customers toward longer contracts, strengthen onboarding during the first year, monitor fiber-optic service quality, and lower the action threshold when the objective is to catch more churners early. The score is the prioritization layer; the driver tells the team what action to take.
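
The idea that the score prioritizes and the driver picks the action can be sketched as a simple rule table. This is plain Python for illustration, not the project's R code. The field names and values (`Contract`, `tenure`, `InternetService`) are assumed to match the Kaggle dataset's columns, and the 0.5 default threshold, the 12-month reading of "first year", and the action wording are hypothetical.

```python
def retention_actions(customer, churn_score, threshold=0.5):
    """Return suggested actions for one customer, or an empty list below the threshold."""
    if churn_score < threshold:
        return []  # below the action threshold: no retention effort spent

    actions = []
    if customer["Contract"] == "Month-to-month":
        actions.append("offer a longer contract")
    if customer["tenure"] <= 12:  # assumed definition of the first year
        actions.append("strengthen onboarding")
    if customer["InternetService"] == "Fiber optic":
        actions.append("review fiber service quality")
    return actions


# Hypothetical customer record.
customer = {"Contract": "Month-to-month", "tenure": 5, "InternetService": "Fiber optic"}

high_score = retention_actions(customer, 0.72)
below_default = retention_actions(customer, 0.35)
lowered_threshold = retention_actions(customer, 0.35, threshold=0.30)
print(high_score, below_default, lowered_threshold)
```

Lowering the threshold brings the same customer into scope without changing which actions apply, which mirrors the fourth operating move.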

I kept the final artifacts reviewable

The final case is supported by the GitHub repository, the written report, the presentation deck, and the R Markdown analysis file. Together they show the business framing, the statistical workflow, and the reproducible implementation. That matters because an analytics case should be inspectable from multiple angles: manager, reviewer, and technical reader.

What this project says about how I work

This case is smaller than the Trading Systems literature-review project, but I approached it with the same discipline: make the assumptions visible, preserve the evidence, explain the model in business terms, and connect the output to decisions. The professional value is not only the AUC. It is the chain from raw customer records to a retention action list that someone else can follow.

Results

  • I converted 7,043 raw customer accounts into a modeling table with a 26.5% churn base rate.
  • I preserved the test-set prediction file for 1,760 customers so the model could be inspected beyond summary metrics.
  • The ROC curve produced an AUC of 0.822: a randomly chosen churner receives a higher score than a randomly chosen retained customer about 82% of the time.
  • Two-year contracts had much lower churn odds than month-to-month contracts, with an odds ratio near 0.155 (roughly 85% lower odds).
  • Fiber-optic customers had roughly three times the churn odds of the reference internet-service group.
  • Threshold tuning with Youden's Index raised sensitivity above 80%, which is more aligned with proactive retention than the default 0.50 cutoff.
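
Youden's index is J = sensitivity + specificity - 1, and the tuned threshold is the cutoff that maximizes it. A minimal sketch in plain Python (not the project's R code), using toy labels and scores rather than the model's predictions; `youden_threshold` is a hypothetical helper.

```python
def youden_threshold(labels, scores):
    """Return (threshold, J, sensitivity, specificity) for the cutoff maximizing J.

    labels: 1 = churned, 0 = retained. A score >= threshold predicts churn.
    Every distinct score is tried as a candidate cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (threshold, j, sensitivity, specificity)
    return best


# Toy data: 3 churners and 7 retained customers.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.80, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.15, 0.10, 0.05]
threshold, j, sensitivity, specificity = youden_threshold(labels, scores)
print(threshold, round(j, 3), sensitivity, round(specificity, 3))
```

Youden's index weights missed churners and false alarms equally. A team whose outreach is cheap relative to a lost customer might reasonably choose an even lower cutoff.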

Key features

01 Built a reproducible R Markdown workflow from raw customer data to model outputs
02 Cleaned 7,043 customer records and preserved the business meaning of each field
03 Converted TotalCharges to numeric and median-imputed 11 missing values
04 Engineered tenure groups to represent lifecycle-stage risk
05 Used logistic regression so drivers could be explained with odds ratios
06 Validated the model with holdout predictions, confusion matrix, ROC curve, and AUC
07 Translated churn probabilities into a practical retention operating model
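
Feature 04, tenure groups, amounts to binning months of tenure into lifecycle stages. The source does not give the bin edges, so the 12/24/48-month cut points and the labels below are assumptions, as is the helper name `tenure_group`. Plain Python for illustration, not the project's R code.

```python
def tenure_group(months):
    """Map tenure in months to a lifecycle-stage label (cut points are assumed)."""
    if months < 0:
        raise ValueError("tenure cannot be negative")
    if months <= 12:
        return "0-12 months"
    if months <= 24:
        return "13-24 months"
    if months <= 48:
        return "25-48 months"
    return "49+ months"


# Toy tenure values for illustration.
tenures = [0, 5, 12, 13, 30, 72]
groups = [tenure_group(m) for m in tenures]
print(groups)
```

Treating the groups as a categorical predictor lets a logistic model estimate a separate odds ratio per stage instead of assuming the log-odds of churn change linearly with tenure.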
