MBS Mohammed Baobaid
University project · Data Analytics · Created 9 February 2026 at 3:28 PM

Market Segmentation Using Clustering

A reproducible customer segmentation case where I transformed retail transactions into RFM features, selected a four-cluster structure, and translated the segments into marketing actions.

This BANA482 case study was my way of turning a classic retail transaction dataset into something a marketing team could actually use. I cleaned the Online Retail dataset, engineered customer-level Recency, Frequency, and Monetary (RFM) features, compared k-means with hierarchical clustering, and wrote up the findings as a business segmentation story rather than a purely technical exercise.

541,909 Raw rows
397,884 Clean rows
4,338 Customers
4 Segments

Role

Lead analytics author and clustering workflow builder with Hamed Al-Saedi and Majid Tayfour

Outcome

I reduced 541,909 raw transaction rows to 397,884 valid records, built RFM profiles for 4,338 customers, and identified four customer segments ranging from inactive low-value accounts to a tiny VIP segment with exceptionally high purchase frequency and monetary value.

The Challenge

The business problem was simple but important: the retailer was treating customers as if they behaved the same, even though campaign response and purchasing behavior were clearly uneven. The analytical challenge was to move from transaction rows to a customer-level view that could separate active loyal buyers, regular customers, inactive customers, and high-value VIPs without relying on labels or supervised outcomes.

The Approach

I treated the case as a compact unsupervised-learning workflow. First, I cleaned the raw Online Retail data and removed records that would distort customer behavior. Then I engineered RFM features, scaled them, tested the number of clusters with elbow and silhouette diagnostics, fitted a four-cluster k-means model, profiled each segment, and used PCA plus hierarchical clustering to check whether the solution was interpretable and reasonably stable.
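The scaling step matters because Recency (days), Frequency (invoice counts), and Monetary (spend) sit on very different scales, so an unscaled distance would be driven largely by spend. Below is a minimal sketch of z-score standardization in plain Python. It is an illustration, not the project's R code: the sample rows are invented, and it divides by the population standard deviation, whereas R's scale() divides by the sample standard deviation.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so every feature has mean 0 and unit spread."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    # Guard against constant columns, whose spread would be zero.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - c) / s for value, c, s in zip(row, centers, spreads))
        for row in rows
    ]

# Hypothetical (recency_days, frequency, monetary) rows, for illustration only.
rfm = [(5, 40, 9000.0), (20, 12, 2500.0), (200, 1, 150.0), (90, 3, 600.0)]
scaled = standardize(rfm)

for row in scaled:
    print(tuple(round(v, 2) for v in row))
```

After this step each feature contributes on a comparable scale to the Euclidean distances that k-means and Ward linkage rely on.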

How it works

I framed segmentation as a marketing decision

I started the case from the management problem rather than the algorithm. A one-size-fits-all marketing approach wastes attention because the same offer will not mean the same thing to a VIP customer, a regular buyer, and a lapsed account. My objective was to create customer groups that were analytically defensible and easy enough for a marketing team to act on.

I rebuilt transaction rows into customer behavior

The source file contained invoice-line records, not customer segments. I removed canceled invoices, non-positive quantities, non-positive prices, and missing customer IDs. Then I aggregated the cleaned transactions into RFM variables: Recency measured days since the latest purchase, Frequency counted distinct invoices, and Monetary captured total customer spend. This step was the bridge from raw retail operations to customer analytics.
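The cleaning and aggregation logic described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the R code from the project: the field names and sample rows are invented, the reference day for Recency is chosen arbitrarily for the example, and cancellations are assumed to be marked by an invoice number beginning with "C".

```python
from datetime import date

# Hypothetical invoice lines; field names are illustrative, not the dataset's schema.
lines = [
    {"invoice": "1001", "customer": "A", "day": date(2011, 11, 1), "qty": 2, "price": 5.0},
    {"invoice": "1001", "customer": "A", "day": date(2011, 11, 1), "qty": 1, "price": 10.0},
    {"invoice": "1002", "customer": "A", "day": date(2011, 12, 1), "qty": 3, "price": 4.0},
    {"invoice": "C1003", "customer": "A", "day": date(2011, 12, 2), "qty": -3, "price": 4.0},
    {"invoice": "1004", "customer": None, "day": date(2011, 12, 3), "qty": 1, "price": 2.0},
    {"invoice": "1005", "customer": "B", "day": date(2011, 10, 1), "qty": 4, "price": 0.0},
    {"invoice": "1006", "customer": "B", "day": date(2011, 9, 1), "qty": 5, "price": 3.0},
]

def is_valid(line):
    """Keep only lines that describe a real, attributable purchase."""
    return (
        line["customer"] is not None
        and not line["invoice"].startswith("C")  # assumed cancellation marker
        and line["qty"] > 0
        and line["price"] > 0
    )

def build_rfm(lines, reference_day):
    """Aggregate valid invoice lines into one RFM record per customer."""
    per_customer = {}
    for line in filter(is_valid, lines):
        rec = per_customer.setdefault(
            line["customer"], {"last": line["day"], "invoices": set(), "spend": 0.0}
        )
        rec["last"] = max(rec["last"], line["day"])
        rec["invoices"].add(line["invoice"])         # distinct invoices only
        rec["spend"] += line["qty"] * line["price"]  # line revenue
    return {
        cust: {
            "recency": (reference_day - rec["last"]).days,
            "frequency": len(rec["invoices"]),
            "monetary": rec["spend"],
        }
        for cust, rec in per_customer.items()
    }

# The reference day is an assumption made for this example.
rfm = build_rfm(lines, reference_day=date(2011, 12, 10))
for customer, features in sorted(rfm.items()):
    print(customer, features)
```

Counting distinct invoices rather than rows is the detail that keeps Frequency from being inflated by multi-line baskets.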

I selected four clusters with two diagnostics

Before fitting the final model, I tested the number of clusters instead of guessing it. The elbow method showed that the within-cluster sum of squares improved much more slowly beyond four clusters, while the silhouette method reached its strongest value at four clusters. Together, those checks gave me a reasonable balance between interpretability and separation.

Figure: Elbow method chart for selecting the number of customer clusters. The elbow curve shows diminishing improvement after four clusters.
Figure: Silhouette method chart for selecting the number of customer clusters. The silhouette diagnostic supports the four-cluster solution used in the final model.
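To make the two diagnostics concrete, here is a small self-contained sketch in plain Python that computes the within-cluster sum of squares (the elbow quantity) and the average silhouette width for several values of k. It is not the project's R code. The points are invented 2-D data with four obvious groups, and the deterministic farthest-point initialization is an assumption chosen so the example is repeatable.

```python
from math import dist

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j])) for p in points]

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(
                tuple(sum(a) / len(members) for a in zip(*members)) if members else old
            )
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers

def wcss(points, labels, centers):
    """Total within-cluster sum of squares: the quantity plotted in an elbow chart."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Invented 2-D points forming four tight groups; real input would be scaled RFM rows.
corners = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(x + dx, y + dy) for x, y in corners for dx, dy in [(0, 0), (0, 1), (1, 0)]]

inertia, silhouette = {}, {}
for k in range(1, 7):
    labels, centers = kmeans(points, k)
    inertia[k] = wcss(points, labels, centers)
    if k > 1:
        silhouette[k] = mean_silhouette(points, labels)

best_k = max(silhouette, key=silhouette.get)
print("best k by silhouette:", best_k)
```

On real customer data the picture is rarely this clean, which is why reading the two diagnostics together is more defensible than trusting either alone.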

I translated clusters into customer profiles

The four-cluster solution gave the case its business value. Cluster 4 was the tiny VIP group: only 13 customers, but with an average purchase frequency of 82.5 and an average monetary value of 127,338. Cluster 1 contained 209 active high-value customers with strong frequency and spend. Cluster 2 was the mainstream base, with 3,055 regular customers and moderate value. Cluster 3 contained 1,061 inactive or low-value customers, with high recency values (a long time since their last purchase) and low spend. Together the four segments account for all 4,338 customers. That profile made the recommendation practical: protect VIPs, grow active customers, nurture regulars, and use low-cost reactivation for lapsed accounts.
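Profiling of this kind amounts to a group-by on the cluster label, computed on the original RFM units rather than the standardized values so the averages stay readable. A minimal sketch in plain Python follows. The customers and labels are invented for illustration and are not the project's data.

```python
from collections import defaultdict
from statistics import mean

def profile(rfm_rows, labels):
    """Return size and mean recency/frequency/monetary for each cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rfm_rows, labels):
        groups[label].append(row)
    return {
        label: {
            "size": len(rows),
            "recency": mean(r[0] for r in rows),
            "frequency": mean(r[1] for r in rows),
            "monetary": mean(r[2] for r in rows),
        }
        for label, rows in sorted(groups.items())
    }

# Hypothetical (recency_days, frequency, monetary) rows with made-up cluster labels.
rfm_rows = [
    (5, 80, 120000.0),
    (7, 90, 130000.0),
    (240, 1, 400.0),
    (260, 2, 500.0),
    (250, 1, 600.0),
]
labels = [4, 4, 3, 3, 3]

summary = profile(rfm_rows, labels)
for label, stats in summary.items():
    print(label, stats)
```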

I visualized the segmentation with PCA

Because k-means works in three-dimensional standardized RFM space, I used PCA to project the customer structure onto two dimensions. The visualization does not pretend that real customers separate perfectly; instead, it shows a realistic pattern where lower-value segments overlap while the high-value group pulls away strongly. That made the result easier to explain in the report: useful segmentation does not need perfect geometry, but it does need meaningful behavioral differences.

Figure: PCA visualization of customer clusters from the RFM model. The PCA plot shows partial but meaningful separation among the four RFM-based customer clusters.
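For readers who want to see what the projection does mechanically, the sketch below reduces three features to two principal components in plain Python, using power iteration with deflation on the covariance matrix. It is a teaching sketch, not the project's R code: the input rows are invented, and power iteration is a swapped-in technique chosen to avoid external libraries.

```python
from math import sqrt

def center_and_covariance(rows):
    """Return mean-centered rows and their sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return centered, cov

def times(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def leading_eigenpair(matrix, steps=500):
    """Power iteration: repeatedly multiply and renormalise a starting vector."""
    vec = [1.0 / sqrt(len(matrix))] * len(matrix)
    for _ in range(steps):
        nxt = times(matrix, vec)
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, times(matrix, vec)))
    return value, vec

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered, cov = center_and_covariance(rows)
    components, variances = [], []
    for _ in range(2):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        variances.append(value)
        # Deflation: remove the direction just found so the next pass finds the runner-up.
        d = len(vec)
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)] for a in range(d)]
    scores = [
        tuple(sum(x * w for x, w in zip(row, comp)) for comp in components)
        for row in centered
    ]
    return scores, variances, components

# Invented 3-feature rows standing in for standardized RFM values.
rows = [
    (2.0, 0.5, 0.1),
    (-2.0, -0.4, 0.0),
    (1.5, -1.0, -0.1),
    (-1.5, 1.0, 0.1),
    (3.0, 0.2, -0.2),
    (-3.0, -0.3, 0.1),
]
scores, variances, components = pca_2d(rows)
print("variance captured by PC1 and PC2:", [round(v, 3) for v in variances])
```

The two score columns are what a 2-D cluster plot draws; whatever variance sits in the third component is simply not visible in that picture.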

I used hierarchical clustering as a robustness check

I did not want to rely on k-means alone, so I also applied hierarchical clustering with Euclidean distance and Ward linkage. The dendrogram helped me inspect how customers merged step by step, and cutting the tree at four clusters produced a structure broadly consistent with the k-means solution. The exact membership was not identical, but the overall pattern supported the same segmentation story.

Figure: Hierarchical clustering dendrogram with a four-cluster cut. Ward hierarchical clustering gave me a second view of the same customer-grouping structure.
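The idea behind this check can be sketched in plain Python: merge clusters bottom-up using Ward's criterion, stop at four groups, and measure how often two partitions agree on whether a pair of customers belongs together (the Rand index). This is a naive illustration meant only for tiny inputs, not the project's R code. The points are invented, the k-means labels are a hand-written stand-in, and the Rand index is an assumed choice of agreement measure.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def ward_clusters(points, k):
    """Start from singletons and merge the cheapest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

def rand_index(labels_a, labels_b):
    """Share of point pairs on which two partitions agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)

# Invented 2-D points forming four tight groups.
corners = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(x + dx, y + dy) for x, y in corners for dx, dy in [(0, 0), (0, 1), (1, 0)]]

ward = ward_clusters(points, k=4)
ward_labels = [next(i for i, c in enumerate(ward) if p in c) for p in points]

# Hypothetical stand-in for a k-means labelling of the same points.
kmeans_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
agreement = rand_index(ward_labels, kmeans_labels)
print("pairwise agreement:", agreement)
```

A value near 1 would correspond to the "broadly consistent" outcome described above; lower values would signal that the two methods disagree about who belongs together.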

I connected the model to marketing action

The final recommendation was not "use clustering" in the abstract. It was a segment strategy. VIP customers deserve retention attention and personalized service because they generate disproportionate value. Active high-value customers are strong loyalty-program candidates. Regular customers can be nudged with targeted offers that increase frequency or basket size. Inactive customers should receive lower-cost reactivation campaigns, because heavy marketing spend may not be justified for every lapsed account.

I kept the limitations visible

The case is intentionally honest about what the model can and cannot say. RFM is interpretable, but it only captures purchasing behavior. It does not include demographics, browsing behavior, campaign history, product-category preferences, or customer feedback. K-means also assumes compact, roughly spherical clusters under a distance metric, which is a simplification for customer data. In future work, I would add richer behavioral variables, test model-based or density-based clustering, and track how customers move between segments over time.

What this project says about how I work

This case taught me to think like the person who has to defend the analysis after the chart is finished. I was responsible for the cleaning choices, the feature design, the cluster validation, the interpretation, and the final business story. The value of the project is not only that I produced four clusters. It is that I turned messy transaction data into a clear segmentation framework that a non-technical stakeholder could understand and use.

Results

  • I cleaned the raw Online Retail file from 541,909 transaction rows to 397,884 valid records for analysis.
  • I built customer-level RFM features for 4,338 customers, making the dataset suitable for segmentation.
  • The elbow curve flattened after four clusters, and the silhouette diagnostic also supported k = 4.
  • Cluster 4 contained only 13 customers but had the highest average value: average recency of 6.66 days, average frequency of 82.5 invoices, and average monetary value of 127,338.
  • Cluster 1 represented 209 active high-value customers, with average recency of 15.0 days, average frequency of 22.1 invoices, and average monetary value of 12,510.
  • Cluster 3 captured 1,061 inactive or low-value customers, with average recency of 248.0 days, average frequency of 1.55 invoices, and average monetary value of 478.
  • Hierarchical clustering produced a broadly consistent four-group structure, increasing confidence in the segmentation story.
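A quick arithmetic check on the reported figures, written as a short Python sketch. The numbers are taken from this write-up. The per-segment totals multiply rounded averages by segment size, so they are approximations. Cluster 2's size of 3,055 comes from the profile section; its average spend is not reported here, so no total is computed for it.

```python
raw_rows, clean_rows, customers = 541_909, 397_884, 4_338
segment_sizes = {1: 209, 2: 3_055, 3: 1_061, 4: 13}
avg_monetary = {1: 12_510, 3: 478, 4: 127_338}  # rounded averages as reported

removed_rows = raw_rows - clean_rows
retained_share = clean_rows / raw_rows
size_share = {c: n / customers for c, n in segment_sizes.items()}
# Approximate totals: rounded average spend times segment size.
approx_spend = {c: segment_sizes[c] * avg for c, avg in avg_monetary.items()}

print(f"removed rows: {removed_rows:,} ({1 - retained_share:.1%} of raw)")
print(f"segments cover all customers: {sum(segment_sizes.values()) == customers}")
for c in sorted(segment_sizes):
    print(f"cluster {c}: {segment_sizes[c]:,} customers, {size_share[c]:.1%} of base")
```

On these figures, the 13 VIP customers are about 0.3% of the customer base, yet their approximate combined spend (about 1.66 million) is more than three times that of the 1,061 customers in Cluster 3 (about 0.51 million).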

Key features

01 Cleaned canceled invoices, invalid quantities, invalid prices, and missing customer IDs
02 Aggregated transaction-level data into customer-level RFM features
03 Standardized Recency, Frequency, and Monetary variables before distance-based clustering
04 Used elbow and silhouette diagnostics to select a four-cluster solution
05 Profiled clusters in business language using average recency, frequency, monetary value, and size
06 Visualized customer separation with PCA
07 Used Ward hierarchical clustering as a robustness check
08 Translated segment profiles into retention, loyalty, and reactivation recommendations

Tech stack

R, R Markdown, readxl, dplyr, lubridate, ggplot2, factoextra, cluster, K-means clustering, Hierarchical clustering, PCA, RFM analysis