Market Segmentation Using Clustering Project

01

I framed segmentation as a marketing decision

I started the case from the management problem rather than the algorithm. A one-size-fits-all marketing approach wastes attention because the same offer will not mean the same thing to a VIP customer, a regular buyer, and a lapsed account. My objective was to create customer groups that were analytically defensible and easy enough for a marketing team to act on.

02

I rebuilt transaction rows into customer behavior

The source file contained invoice-line records, not customer segments. I removed canceled invoices and any rows with non-positive quantities, non-positive prices, or missing customer IDs. Then I aggregated the cleaned transactions into RFM variables: Recency measured the days since each customer's latest purchase, Frequency counted the customer's distinct invoices, and Monetary captured the customer's total spend. This step was the bridge from raw retail operations to customer analytics.
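The cleaning and aggregation described above can be sketched as follows. This is a minimal illustration in Python with pandas, not the project's actual code: the language, the column names (InvoiceNo, CustomerID, InvoiceDate, Quantity, UnitPrice), the "C" prefix used to mark canceled invoices, the snapshot date that Recency is measured against, and the toy rows are all assumptions.

```python
import pandas as pd


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate invoice-line rows into one RFM row per customer.

    Assumed (hypothetical) columns: InvoiceNo, CustomerID, InvoiceDate,
    Quantity, UnitPrice. Canceled invoices are assumed to start with "C".
    """
    canceled = tx["InvoiceNo"].astype(str).str.startswith("C")
    keep = (
        ~canceled
        & (tx["Quantity"] > 0)
        & (tx["UnitPrice"] > 0)
        & tx["CustomerID"].notna()
    )
    clean = tx.loc[keep].copy()
    clean["LineTotal"] = clean["Quantity"] * clean["UnitPrice"]

    return clean.groupby("CustomerID").agg(
        # Days between the snapshot date and the customer's latest purchase
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        # Number of distinct invoices
        Frequency=("InvoiceNo", "nunique"),
        # Total spend across all kept invoice lines
        Monetary=("LineTotal", "sum"),
    )


# Toy rows, invented for illustration only
tx_demo = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "C1003", "1004", "1005"],
    "CustomerID": [1.0, 1.0, 1.0, 1.0, 2.0, None],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-01-01", "2011-01-10",
        "2011-01-11", "2011-01-05", "2011-01-06",
    ]),
    "Quantity": [2, 1, 3, -3, 5, 1],
    "UnitPrice": [10.0, 5.0, 2.0, 2.0, 1.0, 4.0],
})
rfm_demo = build_rfm(tx_demo, snapshot=pd.Timestamp("2011-01-12"))
```

The snapshot date matters: Recency is only comparable across customers when every customer is measured against the same reference day.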

03

I selected four clusters with two diagnostics

Before fitting the final model, I tested the number of clusters instead of guessing it. The elbow method showed a clear slowdown in the reduction of within-cluster variation after four clusters, while the silhouette diagnostic reached its highest value at four clusters. Together, those checks gave me a reasonable balance between interpretability and separation.

Figure: Elbow method chart for selecting the number of customer clusters. The elbow curve shows diminishing improvement after four clusters.

Figure: Silhouette method chart for selecting the number of customer clusters. The silhouette diagnostic supports the four-cluster solution used in the final model.
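The two diagnostics can be computed side by side. The sketch below is one way to do it in Python with scikit-learn, which is an assumption about tooling rather than a record of what the project used. The function name cluster_diagnostics, the range of k values, and the synthetic data (deliberately built with four groups) are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_diagnostics(X, k_values, seed=0):
    """Return {k: (inertia, mean silhouette)} for k-means on standardized X."""
    X_std = StandardScaler().fit_transform(X)
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_std)
        # inertia_ is the within-cluster sum of squares used for the elbow plot
        results[k] = (model.inertia_, silhouette_score(X_std, model.labels_))
    return results


# Synthetic stand-in for an RFM table: four well-separated groups
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 10, 0], [0, 10, 10], [10, 0, 10]])
X_diag_demo = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 3)) for c in centers]
)

diag = cluster_diagnostics(X_diag_demo, range(2, 8))
# k with the highest mean silhouette score
best_k = max(diag, key=lambda k: diag[k][1])
```

Plotting the inertia values against k gives the elbow curve, and plotting the silhouette values gives the second chart. Real RFM data is far less tidy than this synthetic example, so the two diagnostics should be read together rather than trusted individually.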

04

I translated clusters into customer profiles

The four-cluster solution gave the case its business value. Cluster 4 was the tiny VIP group: only 13 customers, but with an average purchase frequency of 82.5 invoices and an average monetary value of 127,338. Cluster 1 contained 209 active high-value customers with strong frequency and spend. Cluster 2 was the mainstream base, with 3,055 regular customers and moderate value. Cluster 3 contained 1,061 inactive or low-value customers, with high recency (a long time since the last purchase) and low spend. That profile made the recommendation practical: protect VIPs, grow active customers, nurture regulars, and use low-cost reactivation for lapsed accounts.
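A profile table like the one summarized above comes from grouping the RFM rows by cluster label and averaging each variable. The following is a minimal sketch in Python with pandas, assuming an RFM table with Recency, Frequency, and Monetary columns. The function name profile_clusters and the five demo rows are hypothetical and are not the project's data.

```python
import pandas as pd


def profile_clusters(rfm: pd.DataFrame, labels) -> pd.DataFrame:
    """Summarize each cluster by size and mean Recency, Frequency, Monetary."""
    labeled = rfm.assign(Cluster=list(labels))
    profile = labeled.groupby("Cluster").agg(
        Customers=("Recency", "size"),
        AvgRecency=("Recency", "mean"),
        AvgFrequency=("Frequency", "mean"),
        AvgMonetary=("Monetary", "mean"),
    )
    # Highest-value segments first, so the table reads like a priority list
    return profile.sort_values("AvgMonetary", ascending=False)


# Invented values for illustration only
rfm_profile_demo = pd.DataFrame({
    "Recency": [5, 3, 200, 180, 30],
    "Frequency": [80, 90, 1, 2, 5],
    "Monetary": [120000.0, 130000.0, 150.0, 250.0, 900.0],
})
profile_demo = profile_clusters(rfm_profile_demo, labels=[4, 4, 3, 3, 2])
```

Reporting the customer count next to the averages keeps a very small segment, such as a VIP group, from being mistaken for a large one.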

05

I visualized the segmentation with PCA

Because k-means was fit in the three-dimensional standardized RFM space, I used PCA to project the customer structure onto two dimensions. The visualization does not pretend that real customers separate perfectly; instead, it shows a realistic pattern where lower-value segments overlap while the high-value group pulls away strongly. That made the result easier to explain in the report: useful segmentation does not need perfect geometry, but it does need meaningful behavioral differences.

Figure: PCA visualization of customer clusters from the RFM model. The PCA plot shows partial but meaningful separation among the four RFM-based customer clusters.
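The projection behind a plot like this can be sketched in a few lines. The example assumes Python with scikit-learn, which the original write-up does not confirm. The function name project_to_2d and the random stand-in data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_to_2d(X):
    """Standardize the features, then project onto the first two principal components."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X_std)
    # explained_variance_ratio_ shows how much structure the 2D view retains
    return coords, pca.explained_variance_ratio_


# Random stand-in for Recency, Frequency, Monetary on very different scales
rng = np.random.default_rng(1)
X_pca_demo = rng.normal(size=(200, 3)) * np.array([30.0, 5.0, 1000.0])
coords, var_ratio = project_to_2d(X_pca_demo)
```

The two columns of coords can then be drawn as a scatter plot colored by cluster label. Checking var_ratio is a useful habit: if the first two components explain little of the variance, overlap in the plot may reflect the projection rather than the clusters.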

06

I used hierarchical clustering as a robustness check

I did not want to rely on k-means alone, so I also applied hierarchical clustering with Euclidean distance and Ward linkage. The dendrogram helped me inspect how customers merged step by step, and cutting the tree at four clusters produced a structure broadly consistent with the k-means solution. The exact membership was not identical, but the overall pattern supported the same segmentation story.

Figure: Hierarchical clustering dendrogram with a four-cluster cut. Ward hierarchical clustering gave me a second view of the same customer-grouping structure.
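A robustness check of this kind can be sketched as follows, assuming Python with SciPy and scikit-learn. The adjusted Rand index is used here as one possible way to quantify agreement between the two solutions; the write-up does not say how consistency was assessed, so that choice, the function name ward_labels, and the synthetic data are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def ward_labels(X_std, n_clusters):
    """Cut a Ward-linkage tree (Euclidean distance) into n_clusters groups."""
    tree = linkage(X_std, method="ward", metric="euclidean")
    # criterion="maxclust" cuts the tree so that at most n_clusters remain
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic stand-in for standardized RFM features: four separated groups
rng = np.random.default_rng(2)
centers = np.array([[0, 0, 0], [8, 8, 0], [0, 8, 8], [8, 0, 8]])
X_raw = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])
X_hc_demo = StandardScaler().fit_transform(X_raw)

hc = ward_labels(X_hc_demo, 4)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_hc_demo)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
agreement = adjusted_rand_score(km, hc)
```

The index ignores how the clusters are numbered, so it compares memberships directly even though k-means and the tree cut assign different label values. On real customer data the agreement would be expected to fall below the near-perfect match this clean synthetic example produces.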

07

I connected the model to marketing action

The final recommendation was not "use clustering" in the abstract. It was a segment strategy. VIP customers deserve retention attention and personalized service because they generate disproportionate value. Active high-value customers are strong loyalty-program candidates. Regular customers can be nudged with targeted offers that increase frequency or basket size. Inactive customers should receive lower-cost reactivation campaigns, because heavy marketing spend may not be justified for every lapsed account.

08

I kept the limitations visible

The case is intentionally honest about what the model can and cannot say. RFM is interpretable, but it only captures purchasing behavior. It does not include demographics, browsing behavior, campaign history, product-category preferences, or customer feedback. K-means also relies on Euclidean distance to cluster centers, so it works best when the groups are roughly spherical. In future work, I would add richer behavioral variables, test model-based or density-based clustering, and track how customers move between segments over time.

09

What this project says about how I work

This case taught me to think like the person who has to defend the analysis after the chart is finished. I was responsible for the cleaning choices, the feature design, the cluster validation, the interpretation, and the final business story. The value of the project is not only that I produced four clusters. It is that I turned messy transaction data into a clear segmentation framework that a non-technical stakeholder could understand and use.

