Interpreting cluster variables - raw vs scaled

I already referred to these posts here and here. I also posted here, but since there was no response, I am posting here as well.

Currently, I am working on customer segmentation using their purchase data.

So, my data has the below info for each customer (revenue, recency in days, tenure, etc.).

Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units, etc.

But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information that actually differentiates the customers from one another? On the other hand, I also understand that the monetary value could be construed as carrying a high weight in the model, because it might go up to the range of 100K or even millions.

Let's assume that I normalized the data and my clustering returned 3 clusters. How do I answer the below questions meaningfully?

q1) what is the average revenue from customers who are under cluster 1?

q2) what is the average recency (in days) for a customer from cluster 2?

q3) what is the average age of customer with us (tenure) under cluster 3?

Answering all the above questions using normalized data wouldn't make sense, because the values would all be on a uniform scale (mean 0, sd 1, etc.).

So, I was wondering whether it is meaningful to do the below:

a) cluster using normalized/scaled variables

b) Once clusters are identified, use the customer_id under each cluster to get the original variable values (from the input dataframe, before normalization) and make inferences or interpret the clusters?

So, do you think this would allow me to answer my questions in a meaningful way?

Is this how data scientists interpret clusters? Do they always have to link back to the input dataframe?
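To make this concrete, here is roughly the workflow I have in mind, sketched with made-up column names and values, using scikit-learn's StandardScaler and KMeans:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data with made-up column names (revenue, recency in days, tenure in months)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "revenue":     [120_000, 95_000, 400, 650, 15_000, 18_000],
    "recency":     [10, 14, 200, 180, 45, 60],
    "tenure":      [36, 48, 6, 8, 24, 30],
})
features = ["revenue", "recency", "tenure"]

# a) cluster on the scaled variables
X_scaled = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# b) interpret the clusters on the original, unscaled values
print(df.groupby("cluster")[features].mean())  # per-cluster averages for q1 to q3
```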

Topic: predictive-modeling, k-means, clustering, data-mining, machine-learning

Category: Data Science


A simple way to estimate the loss of information due to normalization/scaling is to apply the inverse transform and see how different the result is from the raw data. If the loss is very low (e.g. 0.1%), scaling is not an issue.
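A minimal sketch of that round-trip check, assuming scikit-learn and hypothetical customer features; note that plain scaling is invertible, so its loss is essentially zero, whereas a lossy step such as PCA shows a real reconstruction error:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical raw features: revenue, recency (days), tenure (months)
X = np.column_stack([
    rng.gamma(2.0, 50_000, size=500),
    rng.integers(1, 365, size=500).astype(float),
    rng.integers(1, 120, size=500).astype(float),
])

def round_trip_loss(transformer, X):
    """Relative difference between the raw data and transform + inverse_transform."""
    X_back = transformer.inverse_transform(transformer.fit_transform(X))
    return np.abs(X_back - X).sum() / np.abs(X).sum()

# Scaling alone is invertible: its loss is ~0 (up to float precision)
print(f"StandardScaler loss: {round_trip_loss(StandardScaler(), X):.4%}")
# A lossy transform (e.g. PCA down to 2 components) shows a measurable loss
print(f"PCA(2) loss:         {round_trip_loss(PCA(n_components=2), X):.4%}")
```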

On the other hand, if your clustering works very well for 10k customers, it should work well for 1 million. Generally speaking, it is better to build a very good model on a small random sample and then scale it up progressively until you reach production scale.

You can build clusters either from one feature or from several features.

Due to the complexity of the problem, it is generally better to start with one feature and then extend to several.

Making clusters from several features works better with dimensionality reduction algorithms (e.g. UMAP), because they automatically project all your dimensions onto a 2D plane and let you run interesting correlation studies across all customers.

If you apply a good multi-dimensional clustering, all the features are taken into account and every point is represented by a customer_id.

If you select a cluster through a clustering technique (e.g. DBSCAN), you just have to extract the list of customers in that cluster, filter the raw data with this list, and start your data analysis to answer q1, q2, or q3.
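A minimal sketch of that pipeline, assuming the umap-learn and scikit-learn packages, a raw customer table `df` with the same placeholder columns as above (many rows, not the toy example), and illustrative parameter values for UMAP and DBSCAN:

```python
import umap                                    # pip install umap-learn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# df: raw customer table with placeholder columns as above
features = ["revenue", "recency", "tenure"]

# Scaling is optional for UMAP (see the note below) but harmless here
X_scaled = StandardScaler().fit_transform(df[features])

# Project all features onto a 2D plane, then cluster the embedding
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(embedding)  # -1 marks noise

# Extract the customer ids of one cluster, filter the raw data with that list,
# and answer q1, q2, q3 in the original units
ids_in_cluster = df.loc[labels == 0, "customer_id"]
raw_subset = df[df["customer_id"].isin(ids_in_cluster)]
print(raw_subset[features].mean())
```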

Note that whether normalization is needed depends on the dimensionality reduction algorithm you are using: UMAP wouldn't require it, whereas t-SNE or PCA does.

https://towardsdatascience.com/tsne-vs-umap-global-structure-4d8045acba17

Finally, the interpretation of clusters should be backed by actual evidence: even if algorithms are often very efficient at clustering data, it is crucial to add indicators to check that the data has been distributed sensibly across clusters (for instance, comparing mean or standard deviation values between clusters, as in the sketch below). In some cases, if the raw data has too wide a distribution, it can be worth applying a log transform, though you might lose information.
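Continuing the earlier sketch (a dataframe `df` with a `cluster` column and the placeholder feature names), those indicators and the log transform could look like this:

```python
import numpy as np

features = ["revenue", "recency", "tenure"]

# Compare simple indicators (mean, standard deviation) between clusters on raw values
print(df.groupby("cluster")[features].agg(["mean", "std"]))

# If a raw feature spans a very wide range (e.g. revenue up to millions),
# a log transform before scaling can tame it, at the cost of some information
df["log_revenue"] = np.log1p(df["revenue"])
```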
