How to get the probability/closeness of a sample belonging to a specific cluster?

Question

Jaskaran Singh Puri

March 3, 2022 13:06

I'm new to this, so please let me know if my logic of comparing cosine similarity and k-means is incorrect.

I got a set of 4 clusters from k-means, and I'm interested in Cluster No. 1. For this cluster, I take the average of all values for each column and keep it aside.

Now I have a test sample. I run k-means prediction on it and get 1 as the output, meaning it belongs to Cluster No. 1, which is good for me. But my actual use case is different: even if a sample doesn't belong to Cluster No. 1, I want to know how close it was to falling into that cluster.

To solve this, I thought of computing the cosine similarity between my test sample and the vector of column averages for Cluster No. 1. In this case, I get a similarity of just 5%, even though k-means assigns the sample to that cluster.

For my use case (getting the probability/closeness of a sample belonging to a specific cluster), I'm not sure which of the two is the better interpretation.

I know I can use the cluster labels as y variables and build a multi-class classification model, but I want to keep it as unsupervised as possible. Please guide.

Topic unsupervised-learning cosine-distance classification k-means clustering

Category Data Science

Aj_MLstater · Accepted Answer · May 25, 2021 16:01

Try a Gaussian Mixture Model (GMM). It is similar to KMeans but differs in a few ways. In a nutshell, think of KMeans as a hard clustering model where each sample is assigned to exactly one cluster, whereas GMM is a soft clustering technique that estimates, for each Gaussian component (which can be treated as a cluster), the probability that the component generated the data point in question. You can get both labels and probability scores from the model. Try it and see if it helps in your case. It is available in the scikit-learn library.
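
A minimal sketch of the GMM suggestion, assuming scikit-learn's `GaussianMixture` and synthetic data from `make_blobs` in place of your own feature matrix. The names `X`, `test_sample` and `cluster_of_interest` are placeholders, and GMM component indices will not necessarily line up with the labels an earlier KMeans run produced.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data (assumption: 4 clusters, 5 features).
X, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=0)

# Fit a mixture with one Gaussian component per expected cluster.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# A single test sample, kept 2-D as scikit-learn expects.
test_sample = X[:1]

# Hard label, comparable to what KMeans.predict returns.
labels = gmm.predict(test_sample)

# Soft assignment: one probability per component, each row sums to 1.
probs = gmm.predict_proba(test_sample)

# Probability that the sample belongs to the component of interest.
cluster_of_interest = 1
print(probs[0, cluster_of_interest])
```

Even when `labels[0]` is not the component you care about, `probs[0, cluster_of_interest]` still says how plausible that component is for the sample.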

Another approach, if you have to stick with KMeans:

Take the cluster centers from the KMeans model.
Compute the distance from your test sample vector to each cluster center, and pass the negative distances to the softmax function to get a probability-like score for every cluster center per sample.
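
The KMeans-based steps could be sketched roughly as follows. This is one interpretation, not a standard API: `soft_assignments` and its `temperature` parameter are hypothetical names, and the hard-coded `centers` array stands in for `kmeans.cluster_centers_` from a fitted scikit-learn model. The distances are Euclidean, the same measure KMeans uses for assignment, and they are negated so that closer centers get higher scores.

```python
import numpy as np

def soft_assignments(sample, centers, temperature=1.0):
    """Softmax over negative Euclidean distances from sample to each center."""
    distances = np.linalg.norm(centers - sample, axis=1)
    logits = -distances / temperature   # closer center -> larger logit
    logits = logits - logits.max()      # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical centers; in practice use kmeans.cluster_centers_.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
test_sample = np.array([4.0, 4.5])

scores = soft_assignments(test_sample, centers)

# Closeness score for the cluster of interest (index 1 here).
closeness_to_cluster_1 = scores[1]
print(scores, closeness_to_cluster_1)
```

These scores sum to 1 but are not calibrated probabilities. They depend on the scale of the features and on `temperature`, so they are best read as a relative ranking of how close the sample is to each cluster.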
