Cluster Evaluation with Jaccard and Rand Index

I've clustered my data into 3 groups using 3 different criteria. I used k-means to obtain those clusters, so the label assigned to each cluster is arbitrary and changes on every script run.

To evaluate the consistency of my clusters, I decided to use the Jaccard index, but I can't work out how to apply it properly.

Let's say I have this data, where alpha, beta and gamma are the 3 methods and the Cluster Index (CI) is, for example, the value returned by k-means.

name  CI_alpha  CI_beta  CI_gamma
a     1         2        2
b     1         2        3
c     1         2        3
d     2         3        3
e     2         3        1
f     2         3        2
g     3         1        1
h     3         1        3

What is noteworthy in this dataset is that the methods alpha and beta actually produced exactly the same clusters. However, a Jaccard index computed on the raw labels of these two would return 0, because every label differs even though the labels describe the same grouping.

Do you have any idea how to obtain an informative index in this situation?

Also, I'd like to know whether the Rand index can be used to estimate the concordance between clustering methods even when I don't know the true clusters, or whether that is completely outside its scope.

Topic model-evaluations jaccard-coefficient visualization python clustering

Category Data Science


The Rand index (also consider the adjusted Rand index) measures exactly that: the similarity between two clusterings of the same data. The measure is symmetric, so neither clustering has to be the true one. In Python you can use scikit-learn (sklearn) for that; have a look at its "Clustering performance evaluation" guide for more options.
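As a minimal sketch of the sklearn route, the snippet below feeds the label columns from the question into `rand_score` and `adjusted_rand_score` from `sklearn.metrics`. It assumes a scikit-learn version that ships `rand_score` (0.24 or later); the variable names are only illustrative.

```python
from sklearn.metrics import adjusted_rand_score, rand_score

# Cluster labels for points a..h, copied from the table in the question.
ci_alpha = [1, 1, 1, 2, 2, 2, 3, 3]
ci_beta = [2, 2, 2, 3, 3, 3, 1, 1]
ci_gamma = [2, 3, 3, 3, 1, 2, 1, 3]

# Both scores compare groupings, not label values, so relabelled
# but identical clusterings (alpha vs beta) score 1.0.
ri_alpha_beta = rand_score(ci_alpha, ci_beta)
ari_alpha_beta = adjusted_rand_score(ci_alpha, ci_beta)

# alpha vs gamma disagree on most points, so both scores drop.
ri_alpha_gamma = rand_score(ci_alpha, ci_gamma)
ari_alpha_gamma = adjusted_rand_score(ci_alpha, ci_gamma)

print("alpha vs beta :", ri_alpha_beta, ari_alpha_beta)
print("alpha vs gamma:", ri_alpha_gamma, ari_alpha_gamma)
```

The adjusted variant corrects for chance agreement, so it is usually the easier one to interpret when comparing several methods: values near 0 suggest no more agreement than random labelling would give.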

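The Rand index counts the agreements between two clusterings over all pairs of data points: a pair counts as an agreement when both clusterings put the two points together or both keep them apart. Because only co-membership matters, the label values are irrelevant, so CI_alpha and CI_beta would have a result of 1.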
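To make the pair counting concrete, here is a small pure-Python sketch. The helper names (`pair_counts`, `rand_index`, `pair_jaccard`) are made up for illustration. The last function is a pair-based Jaccard variant, which is one possible way to get a label-independent Jaccard score; it is not something the question's setup prescribes.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count point pairs by whether each clustering puts them together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    both_same = only_a = only_b = both_diff = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both_same += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            both_diff += 1
    return both_same, only_a, only_b, both_diff


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    n11, n10, n01, n00 = pair_counts(labels_a, labels_b)
    total = n11 + n10 + n01 + n00
    return (n11 + n00) / total if total else 1.0


def pair_jaccard(labels_a, labels_b):
    """Jaccard over co-clustered pairs; ignores pairs apart in both."""
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0


# Labels for points a..h from the question's table.
ci_alpha = [1, 1, 1, 2, 2, 2, 3, 3]
ci_beta = [2, 2, 2, 3, 3, 3, 1, 1]
ci_gamma = [2, 3, 3, 3, 1, 2, 1, 3]

print(rand_index(ci_alpha, ci_beta))    # 1.0: same grouping, different labels
print(pair_jaccard(ci_alpha, ci_beta))  # 1.0
print(rand_index(ci_alpha, ci_gamma))
print(pair_jaccard(ci_alpha, ci_gamma))
```

The loop is quadratic in the number of points, which is fine for small data; for larger sets a contingency-table implementation such as the one in sklearn is the more practical choice.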
