How come same cluster category be separated?

Question

How come same cluster category be separated?

Jack Zaki Zakiul Fahmi Jailani

2021年4月13日 15:07

I have these 200 vectors which were clustered using K-means clustering based on keywords weight similarity that was given by TF-IDF (Term Frequency - Inverse Document Frequency). The vectors were clustered with respect to the vectors in four cities which are Amsterdam, Rotterdam, The Hague and, Utrecht. I have chosen k-cluster centroid = 6, which means I have cluster 0 to cluster 5. On each cluster, I also calculated the average number of keyword's numerical weight so that then I got the most relevant and least relevant set of keywords just like the picture below:

Both relevant keywords and least relevant could help the interpretation of what is the cluster about. For example, cluster 0 is related to rail transportation because in the most relevant keywords include tram, line, tramway, station, and rail. And the least relevant keywords emphasized the interpretation of cluster 0 where the keywords include photography, bicycle, iphoneography, green, nature, and flower.

And I have the cluster map of all six clusters in the Amsterdam city shows in this picture:

The problem is that in Amsterdam city, there is no cluster 0, which related to rail transportation. In my analysis opinion, it is because all the vectors that relate to rail transportation were clustered to cluster 3, which is also related to rail transportation (based on my interpretation on most relevant and least relevant keywords on both clusters). Cluster 3 also related to rail transportation because in the most relevant keywords include tram, line, tramway, station, and rail. And the least relevant keywords emphasized the interpretation of cluster 0 where the keywords include photography, bicycle, iphoneography, green, nature, and flower.

There is also the evidence that cluster 3 could not be found in Rotterdam and The Hague city because all vectors related to rail transportation in both cities were all clustered to cluster 0. Below you could find the pictures of the cluster map in both cities:

My question is whether my analysis could be justified? But how come two clusters with the same theme could be separated? why they do not cluster together?

Topic vector-space-models tfidf k-means

Category Data Science

10xAI · Accepted Answer · 2020年7月12日 12:19

This KMeans Clustering is a representative of points in the space based on 200 Features and their length. I think there is some gap in your perspective and the actual clustering.

Other than "Tram", most of the other relevant features are uncommon in both the Rail Cluster. So, two different blobs are created in space.
See this image, data are at the same place on the Tram dimension but other Features have created two different groups.

These are the approaches, you may take -
- Clean all the features for duplicate/similar meaning in the domain context e.g. rail, railway
- Try different cluster count e.g. 5 and see if these two merges
- Even if these are not merging then there is some message and you should figure it out. e.g. gvg, combino, streetcar, city

How come same cluster category be separated?

About