How come same cluster category be separated?
I have these 200 vectors which were clustered using K-means clustering based on keywords weight similarity that was given by TF-IDF (Term Frequency - Inverse Document Frequency). The vectors were clustered with respect to the vectors in four cities which are Amsterdam, Rotterdam, The Hague and, Utrecht. I have chosen k-cluster centroid = 6, which means I have cluster 0 to cluster 5. On each cluster, I also calculated the average number of keyword's numerical weight so that then I got the most relevant and least relevant set of keywords just like the picture below:
Both relevant keywords and least relevant could help the interpretation of what is the cluster about. For example, cluster 0 is related to rail transportation because in the most relevant keywords include tram, line, tramway, station, and rail. And the least relevant keywords emphasized the interpretation of cluster 0 where the keywords include photography, bicycle, iphoneography, green, nature, and flower.
And I have the cluster map of all six clusters in the Amsterdam city shows in this picture:
The problem is that in Amsterdam city, there is no cluster 0, which related to rail transportation. In my analysis opinion, it is because all the vectors that relate to rail transportation were clustered to cluster 3, which is also related to rail transportation (based on my interpretation on most relevant and least relevant keywords on both clusters). Cluster 3 also related to rail transportation because in the most relevant keywords include tram, line, tramway, station, and rail. And the least relevant keywords emphasized the interpretation of cluster 0 where the keywords include photography, bicycle, iphoneography, green, nature, and flower.
There is also the evidence that cluster 3 could not be found in Rotterdam and The Hague city because all vectors related to rail transportation in both cities were all clustered to cluster 0. Below you could find the pictures of the cluster map in both cities:
My question is whether my analysis could be justified? But how come two clusters with the same theme could be separated? why they do not cluster together?
Topic vector-space-models tfidf k-means
Category Data Science