Is it accurate to say that "K-means clusters the vectors based on keyword weight similarity"?

Long story short, I have 200 vectors that are the result of running TF-IDF (Term Frequency - Inverse Document Frequency) on thousands of keywords across a couple hundred documents. The total number of unique keywords I got is 745, meaning there are 745 dimensions/axes. Now I am wondering: how does k-means clustering work on those 200 vectors? Is it accurate to say that k-means clusters those 200 vectors by keyword weight similarity?

Topic vector-space-models tfidf k-means machine-learning

Category Data Science


Long story short:

The distance measure between two vectors used in the k-means algorithm can be user-provided. So "similarity" can mean whatever the user wants it to mean, by supplying the appropriate measure.

Now, for TF-IDF and a geometric measure (e.g. the Euclidean metric), one can say that two feature vectors will be close if they have similar weights for their respective keywords.
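
To make that concrete, here is a minimal k-means sketch in plain NumPy with a pluggable distance function; the keyword-weight matrix is a made-up toy example, and note that with non-Euclidean distances the mean-based centroid update is only a heuristic (k-medoids would be the principled choice):

```python
# Minimal k-means sketch with a user-supplied distance function, to show that
# "closeness" is whatever metric you plug in. Assumes a TF-IDF-like matrix X
# (rows = documents, columns = keyword weights); the values below are made up.
import numpy as np

def kmeans(X, k, distance, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen vectors
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each vector to its nearest centroid under `distance`
        dists = np.array([[distance(x, c) for c in centroids] for x in X])
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned vectors
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

euclidean = lambda a, b: np.linalg.norm(a - b)

# Toy "keyword weight" matrix: 6 documents, 4 keywords
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.1, 0.0, 0.8, 0.2],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.1, 0.6, 0.4],
])
labels, _ = kmeans(X, k=2, distance=euclidean)
print(labels)  # documents with similar keyword weights share a cluster label
```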


K-means clusters points (in your case, points represented as 745-dimensional vectors) by their similarity, that is, by some distance measure between the points (usually the Euclidean distance).
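
In practice this is usually done with a library. Here is a sketch of the typical scikit-learn pipeline; the corpus below is a placeholder for your ~200 documents, the number of clusters is an arbitrary choice, and scikit-learn's KMeans always uses the Euclidean distance:

```python
# Sketch: documents -> TF-IDF matrix -> k-means cluster labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus; in the question this would be the ~200 documents
# that produce a (200, 745) TF-IDF matrix.
documents = [
    "machine learning with keyword weights",
    "deep learning and keyword extraction",
    "football match results and scores",
    "basketball scores from last night's match",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)   # shape: (n_docs, n_unique_terms)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)                # one cluster index per document
print(labels)
```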

TF-IDF produces a vector from a sentence or document, where each entry (axis) represents the frequency of a word in that document weighted by how rare the word is across all sentences or documents (term frequency times inverse document frequency), hence the name. Other weighting schemes are possible as well.
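
As a rough illustration of that weighting, here is a small sketch that builds tf * idf vectors by hand using the classic formula with idf = log(N / df); real implementations such as scikit-learn's TfidfVectorizer use smoothed variants of this:

```python
# Each axis of a document vector holds tf(term, doc) * idf(term),
# with idf = log(N / df). Toy corpus for illustration only.
import math
from collections import Counter

docs = [
    "apple banana apple",
    "banana cherry",
    "apple cherry cherry",
]
tokenised = [d.split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})   # the "axes"
N = len(docs)
df = {w: sum(w in doc for doc in tokenised) for w in vocab}  # document frequency

vectors = []
for doc in tokenised:
    counts = Counter(doc)
    vectors.append([
        (counts[w] / len(doc)) * math.log(N / df[w])    # tf * idf
        for w in vocab
    ])

print(vocab)                 # ['apple', 'banana', 'cherry']
for v in vectors:
    print([round(x, 3) for x in v])
```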

If two vectors are very close to each other, it means the content of their documents is very similar, so they are likely to end up in the same cluster. In contrast, if two vectors are far apart, the words in each document might be completely different, or the frequencies of the shared words might differ.

So the distance between vectors can be interpreted as a measure of similarity between the documents' content.
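
For instance, computing the pairwise Euclidean distances between the TF-IDF vectors of a few toy documents (made up here) shows that documents sharing keywords end up much closer to each other than to an unrelated one:

```python
# Pairwise Euclidean distances between TF-IDF vectors as a (dis)similarity
# measure between documents; smaller distance = more similar content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = [
    "k-means clusters tf-idf vectors",
    "clustering tf-idf vectors with k-means",
    "recipe for chocolate cake",
]
X = TfidfVectorizer().fit_transform(docs)
D = euclidean_distances(X)
print(D.round(2))   # docs 0 and 1 are close; doc 2 is far from both
```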
