mahout clusterdump top terms meaning
I apologize if this has been asked before, and the answer may be obvious, but I am wondering what exactly the numerical value below from clusterdump means:
Top Terms:
monkey = 0.8170868432876803
I believe this is the value of the cluster centroid for that term. But if the term vectors were created with term frequencies, could one interpret it as the average occurrence of "monkey" in the documents that are considered part of the cluster? In that case, would "monkey" appear in 82% of the docs in that cluster, or, more likely, is the average count of "monkey" 0.82?
Looking further, I see words like so:
Top Terms: zebra => 3.432595573440644
So is it best to interpret this as the average count of "zebra" in the set of docs?
And given the radius values, could one treat those as the range of the "monkey" value around that average within the cluster?
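To make the interpretation I have in mind concrete, here is a minimal sketch in plain Python (not Mahout code). It assumes the centroid is simply the per-term arithmetic mean of the TF vectors assigned to the cluster, and that the radius behaves like a per-term standard deviation; both are my assumptions, not something I have confirmed in the Mahout source. The document counts and the function names are made up for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical term-frequency vectors for three documents in one cluster.
# Column order follows TERMS; the counts are invented for illustration.
TERMS = ["monkey", "zebra"]
DOCS = [
    [1.0, 3.0],
    [0.0, 4.0],
    [2.0, 2.0],
]


def cluster_centroid(vectors):
    """Per-term arithmetic mean over all vectors in the cluster."""
    return [fmean(column) for column in zip(*vectors)]


def cluster_spread(vectors):
    """Per-term population standard deviation (assumed analogue of the radius)."""
    return [pstdev(column) for column in zip(*vectors)]


if __name__ == "__main__":
    centroid = cluster_centroid(DOCS)
    spread = cluster_spread(DOCS)
    for term, mean_count, std in zip(TERMS, centroid, spread):
        print(f"{term}: mean count = {mean_count:.3f}, spread = {std:.3f}")
```

Under this reading, a value above 1 (as with "zebra") cannot be a fraction of documents, so it would have to be an average count rather than a percentage.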
mahout seq2sparse -i out/sequenced \
-o out/sparse-kmeans -wt TF --maxDFPercent 100 --namedVector
...
mahout kmeans \
-i out/sparse-kmeans/tf-vectors/ \
-c out/kmeans-clusters \
-o out/kmeans \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-x 10 -k $i -ow --clustering
When one uses tf-idf weighting, it may be best to normalize the output weights by creating a proportion of evidence via Wi = Wi / sum(W). Is that a good idea? (Some Python LDA libs do this.)
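The normalization I am describing would look roughly like the sketch below. This is a post-processing step on the clusterdump output, not a Mahout feature; the function name and the tf-idf weights are hypothetical.

```python
def normalize_weights(weights):
    """Rescale term weights so they sum to 1 (Wi = Wi / sum(W))."""
    total = sum(weights.values())
    if total == 0:
        # Nothing to normalize; return zeros rather than dividing by zero.
        return {term: 0.0 for term in weights}
    return {term: weight / total for term, weight in weights.items()}


if __name__ == "__main__":
    # Made-up tf-idf centroid weights for the top terms of one cluster.
    top_terms = {"monkey": 0.8, "zebra": 3.4, "lion": 0.6}
    for term, share in normalize_weights(top_terms).items():
        print(f"{term}: {share:.3f}")
```

The relative ordering of the terms is unchanged; only the scale differs, so each value reads as a share of the cluster's total top-term weight.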
Thank you.
References
https://mahout.apache.org/users/clustering/cluster-dumper.html