mahout clusterdump top terms meaning

Question

Chris

April 1, 2016, 13:07

I apologize if this has been asked before, and the answer may be obvious, but I am wondering exactly what the numerical value below from clusterdump means:

Top Terms:
   monkey       =  0.8170868432876803

I believe that to be the cluster centroid's value for that term. But if the term vectors were created with term frequencies, could one interpret this as the average occurrence of "monkey" in the documents that are considered part of the cluster? In that case, would "monkey" appear in 82% of the docs in that cluster, or, more likely, is the average count of "monkey" 0.82?

Looking further, I see words like so:

Top Terms: zebra => 3.432595573440644

So it seems best to interpret this as the average count of "zebra" in the set of docs...
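To make the two readings concrete, here is a minimal Python sketch (not Mahout code). It assumes that a k-means centroid is the component-wise mean of the vectors assigned to the cluster. The toy counts and the helper name `centroid` are made up for illustration.

```python
def centroid(vectors):
    """Component-wise mean of equal-length term-count vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical raw term counts (TF weighting) for 4 docs in one cluster.
# Columns: "monkey", "zebra"
docs = [
    [1, 4],
    [0, 3],
    [2, 5],
    [0, 2],
]

# Average count of each term per document in the cluster
center = centroid(docs)

# Fraction of docs that contain each term at least once
doc_fraction = [
    sum(1 for count in col if count > 0) / len(docs)
    for col in zip(*docs)
]

print(center)        # average counts
print(doc_fraction)  # share of docs containing the term
```

Under that assumption the two quantities differ: the centroid weight is an average count, which can exceed 1 (as the "zebra" value does), whereas a fraction of documents cannot.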

And given the radius values, could one consider those to be the range of percentages for "monkey"?
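One way to frame the radius question is the sketch below. It assumes, without confirmation in this post, that the radius reported for a term is the per-term standard deviation of the cluster's vectors around the centroid. The function name `center_and_radius` and the counts are hypothetical.

```python
import math

def center_and_radius(vectors):
    """Per-term mean and population standard deviation.

    Assumption: the "radius" is a per-dimension standard deviation.
    """
    n = len(vectors)
    columns = list(zip(*vectors))
    means = [sum(col) / n for col in columns]
    radii = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, radii

# Hypothetical counts of a single term ("monkey") in 4 docs
counts = [[1], [0], [2], [1]]
means, radii = center_and_radius(counts)
print(means[0], radii[0])
```

Read this way, center plus or minus radius would describe the typical spread of term counts in the cluster rather than a range of percentages.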

mahout seq2sparse -i out/sequenced \
    -o out/sparse-kmeans -wt TF --maxDFPercent 100 --namedVector

...

mahout kmeans \
    -i out/sparse-kmeans/tf-vectors/ \
    -c out/kmeans-clusters \
    -o out/kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k $i -ow --clustering

When one uses tf-idf weighting, it may be best to normalize the output weights into a proportion of evidence via Wi = Wi / sum(W). Is that a good idea? (Some Python LDA libs do this.)
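The normalization Wi = Wi / sum(W) can be sketched in Python as follows. The helper `normalize_weights` is hypothetical, and the two weights from this post are reused only as sample numbers; in practice the sum would run over all top terms of a single cluster.

```python
def normalize_weights(weights):
    """Scale term weights so they sum to 1 (Wi = Wi / sum(W))."""
    total = sum(weights.values())
    if total == 0:
        # Avoid division by zero for an all-zero weight vector
        return {term: 0.0 for term in weights}
    return {term: w / total for term, w in weights.items()}

# Sample weights, for illustration only
top_terms = {
    "monkey": 0.8170868432876803,
    "zebra": 3.432595573440644,
}

proportions = normalize_weights(top_terms)
for term, share in proportions.items():
    print(term, round(share, 3))
```

The result is a relative share within the chosen term list, so it depends on how many top terms are included in the sum.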

Thank you.

References

https://mahout.apache.org/users/clustering/cluster-dumper.html

Topic apache-mahout k-means

Category Data Science
