Elbow method for cosine distance

I have clustered vectors by cosine distance using the NLTK clusterer. If I understand correctly, the Y axis for the elbow method with Euclidean distance would be the sum of the squared distances between each cluster's centroid and the vectors that belong to that cluster.
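Written out, the quantity described above is usually given as follows. The symbols are my own notation, not taken from NLTK: $C_j$ is the set of vectors assigned to cluster $j$, $\mu_j$ is its centroid, and $k$ is the number of clusters.

$$
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

The elbow plot then shows $\mathrm{WCSS}(k)$ on the Y axis against $k$ on the X axis.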

My question is: Would it be the same for clusters using cosine distance?

EDIT: OK, so I tried the sum of squares with cosine distance, and it seems that it's returning the same values... Here's my code:

EDIT2: My bad, it is working.

```python
from nltk.cluster import KMeansClusterer, cosine_distance
import numpy as np

# Load the dataset obtained from http://cs.joensuu.fi/sipu/datasets/a1.txt
testing_vectors = np.loadtxt("a1.txt")

for k in range(1, 10):
    kclusterer = KMeansClusterer(k, distance=cosine_distance)
    # assigned_clusters[i] is the index of the cluster that vector i was put in
    assigned_clusters = kclusterer.cluster(testing_vectors, assign_clusters=True)
    centroids = kclusterer.means()

    # Sum the squared cosine distance from each vector to its own centroid
    sum_of_squares = 0.0
    for vector, cluster_index in zip(testing_vectors, assigned_clusters):
        centroid = centroids[cluster_index]
        # Euclidean alternative:
        # sum_of_squares += np.sum((centroid - vector) ** 2)
        sum_of_squares += cosine_distance(centroid, vector) ** 2

    print("for k=%s the sum of squares is: %s" % (k, sum_of_squares))
```

Topic cosine-distance nltk

Category Data Science


OK, so what I understood is that for clusters built with the cosine metric I can use either one: the sum of squared distances from each centroid to the vectors that belong to its cluster, where the distance is computed either as Euclidean or as cosine. Cosine is probably the better match, since it is the metric the clustering itself used, but it is a bit more involved because of the dot products. Squaring saves computing the square root only in the Euclidean case; the cosine distance, 1 - (u·v)/(‖u‖‖v‖), still needs the vector norms, so squaring it does not remove the square root.
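To make the two options concrete, here is a minimal sketch that computes both sums on a tiny made-up dataset. The helper names (`cosine_dist`, `squared_euclidean`, `within_cluster_sum`) and the toy vectors are assumptions for illustration only; they are not part of NLTK, and the labels and centroids are written by hand rather than produced by a clusterer.

```python
import numpy as np


def cosine_dist(u, v):
    """Cosine distance: 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))


def squared_euclidean(u, v):
    """Squared Euclidean distance; no square root is needed."""
    diff = np.asarray(u) - np.asarray(v)
    return float(np.dot(diff, diff))


def within_cluster_sum(vectors, labels, centroids, sq_dist):
    """Sum sq_dist(centroid, vector) over all vectors, each with its own centroid."""
    return sum(sq_dist(centroids[label], vec) for vec, label in zip(vectors, labels))


# Toy data (hypothetical): two clusters that point in different directions.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
labels = [0, 0, 1, 1]
# Centroids are the plain means of each cluster's vectors.
centroids = np.array([[1.5, 0.0], [0.0, 2.0]])

euclid_total = within_cluster_sum(vectors, labels, centroids, squared_euclidean)
cosine_total = within_cluster_sum(
    vectors, labels, centroids, lambda c, v: cosine_dist(c, v) ** 2
)
print(euclid_total, cosine_total)
```

In this toy case every vector points in exactly the same direction as its centroid, so the cosine-based sum is zero while the Euclidean one is not. That is the practical difference between the two Y axes: the cosine version ignores vector length and only measures how well directions agree.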
