How to make k-means distributed?

After setting up a two-node Hadoop cluster and getting acquainted with Hadoop and Python, and building on this naive implementation, I ended up with the following code:

import numpy as np

# randomize_centroids, has_converged and euclidean_dist come from the
# tutorial linked above
def kmeans(data, k, c=None):
    # use the centroids that were passed in, or seed k random ones
    if c is not None:
        centroids = c
    else:
        centroids = randomize_centroids(data, [], k)

    old_centroids = [[] for _ in range(k)]

    iterations = 0
    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1

        clusters = [[] for _ in range(k)]

        # assignment step: put every data point in its nearest cluster
        clusters = euclidean_dist(data, centroids, clusters)

        # update step: recalculate each centroid as the mean of its cluster
        for index, cluster in enumerate(clusters):
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()

    print("The total number of data instances is: " + str(len(data)))
    return centroids

I have tested it for serial execution and it works fine. How do I make it distributed on Hadoop? In other words, what should go into the mapper and what into the reducer?

Please note that, if possible, I would like to stay close to the tutorial's style, since that is an approach I already understand.

Topic: map-reduce, python, distributed, apache-hadoop, k-means

Category: Data Science


Unless you are trying to do this as a learning exercise, just use Spark, whose MLlib library provides a k-means implementation built for distributed computing. See the MLlib clustering guide.
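With MLlib's DataFrame API the whole thing is a few lines; a sketch (the HDFS path and CSV layout are assumptions, any numeric columns will do):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans").getOrCreate()

# read the raw points and pack the numeric columns into one vector column,
# which is the input format MLlib's KMeans expects
df = spark.read.csv("hdfs:///data/points.csv", inferSchema=True)
features = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

model = KMeans(k=3, featuresCol="features").fit(features)
print(model.clusterCenters())

If you do want to build it on Hadoop as a learning exercise, the split falls out of your serial code: the mapper runs the assignment step (your euclidean_dist), the reducer runs the update step (your np.mean per cluster), and a small driver re-submits the job until convergence (your has_converged loop). A minimal Hadoop Streaming sketch, assuming one whitespace-separated point per input line and a centroids.txt shipped to the workers (the file names are my choice):

mapper.py:

#!/usr/bin/env python
import sys
import numpy as np

# centroids.txt is distributed to every node via the -files option
centroids = np.loadtxt("centroids.txt")

for line in sys.stdin:
    point = np.array(line.split(), dtype=float)
    # assignment step: emit (id of the nearest centroid, the point itself)
    nearest = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
    print("%d\t%s" % (nearest, line.strip()))

reducer.py:

#!/usr/bin/env python
import sys
import numpy as np

def emit(points):
    # update step: the new centroid is the mean of the cluster's points
    centroid = np.mean(np.array(points, dtype=float), axis=0)
    print(" ".join("%f" % x for x in centroid))

current_key, points = None, []
for line in sys.stdin:
    key, value = line.strip().split("\t", 1)
    if current_key is not None and key != current_key:
        emit(points)   # Streaming sorts by key, so a key change
        points = []    # means the previous cluster is complete
    current_key = key
    points.append(value.split())

if current_key is not None:
    emit(points)

A driver script then runs something like

hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py,centroids.txt \
    -mapper mapper.py -reducer reducer.py -input points -output step1

in a loop, copying each job's output back into centroids.txt until the centroids stop moving.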
