How to make k-means distributed?

After setting up a two-node Hadoop cluster and getting acquainted with Hadoop and Python, and building on this naive implementation, I ended up with the following code:

import numpy as np

# randomize_centroids, has_converged and euclidean_dist come from the
# tutorial linked above
def kmeans(data, k, c=None):
    # use the centroids that were passed in, or seed k random ones
    if c is not None:
        centroids = c
    else:
        centroids = randomize_centroids(data, [], k)

    old_centroids = [[] for _ in range(k)]

    iterations = 0
    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1

        clusters = [[] for _ in range(k)]

        # assignment step: put every data point in its nearest cluster
        clusters = euclidean_dist(data, centroids, clusters)

        # update step: recalculate each centroid as the mean of its cluster
        for index, cluster in enumerate(clusters):
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()

    print("The total number of data instances is: " + str(len(data)))
    return centroids

I have tested it for serial execution and it works fine. How do I make it distributed on Hadoop? In other words, what should go into the mapper and what into the reducer?

Please note that, if possible, I would like to stay close to the tutorial's style, since that is an approach I already understand.

Topic: map-reduce, python, distributed, apache-hadoop, k-means

Category: Data Science


Unless you are trying to do this as a learning exercise, just use Spark, whose MLlib library provides a k-means implementation built for distributed computing. See the MLlib clustering guide.
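With MLlib's DataFrame API the whole thing is a few lines; a sketch (the HDFS path and CSV layout are assumptions, any numeric columns will do):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans").getOrCreate()

# read the raw points and pack the numeric columns into one vector column,
# which is the input format MLlib's KMeans expects
df = spark.read.csv("hdfs:///data/points.csv", inferSchema=True)
features = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

model = KMeans(k=3, featuresCol="features").fit(features)
print(model.clusterCenters())

If you do want to build it on Hadoop as a learning exercise, the split falls out of your serial code: the mapper runs the assignment step (your euclidean_dist), the reducer runs the update step (your np.mean per cluster), and a small driver re-submits the job until convergence (your has_converged loop). A minimal Hadoop Streaming sketch, assuming one whitespace-separated point per input line and a centroids.txt shipped to the workers (the file names are my choice):

mapper.py:

#!/usr/bin/env python
import sys
import numpy as np

# centroids.txt is distributed to every node via the -files option
centroids = np.loadtxt("centroids.txt")

for line in sys.stdin:
    point = np.array(line.split(), dtype=float)
    # assignment step: emit (id of the nearest centroid, the point itself)
    nearest = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
    print("%d\t%s" % (nearest, line.strip()))

reducer.py:

#!/usr/bin/env python
import sys
import numpy as np

def emit(points):
    # update step: the new centroid is the mean of the cluster's points
    centroid = np.mean(np.array(points, dtype=float), axis=0)
    print(" ".join("%f" % x for x in centroid))

current_key, points = None, []
for line in sys.stdin:
    key, value = line.strip().split("\t", 1)
    if current_key is not None and key != current_key:
        emit(points)   # Streaming sorts by key, so a key change
        points = []    # means the previous cluster is complete
    current_key = key
    points.append(value.split())

if current_key is not None:
    emit(points)

A driver script then runs something like

hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py,centroids.txt \
    -mapper mapper.py -reducer reducer.py -input points -output step1

in a loop, copying each job's output back into centroids.txt until the centroids stop moving.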
