What Framework To Use for Asynchronous Algorithms?

I have a problem with an extremely large dataset (who doesn't?) that is stored in chunks with low variance across chunks (i.e., each chunk is roughly representative of the whole). I want to play around with algorithms that do classification in an asynchronous fashion, and I want to code it up myself.

In pseudocode, it would look something like this:

start a master
distribute 10 chunks across 10 slaves
while some stopping criterion is not met:
    each slave s:
        classifies its local chunk inexactly with some iterative algorithm
        reports the resulting classifier to the master
    the master waits for any 2 slaves to report, averages their
    classifiers, and sends the average back for those slaves to continue
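
To make the pattern concrete, here is a rough single-machine prototype in Python, with threads standing in for the slaves. Everything in it (train_step, converged, the dummy update) is a hypothetical placeholder, not code from any particular framework:

from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np

N_WORKERS = 10   # one worker per chunk
DIM = 5          # dimensionality of the (hypothetical) linear classifier

def train_step(weights, chunk_id):
    # stand-in for "classify the chunk inexactly with an iterative algorithm"
    rng = np.random.default_rng(chunk_id)
    return weights - 0.01 * rng.normal(size=DIM), chunk_id

def converged(round_no):
    return round_no >= 20  # stand-in for a real stopping criterion

weights = np.zeros(DIM)
round_no = 0
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    pending = {pool.submit(train_step, weights, c) for c in range(N_WORKERS)}
    while not converged(round_no):
        # the master waits for ANY two workers to report ...
        done = []
        for fut in as_completed(pending):
            done.append(fut)
            if len(done) == 2:
                break
        pending -= set(done)
        # ... averages their classifiers ...
        weights = sum(f.result()[0] for f in done) / len(done)
        # ... and sends the average back for those workers to continue
        for f in done:
            pending.add(pool.submit(train_step, weights, f.result()[1]))
        round_no += 1

The part I don't know how to do is exactly this loop, but with the workers on separate machines instead of threads.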

Which framework should I use: Hadoop, Spark, or something else?

If I were doing this in pure C, I would use pthreads and have very fine-grained control over threads, locks, and mutexes. Is there an analogous framework in this distributed data-science environment?
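
To be concrete about the kind of control I mean: Python's threading module exposes the same mutex/condition-variable primitives as pthreads, so the "wait for any 2 reports" step could be sketched like this (report and wait_for_two_and_average are just illustrative names):

import threading

results = []                  # classifiers reported by worker threads
cond = threading.Condition()  # a mutex paired with a condition variable

def report(model):
    # a worker calls this when its inexact pass over a chunk finishes
    with cond:                # acquire the mutex
        results.append(model)
        cond.notify()         # wake the waiting master

def wait_for_two_and_average():
    with cond:
        while len(results) < 2:   # the classic pthread_cond_wait loop
            cond.wait()           # releases the mutex while sleeping
        reported = [results.pop(), results.pop()]
    return sum(reported) / len(reported)

Across machines there is no shared mutex to acquire, which is why I'm asking about frameworks.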

Spark is one of the most established distributed computing frameworks. It ships with MLlib, a machine-learning library that includes many classification algorithms.
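
For example, a minimal MLlib training job might look like the sketch below; the input path is a placeholder, and the data is assumed to already be in MLlib's expected shape, with a label column and a features vector column (e.g., built with pyspark.ml.feature.VectorAssembler):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("chunked-classification").getOrCreate()

# placeholder path; expects columns "label" and "features"
train = spark.read.parquet("hdfs:///path/to/chunks")

# one of MLlib's many classifiers; training is distributed across the cluster
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

print(model.coefficients)  # inspect the fitted linear classifier
spark.stop()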
