What Framework To Use for Asynchronous Algorithms?
I have a problem with an extremely large dataset (who doesn't?) which is stored in chunks such that there is low variance across chunks (i.e., the chunks are roughly representative of the whole). I want to experiment with algorithms that do classification in an asynchronous fashion, but I'd like to code them up myself.
The pseudocode would look something like this:
start a master
distribute 10 chunks to 10 slaves
while some criterion is not met:
    for each slave s:
        s classifies its chunk inexactly with some iterative algorithm
        s reports its classifier back to the master
    master waits for any 2 slaves to report, averages their classifiers,
    and sends the average back for those slaves to continue
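A minimal sketch of that loop, using plain Python threads as stand-ins for distributed workers (all names and the toy "classifier" here are illustrative assumptions, not any framework's API; a real deployment would replace the queues with network communication):

```python
import threading
import queue
import random

def make_worker(wid, chunk, result_q, param_q):
    """One 'slave': waits for parameters, does an inexact step, reports back."""
    def run():
        while True:
            w = param_q.get()             # wait for the classifier from the master
            if w is None:                 # sentinel: shut down
                break
            # one inexact iterative step: nudge w toward this chunk's mean
            # (a toy stand-in for a real partial classification update)
            w += 0.5 * (sum(chunk) / len(chunk) - w)
            result_q.put((wid, w))        # report the updated classifier
    return threading.Thread(target=run)

def master(n_workers=4, n_rounds=5):
    # toy chunks with low variance across chunks, as in the question
    chunks = [[random.gauss(1.0, 0.1) for _ in range(50)]
              for _ in range(n_workers)]
    result_q = queue.Queue()                       # slaves -> master
    param_qs = [queue.Queue() for _ in range(n_workers)]  # master -> each slave
    workers = [make_worker(i, chunks[i], result_q, param_qs[i])
               for i in range(n_workers)]
    for t in workers:
        t.start()
    for q in param_qs:
        q.put(0.0)                        # broadcast the initial classifier
    for _ in range(n_rounds):
        # asynchronous step: wait for ANY two slaves to report
        i, wi = result_q.get()
        j, wj = result_q.get()
        avg = (wi + wj) / 2.0             # average their classifiers
        param_qs[i].put(avg)              # send the average back to those two
        param_qs[j].put(avg)
    for q in param_qs:
        q.put(None)                       # stop all workers
    for t in workers:
        t.join()
    return avg

if __name__ == "__main__":
    print(master())
```

The key design point is that the master blocks only on the shared result queue, so it proceeds as soon as any two workers finish, rather than barrier-synchronizing all of them each round.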
What framework should I use? Hadoop, Spark, or something else?
If I were doing this in pure C, I would use pthreads and have very fine-grained control over threads, locks, and mutexes. Is there an analogous framework in this distributed data-science environment?
Topic algorithms
Category Data Science