Can all statistical algorithms be parallelized using a MapReduce framework?

Is it correct to say that any statistical learning algorithm (linear/logistic regression, SVM, neural network, random forest) can be implemented inside a Map Reduce framework? Or are there restrictions?

I guess there may be some algorithms that are not possible to parallelize?

Topic: map-reduce, apache-hadoop, machine-learning

Category: Data Science


Whether or not all tractable problems can be parallelized efficiently (efficiently = achieving a speed-up) is an open problem in computer science. Assuming that NC != P, not all algorithms can be implemented in MapReduce in a way that yields a speed gain.
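As a clarifying note (standard complexity-theory facts, not part of the original answer), the containment in question is

$$\mathsf{NC} \subseteq \mathsf{P},$$

and whether this containment is strict is the open problem referred to above: if $\mathsf{NC} \neq \mathsf{P}$, then P-complete problems admit no parallel algorithm running in polylogarithmic time on polynomially many processors, i.e. they are inherently sequential in that sense.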

See the Wikipedia page on NC (complexity).


No.

MapReduce, by definition, works on one record at a time.

Thus, you cannot compute a distance/similarity matrix without abusing the programming model. Methods such as SVM do require pairwise similarities.

Such abuses are really common, though (which is why map reduce is dead). The simplest is to map every object to the key 0 and do all the work in the "reducer". More clever approaches divide the data into k partitions, so at least some of the work can run in parallel.
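To make the "key 0" abuse concrete, here is a minimal in-process sketch in Python (my own illustration of the idea, not actual Hadoop code): the map phase tags every record with the same key, so the shuffle sends the entire dataset to a single reduce call, which then does all the real work sequentially.

```python
# Illustrative sketch of the "key 0" abuse: the map phase defeats parallelism
# by giving every record the same key, so one reducer sees everything.
from collections import defaultdict
from itertools import combinations

def map_phase(records):
    # Emit every record under key 0 -- nothing can run in parallel downstream.
    for r in records:
        yield (0, r)

def shuffle(pairs):
    # Group values by key, as the MapReduce shuffle would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # All records arrive here; e.g. compute pairwise (dis)similarities in one place --
    # exactly the kind of work the record-at-a-time model cannot express directly.
    return [(a, b, abs(a - b)) for a, b in combinations(values, 2)]

data = [1.0, 4.0, 2.5, 7.0]
for key, values in shuffle(map_phase(data)).items():
    print(key, reduce_phase(key, values))
```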

But all of these approaches are ugly hacks that work around a deliberate design decision in MapReduce: every record must be processed independently, so that the platform can automatically manage parallelism and recovery. In the above hack, the map phase is abused to split the data into a manually chosen number of partitions, and the reducers are then just ordinary parallel jobs. It's no longer MapReduce, but a generic model of parallel programming on partitioned data.

After all the early hype, MapReduce is now dead. Tools such as Apache Mahout no longer accept MapReduce implementations and expect you to move on to more flexible paradigms that don't need such abusive hacks.


I would suggest that you can implement pretty much any kind of data processing you want in MapReduce, given enough time to code it, but the degree of parallelisation you get will vary depending on what your data is and what you're doing to it.

I can imagine a simple scenario where parallelisation would be reduced dramatically. For example, a simple JOIN task with 4 mapper nodes and 4 reducer nodes will be highly parallelised if the join keys (1, 2, 3 and 4) are evenly distributed across the mapper nodes and occur in equal numbers (4 of each): each of the 4 reducers then gets an equal share of the work for its join key:

Mapper    1 2 3 4   |   1 2 3 4   |   1 2 3 4   |   1 2 3 4
             / \         / \            / \           / \
            v   v       v   v          v   v         v   v

Reducer   1 1 1 1   |   2 2 2 2   |  3 3 3 3    |   4 4 4 4

However, if you have a situation where most of your join keys are the same (e.g. nearly all 1s), the reducer dealing with the 1s will get swamped:

Mapper    1 1 1 1   |   1 1 1 1   |   1 1 1 1   |   1 1 1 2
             /          /            /             /     /
            v          v            v             v     v

Reducer   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  |  2  |  nothing!  |  nothing!

Because one reducer is totally swamped while the others do nothing, you lose most of your parallelisation: the job stalls while Hadoop waits for that one reducer to finish.
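To make the skew concrete, here is a small, purely illustrative Python sketch (my own, not tied to any particular Hadoop version) that assigns join keys to 4 reducers with a simple hash-mod rule, standing in for the default hash partitioner, and reports how many records each reducer receives for a balanced versus a skewed key distribution:

```python
# Illustrative sketch: reducer load under hash partitioning for balanced vs. skewed keys.
from collections import Counter

NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    # Stand-in for the usual "hash the key, mod number of reducers" assignment.
    return hash(key) % num_reducers

balanced_keys = [1, 2, 3, 4] * 4    # keys evenly spread across records
skewed_keys   = [1] * 15 + [2]      # nearly all records share join key 1

for name, keys in [("balanced", balanced_keys), ("skewed", skewed_keys)]:
    load = Counter(partition(k) for k in keys)
    print(name, {r: load.get(r, 0) for r in range(NUM_REDUCERS)})
# In the skewed case one reducer gets ~15 records while the others get 0 or 1,
# so the whole job finishes only as fast as that overloaded reducer.
```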

Caveat: There may be more sophisticated MapReduce implementations out there which deal with this kind of problem; I only know the basics of Hadoop.

Also, I know this has nothing to do with statistical learning algorithms (which I know almost nothing about), but I imagine the principle still holds: some algorithms or data just won't divide into highly parallel sub-tasks.


Indeed there are:

  • Gradient Boosting is sequential by construction: each new model is fit to the residuals left by the previous ones, so the boosting iterations themselves cannot be parallelised (see the sketch after this list).
  • Generalized Linear Models need all the data at the same time, although technically you can parallelise some of the inner linear algebra nuts and bolts.
  • Support Vector Machines, which (as noted in another answer) require pairwise similarities.
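As a minimal sketch of why boosting is sequential (my own illustration: plain Python, squared-error loss, with a trivial constant "weak learner" standing in for a tree), each stage needs the residuals of the ensemble built so far, so stage t cannot start before stage t-1 finishes:

```python
# Illustrative-only gradient boosting loop (squared error, constant weak learners):
# each stage fits the residuals left by all previous stages, so the stages form a
# strict sequential chain and cannot be distributed as independent map tasks.
def gradient_boost(y, n_stages=5, learning_rate=0.5):
    prediction = [0.0] * len(y)
    stages = []
    for t in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]   # depends on stages 0..t-1
        weak_learner = sum(residuals) / len(residuals)           # trivial "model": mean residual
        prediction = [pi + learning_rate * weak_learner for pi in prediction]
        stages.append(weak_learner)
    return stages, prediction

y = [1.0, 2.0, 3.0, 4.0]
stages, fitted = gradient_boost(y)
print(stages)   # each value was only computable after the previous one
print(fitted)   # predictions converge toward the mean of y
```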
