Data science and MapReduce programming model of Hadoop

Question

Data science and MapReduce programming model of Hadoop

jithinjustin

2016年8月5日 09:30

What are the different classes of data science problems that can be solved using mapreduce programming model?

Topic map-reduce apache-hadoop

Category Data Science

AceRymond · Accepted Answer · 2014年7月30日 15:32

There is a paper you should look into:

MapReduce: Distributed Computing for Machine Learning

They distinguish 3 classes of machine-learning problems that are reasonable to address with MapReduce:

Single pass algorithms
Iterative algorithms
Query based algorithms

They also give examples for each class.

Emre · Accepted Answer · 2014年7月30日 07:49

map/reduce is most appropriate for parallelizable offline computations. To be more precise, it works best when the result can be found from the result of some function of a partition of the input. Averaging is a trivial example; you can do this with map/reduce by summing each partition, returning the sum and the number of elements in the partition, then computing the overall mean using these intermediate results. It is less appropriate when the intermediate steps depend on the state of the other partitions.

ffriend · Accepted Answer · 2014年7月29日 13:43

Let's first split it into parts.

Data Science is about making knowledge from raw data. It uses machine learning, statistics and other fields to simplify (or even automate) decision making. Data science techniques may work with any data size, but more data means better predictions and thus more precise decisions.

Hadoop is a common name for a set of tools intended to work with large amounts of data. Two most important components in Hadoop are HDFS and MapReduce.

HDFS, or Hadoop Distributed File System, is a special distributed storage capable of holding really large data amounts. Large files on HDFS are split into blocks, and for each block HDFS API exposes its location.

MapReduce is framework for running computations on nodes with data. MapReduce heavily uses data locality exposed by HDFS: when possible, data is not transferred between nodes, but instead code is copied to the nodes with data.

So basically any problem (including data science tasks) that doesn't break data locality principle may be efficiently implemented using MapReduce (and a number of other problems may be solved not that efficiently, but still simply enough).

Let's take some examples. Very often analyst only needs some simple statistics over his tabular data. In this case Hive, which is basically SQL engine over MapReduce, works pretty well (there are also Impala, Shark and others, but they don't use Hadoop's MapReduce, so more on them later).

In other cases analyst (or developer) may want to work with previously unstructured data. Pure MapReduce is pretty good for transforming and standardizing data.

Some people are used to exploratory statistics and visualization using tools like R. It's possible to apply this approach to big data amounts using RHadoop package.

And when it comes to MapReduce-based machine learning Apache Mahout is the first to mention.

There's, however, one type of algorithms that work pretty slowly on Hadoop even in presence of data locality, namely, iterative algorithms. Iterative algorithms tend to have multiple Map and Reduce stages. Hadoop's MR framework reads and writes data to disk on each stage (and sometimes in between), which makes iterative (as well as any multi-stage) tasks terribly slow.

Fortunately, there are alternative frameworks that can both - use data locality and keep data in memory between stages. Probably, the most notable of them is Apache Spark. Spark is complete replacement for Hadoop's MapReduce that uses its own runtime and exposes pretty rich API for manipulating your distributed dataset. Spark has several sub-projects, closely related to data science:

Shark and Spark SQL provide alternative SQL-like interfaces to data stored on HDFS
Spark Streaming makes it easy to work with continuous data streams (e.g. Twitter feed)
MLlib implements a number of machine learning algorithms with a pretty simple and flexible API
GraphX enables large-scale graph processing

So there's pretty large set of data science problems that you can solve with Hadoop and related projects.

score 1 · Accepted Answer · 2014年7月28日 16:49

Data Science has many different sub-areas as described in my post). Nearly for each area, scientists and developer has significant contributions. To learn more about what can be done, please look at following websites:

Data Mining Algorithms & Machine Learning -> Apache Mahout
Statistics -> RHadoop
Data Warehousing & Database Querying -> SQL-MapReduce
Social Network Analysis -> Article
Bio-informatics -> Article - 1 , Article - 2

Also, there are some work on MapReduce + Excel + Cloud combination but I have not found the link.

What are the different classes of Data Science problems ...

Each "classes" is not purely homogeneous problem domain, i.e. some problem cannot be solved via map and reduce approach due to its communication cost, or algorithm behavior. What I mean by behavior is that some problem wants to have control on all data sets instead of chunks. Thus, I refuse to list type of problem "classes".

Do not forget that knowing what MapReduce can do is not enough for Data Science. You should also aware of What MapReduce can't do, too.

Data science and MapReduce programming model of Hadoop

About