What are the use cases for Apache Spark vs Hadoop?

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark than with Hadoop.

Topic cloud-computing apache-spark knowledge-base distributed apache-hadoop

Category Data Science


It would be fair to compare Spark with MapReduce, Hadoop's processing framework. In the majority of cases, Spark may outperform MapReduce: it enables in-memory data processing, which can make processing up to 100 times faster. For this reason, Spark is the preferred option if you need insights quickly, for example, if you need to

  • run customer analytics, e.g. compare the behavior of a customer with the behavior patterns of a particular customer segment and trigger certain actions;
  • manage risks and forecast various possible scenarios;
  • detect fraud in real time;
  • run industrial big data analytics and predict anomalies and machine failures.

However, MapReduce is good at processing truly huge datasets (if you are fine with the time required for processing). It is also the more economical solution, as MapReduce reads from and writes to disk, and disk is generally cheaper than memory.
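To make the "insights quickly" point concrete, here is a minimal sketch of keeping a dataset in memory and answering several questions from it. It assumes a SparkContext `sc` (as provided by spark-shell) and a made-up CSV of (customerId, amount) transaction records on HDFS; the path and thresholds are purely illustrative.

```scala
// Cache the parsed records once, then answer several questions from memory
// instead of rereading HDFS for every query.
val transactions = sc.textFile("hdfs:///data/transactions.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).toDouble))   // (customerId, amount)
  .cache()

// Each of these reuses the cached RDD.
val spendPerCustomer = transactions.reduceByKey(_ + _)          // total spend per customer
val bigSpenders      = spendPerCustomer.filter(_._2 > 10000.0)  // a segment of interest
val avgTicket        = transactions.values.mean()               // overall average amount

println(s"Average transaction: $avgTicket, big spenders: ${bigSpenders.count()}")
```

The equivalent MapReduce workflow would typically read the raw data from disk once per query, which is where much of the speed difference comes from.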


Good info @Sean Owen. I'd like to add one more point: Spark can help build unified data pipelines in a Lambda architecture, covering both the batch and streaming layers with the ability to write to a common serving layer. Being able to reuse the same logic between batch and streaming is a huge advantage. Also, the streaming k-means algorithm in Spark 1.3 is an added plus for ML, on top of the excellent job monitoring and process visualizations in 1.4.
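A rough sketch of that batch/streaming reuse, assuming a SparkContext `sc` is available; the log paths, hostname, and the `errorCounts` helper are hypothetical names chosen for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Business logic written once against RDDs...
def errorCounts(lines: RDD[String]): RDD[(String, Long)] =
  lines.filter(_.contains("ERROR"))
       .map(line => (line.split(" ")(0), 1L))
       .reduceByKey(_ + _)

// ...applied to the batch layer,
val batchResult = errorCounts(sc.textFile("hdfs:///logs/archive"))

// ...and to the streaming layer: DStream.transform exposes each micro-batch as an RDD,
// so the same function can be reused unchanged.
val ssc = new StreamingContext(sc, Seconds(10))
val liveResult = ssc.socketTextStream("loghost", 9999).transform(rdd => errorCounts(rdd))
liveResult.print()

ssc.start()
ssc.awaitTermination()
```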


Machine learning is a good example of a problem type where Spark-based solutions are light-years ahead of MapReduce-based solutions, despite the young age of Spark-on-YARN.


Hadoop means HDFS, YARN, MapReduce, and a lot of other things. Do you mean Spark vs MapReduce? Because Spark runs on/with Hadoop, which is rather the point.

The primary reason to use Spark is speed, and this comes from the fact that its execution can keep data in memory between stages rather than always persisting it back to HDFS after each map or reduce. This advantage is very pronounced for iterative computations, which have tens of stages, each of which touches the same data. This is where things might be "100x" faster. For the simple, one-pass, ETL-like jobs that MapReduce was designed for, it is not in general faster.
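As an illustration of such an iterative computation, here is a sketch along the lines of the well-known Spark PageRank example, assuming a SparkContext `sc` and a hypothetical edge-list file ("srcPage dstPage" per line). After the first pass, every iteration reads the cached links from memory rather than from HDFS, whereas a chain of MapReduce jobs would re-read and re-write HDFS on each pass.

```scala
val links = sc.textFile("hdfs:///data/links.txt")
  .map { line => val parts = line.split(" "); (parts(0), parts(1)) }
  .groupByKey()
  .cache()                                          // reused on every iteration

var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 10) {                                // ten passes over the same in-memory data
  val contribs = links.join(ranks).values.flatMap {
    case (dests, rank) => dests.map(d => (d, rank / dests.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile("hdfs:///out/pageranks")
```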

Another reason to use Spark is its nicer, higher-level language compared to MapReduce. It provides a functional-programming-style view, modeled on Scala, which is far nicer than writing MapReduce code. (Although you have to either use Scala or adopt the slightly less developed Java or Python APIs for Spark.) Crunch and Cascading already provide a similar abstraction on top of MapReduce, but this is still an area where Spark is nice.
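For a feel of how concise that API is, the classic word count is a handful of lines in Spark's Scala API, versus a hand-written Mapper/Reducer pair plus job setup in MapReduce. Paths here are hypothetical and a SparkContext `sc` is assumed:

```scala
val counts = sc.textFile("hdfs:///data/input")
  .flatMap(_.split("\\s+"))      // split lines into words
  .map(word => (word, 1))        // emit (word, 1) pairs
  .reduceByKey(_ + _)            // sum counts per word
counts.saveAsTextFile("hdfs:///data/word-counts")
```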

Finally, Spark has as-yet-young but promising subprojects for ML, graph analysis, and streaming, which expose a similar, coherent API. With MapReduce, you would have to turn to several different projects for this (Mahout, Giraph, Storm). It's nice to have it all in one package, albeit not yet fully 'baked'.
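As a small taste of that coherence, here is a sketch of MLlib (one of those subprojects) clustering an RDD of feature vectors, using the same RDD abstraction as everywhere else in Spark. The path and parameter values are made up for illustration:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One numeric feature vector per line, space-separated.
val features = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()                                  // k-means is itself iterative

val model = KMeans.train(features, 5, 20)   // k = 5 clusters, 20 iterations
model.clusterCenters.foreach(println)
```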

Why would you not use Spark? Paraphrasing myself:

  • Spark is primarily Scala, with ported Java APIs; MapReduce might be friendlier and more native for Java-based developers
  • There is more MapReduce expertise out there now than Spark
  • For the data-parallel, one-pass, ETL-like jobs MapReduce was designed for, MapReduce is lighter-weight compared to the Spark equivalent
  • Spark is fairly mature, and so is YARN now, but Spark-on-YARN is still pretty new; the two may not be optimally integrated yet. For example, until recently I don't think Spark could ask YARN for allocations based on the number of cores. That is, MapReduce might be easier to understand, manage, and tune

Not sure about YARN, but I think Spark makes a real difference compared to Hadoop (advertised as 100 times faster) if the data fits nicely in the memory of the compute nodes, simply because it avoids hard-disk access. If the data doesn't fit in memory, there is still some gain because of buffering.
