What is the main difference between Hadoop and Spark?

I recently read the following about Hadoop vs. Spark:

Insist upon in-memory columnar data querying. This was the killer-feature that let Apache Spark run in seconds the queries that would take Hadoop hours or days. Memory is much faster than disk access, and any modern data platform should be optimized to take advantage of that speed. Also, columnar data storage greatly reduces the amount of memory spent on empty or redundant data.

Can someone explain: 1) what Apache Hadoop and Spark are, 2) how they differ, and 3) how this relates to memory vs. disk access?



Hadoop is a framework for the distributed storage and processing of big data. Storage is provided by the Hadoop Distributed File System (HDFS), which spreads data across a cluster of "nodes" and can be set up to be fault tolerant. Because the data is stored across multiple nodes, it can be processed in parallel, and Hadoop uses the MapReduce programming model to do so. Roughly, each node in the cluster reads the data it needs from disk and performs the necessary computation (the map step); the partial results are then aggregated (the reduce step) and returned.
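To make the map/aggregate idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API: the `partitions` list is a hypothetical stand-in for blocks of a file stored on three nodes, and `map_phase`, `shuffle` and `reduce_phase` are illustrative names chosen for this example.

```python
from collections import defaultdict

# Hypothetical stand-in for blocks of one file stored on three different nodes.
partitions = [
    ["spark hadoop spark"],
    ["hadoop hdfs"],
    ["spark hdfs hdfs"],
]

def map_phase(lines):
    """Runs against one node's local data: emit a (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_outputs):
    """Group the emitted values by key across all nodes."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the grouped values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

# On a real cluster the map calls would run in parallel, one per node.
mapped = [map_phase(partition) for partition in partitions]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster each map call would run on the node that holds the block, and only the much smaller intermediate pairs would travel over the network.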

Spark is also a distributed processing framework for big data, but it does not provide storage. Consequently it needs to work on top of a distributed storage layer, which could be Hadoop's HDFS. Spark is designed as an in-memory engine: it can keep intermediate results in memory instead of writing them to disk between steps, and is therefore usually much faster than MapReduce on Hadoop. It is also fault tolerant. Spark includes its own SQL layer (Spark SQL), which makes querying the underlying data much easier than writing MapReduce jobs on Hadoop. In practice, Spark is often still limited by memory availability when the underlying data is too large, so there are still some disk-bound operations, but at least in theory it is capable of 100% in-memory, real-time analytics. This is not possible with Hadoop MapReduce.
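The memory vs. disk point can be illustrated with a toy sketch in plain Python (this is neither Spark nor Hadoop code). It assumes a multi-pass job over the same small dataset: one version goes back to disk for every pass, the other loads the data once and reuses it from memory. The file name, the `stats` counter and both job functions are hypothetical names made up for this example.

```python
import os
import tempfile

# Write a small hypothetical dataset (the numbers 1..10) to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    f.write("\n".join(str(n) for n in range(1, 11)))

stats = {"disk_reads": 0}  # counts how often the file is read

def load_from_disk():
    stats["disk_reads"] += 1
    with open(path) as f:
        return [int(line) for line in f]

def disk_bound_job(passes):
    """Every pass goes back to disk for its input."""
    return [sum(x * k for x in load_from_disk()) for k in range(1, passes + 1)]

def in_memory_job(passes):
    """The input is read once and then reused from memory."""
    data = load_from_disk()
    return [sum(x * k for x in data) for k in range(1, passes + 1)]

stats["disk_reads"] = 0
disk_result = disk_bound_job(3)
reads_disk_bound = stats["disk_reads"]

stats["disk_reads"] = 0
memory_result = in_memory_job(3)
reads_in_memory = stats["disk_reads"]

print(disk_result, reads_disk_bound)
print(memory_result, reads_in_memory)
```

Both versions compute the same result; the difference is only in how often the slow storage layer is touched, which is where the speed-up for iterative or interactive workloads comes from.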

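Both Hadoop and Spark can work with columnar data. Spark leans columnar by default (the default file format for its DataFrames is Parquet), whereas in Hadoop the underlying data is often row-oriented, such as CSV files, although Hadoop also supports columnar formats such as Parquet and ORC. Data stored in a columnar format is typically much faster to query when a query touches only a few of the columns, because the other columns never have to be read.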
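As a rough illustration of why the layout matters, the plain-Python sketch below stores the same made-up records row by row and column by column, then sums a single field. It also run-length encodes a repetitive column to show how columnar layouts can shrink redundant data. The sample records and function names are assumptions for this example; real formats such as Parquet and ORC are far more sophisticated.

```python
from itertools import groupby

# Row-oriented layout: each record is stored together (hypothetical data).
rows = [
    ("alice", "DE", 10.0),
    ("bob", "DE", 20.0),
    ("carol", "DE", 5.0),
    ("dave", "US", 7.5),
    ("erin", "US", 2.5),
]

# Column-oriented layout: the same data, one list per column.
names, countries, amounts = (list(column) for column in zip(*rows))

def total_from_rows(rows):
    """Sum the amount field; every full record is read to reach it."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[2]
    return total, values_read

def total_from_column(amounts):
    """Sum the amount column; no other column is touched."""
    return sum(amounts), len(amounts)

def run_length_encode(column):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

row_total, row_values_read = total_from_rows(rows)
col_total, col_values_read = total_from_column(amounts)
encoded_countries = run_length_encode(countries)

print(row_total, row_values_read)
print(col_total, col_values_read)
print(encoded_countries)
```

The totals are identical, but the columnar version reads a third of the values here, and the gap widens as tables get wider.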

Hadoop is designed for large-scale batch processing. Spark is also designed for batch processing but can handle streaming and other data flows as well. Both are scalable technologies, but they scale differently. With Hadoop, processing time grows approximately linearly as the data size increases. Spark will generally be faster than Hadoop for similarly sized data, but only while the data fits in the memory available in the cluster; once that memory is exceeded, its performance deteriorates much faster than Hadoop's.

Because of the memory a Spark cluster needs, it will generally cost more per hour of uptime to operate than a Hadoop cluster. However, for smaller datasets the much faster in-memory processing may completely offset the higher cost of compute time.

Some of the above are purposely general statements; there could be edge cases and other situations where they don't apply.
