Which of these tasks will benefit the most from Spark?

My company processes data (I am only an intern). We primarily use Hadoop, and we're starting to deploy Spark in production. Currently we have two jobs, and we will choose just one to begin with on Spark. The tasks are:

  1. The first job analyzes a large quantity of log text, searching for ERROR messages (essentially a grep).
  2. The second job does machine learning: it computes predictive models over some data in an iterative way.
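To make the difference concrete, here is a toy sketch in plain Python (not Spark code; the log lines and the learning loop are made-up placeholders) of the two access patterns: the grep job makes a single pass over its data, while the iterative model re-reads the same data on every iteration.

```python
# Toy illustration (plain Python, not Spark): the two jobs have very
# different data-access patterns.

log_lines = [
    "INFO  starting service",
    "ERROR disk full",
    "WARN  retrying request",
    "ERROR connection refused",
]

# Job 1: grep-style scan -- a single pass over the data.
errors = [line for line in log_lines if "ERROR" in line]
print(len(errors))  # 2

# Job 2: iterative learning -- the SAME data is read on every iteration.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs
w = 0.0
for _ in range(100):          # many passes over the same dataset
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    w -= 0.05 * grad          # gradient-descent step for y ~ w * x
print(round(w, 1))            # learned slope, roughly 2.0
```

The second pattern is where keeping the dataset in memory between passes pays off.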

My question is: which of the two jobs will benefit the most from Spark?

Spark relies on in-memory processing, so I think it is better suited to the machine learning job. On the other hand, the quantity of data there isn't that large compared to the log job. But I'm not sure. Can someone help me, or point out anything I have neglected?

Topic apache-spark map-reduce apache-hadoop

Category Data Science


I think the second job will benefit more from Spark than the first one. The reason is that machine learning and predictive modeling often run multiple iterations over the same data.

As you mentioned, Spark can keep data in memory between iterations, while Hadoop MapReduce has to write intermediate results to the file system and read them back on every pass.
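Here is a toy illustration of why that matters, in plain Python rather than the actual Spark API: if each iteration reloads its input from storage, the load cost is paid on every pass; if the data is kept in memory after the first read (which is what `rdd.cache()` enables in Spark), it is paid once. The `load_from_disk` counter below is just a stand-in for HDFS I/O.

```python
# Toy illustration (plain Python): per-iteration disk reads vs. a
# one-time cached load. `load_from_disk` is a stand-in for HDFS I/O.

reads = {"count": 0}

def load_from_disk():
    reads["count"] += 1           # pretend this is an expensive HDFS read
    return [1.0, 2.0, 3.0, 4.0]

iterations = 10

# MapReduce-style: each iteration reads its input from the file system.
reads["count"] = 0
for _ in range(iterations):
    data = load_from_disk()
    total = sum(data)             # some per-iteration computation
mapreduce_reads = reads["count"]  # one read per iteration: 10

# Spark-style: load once, keep the dataset in memory between iterations.
reads["count"] = 0
cached = load_from_disk()
for _ in range(iterations):
    total = sum(cached)
spark_reads = reads["count"]      # a single read: 1

print(mapreduce_reads, spark_reads)
```

With ten iterations the disk is hit ten times in the first style and once in the second; for a real ML job with many iterations over a sizeable dataset, that difference dominates the runtime.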

Here is a good comparison of the two frameworks:

https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce

