Choosing between Storm+Trident-ML, Storm+SAMOA or Spark Streaming+MLlib

I want to implement streaming Naive Bayes in a distributed system. What is the best approach to choosing a framework? Should I choose:

  1. Storm alone, implementing streaming Naive Bayes myself in a Storm topology
  2. Storm + Trident-ML
  3. Storm + SAMOA
  4. Spark Streaming + MLlib

Which combination of frameworks is the best one to choose and start working with? Any suggestions would be a great help.

Topic apache-spark classification distributed data-stream-mining machine-learning

Category Data Science


It depends. If you need a fast way to mine data streams and train models adaptively as new data arrives, the best tool is SAMOA, because it can be easily integrated with the Storm or S4 stream processing engines. If you only need to process batch data in a fast, distributed manner, Spark MLlib would be the best choice among them.


If I were you, I would pick whichever of these frameworks I am most comfortable with and implement the use case. Spark Streaming + MLlib should work and would be my choice, since its user base is growing and it is one of the most popular projects under the Apache umbrella, with solid enterprise backing: both Cloudera and Hortonworks provide enterprise-level support. In theory Spark Streaming lags behind Storm in stream processing, since it processes data in small batches rather than one tuple at a time, but the nice thing about the framework is that it lets you do streaming, ordinary map and reduce, graph processing and SQL all in one place. So once you have a pipeline that converts your data to RDDs, you are set for most common data analysis jobs. Spark is written from scratch in Scala, a very powerful language that scales well when handling concurrency in a distributed setup. Hope this helps; feel free to reach out to me with any questions you have.
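Whichever framework you pick, it may help to see how little state streaming Naive Bayes needs. The following is only a minimal sketch in plain Python, not the API of Storm, Trident-ML, SAMOA or MLlib: the class name `StreamingNaiveBayes` and its methods `update`, `merge` and `predict` are hypothetical names made up for illustration, and it assumes a multinomial model over tokens with Laplace smoothing. The point it shows is that the model is just a set of counts, so each example can be folded in one at a time, and counts kept by separate workers can be added together.

```python
import math
from collections import Counter, defaultdict


class StreamingNaiveBayes:
    """Hypothetical multinomial Naive Bayes whose state is a set of additive counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant (assumed)
        self.class_counts = Counter()               # label -> examples seen
        self.feature_counts = defaultdict(Counter)  # label -> token -> count
        self.vocabulary = set()                     # every token seen so far

    def update(self, tokens, label):
        """Fold one labelled example into the counts."""
        self.class_counts[label] += 1
        self.feature_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def merge(self, other):
        """Add another worker's counts into this model and return it."""
        self.class_counts.update(other.class_counts)
        for label, counts in other.feature_counts.items():
            self.feature_counts[label].update(counts)
        self.vocabulary |= other.vocabulary
        return self

    def predict(self, tokens):
        """Return the label with the highest log-posterior, or None if untrained."""
        total = sum(self.class_counts.values())
        # max(..., 1) avoids a zero denominator if only empty examples were seen
        vocab_size = max(len(self.vocabulary), 1)
        best_label, best_score = None, -math.inf
        for label, n_examples in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_examples / total)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Two workers each see part of the stream, then their counts are combined.
worker_a = StreamingNaiveBayes()
worker_b = StreamingNaiveBayes()
worker_a.update(["cheap", "pills", "offer"], "spam")
worker_a.update(["meeting", "agenda"], "ham")
worker_b.update(["cheap", "offer", "now"], "spam")
worker_b.update(["project", "meeting", "notes"], "ham")

model = worker_a.merge(worker_b)
print(model.predict(["cheap", "offer"]))
```

In a Storm topology the `update` step would roughly correspond to what a bolt does per tuple, and in Spark Streaming the `merge` step to a reduce over each batch, but how you wire that up depends on the framework and is not shown here.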
