Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to select from for implementing the required analyses:

  1. One approach is to extract the data from MongoDB into a Hadoop cluster and do the analysis entirely on the Hadoop platform. However, this requires considerable investment in preparing the platform (software and hardware) and in training the team to work with Hadoop and write map-reduce tasks for it.

  2. The other approach is to put our effort into designing the map-reduce algorithms and run them on MongoDB's built-in map-reduce functionality. This way, we can create an initial prototype of the final system that can generate the reports. I know that MongoDB's map-reduce is much slower than Hadoop's, but currently the data is not big enough for that to be a bottleneck, at least not for the next six months.

The question is: if we take the second approach and write the algorithms for MongoDB, can they later be ported to Hadoop with little modification and algorithm redesign? MongoDB only supports JavaScript, but programming-language differences are easy to handle. However, are there any fundamental differences between the map-reduce models of MongoDB and Hadoop that may force us to redesign the algorithms substantially when porting to Hadoop?

Topic mongodb map-reduce apache-hadoop scalability

Category Data Science


There will definitely be a translation task at the end if you prototype using just MongoDB.

When you run a MapReduce task on MongoDB, the data source and structure are built in. When you eventually convert to Hadoop, your data structures might not look the same. You could leverage the mongodb-hadoop connector to access MongoDB data directly from within Hadoop, but that won't be quite as straightforward as you might think. The time needed to figure out exactly how to do the conversion optimally will be easier to justify once you have a prototype in place, IMO.

While you will need to translate the map-reduce functions, the basic pseudocode should apply well to both systems. You won't find anything that can be done in MongoDB that can't be done in Java, or that is significantly more complex to do in Java.
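To make the translation task concrete, below is a minimal sketch of the prototype approach driven from Python with pymongo. The database, collection, and field names (`reports`, `events`, `status`) are hypothetical, and `Collection.map_reduce` only exists in older pymongo releases (it was removed in pymongo 4 along with the server-side deprecation of mapReduce), so treat it as illustrative rather than a drop-in script. The JavaScript map and reduce functions are the part you would later re-express as a Hadoop Mapper and Reducer; the logic (emit a key/value pair per document, then sum per key) carries over unchanged.

```python
# Minimal sketch: prototyping a report with MongoDB's map-reduce via pymongo.
# Database/collection/field names are hypothetical; adjust to your schema.
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
db = client["reports"]  # hypothetical database name

# The map and reduce steps are plain JavaScript strings.
mapper = Code("""
    function () {
        emit(this.status, 1);      // key: status field, value: a count of 1
    }
""")

reducer = Code("""
    function (key, values) {
        return Array.sum(values);  // sum partial counts for the same key
    }
""")

# Collection.map_reduce is available in older pymongo releases (pre-4.x).
result = db.events.map_reduce(mapper, reducer, out="status_counts")
for doc in result.find():
    print(doc["_id"], doc["value"])
```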


You can also set up a MongoDB-Hadoop connection using the mongodb-hadoop connector, which lets Hadoop jobs read from and write to MongoDB collections directly.


You can run map-reduce algorithms in Hadoop without programming them in Java. The feature is called Hadoop Streaming and works like Linux piping. If you can port your functions to read from standard input and write to standard output, it should work nicely. Here is an example blog post which shows how to use map-reduce functions written in Python with Hadoop.
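As a sketch of what a streaming job looks like, here is a minimal word-count-style example in Python: a single script that acts as mapper or reducer depending on its first argument, reading lines from standard input and writing tab-separated key/value pairs to standard output. The input path, output path, and streaming-jar location in the comment are hypothetical and vary between Hadoop distributions.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch. Example invocation (illustrative; paths vary):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#       -input /data/events -output /reports/word_counts \
#       -mapper "python wordcount.py map" \
#       -reducer "python wordcount.py reduce" \
#       -file wordcount.py
import sys


def run_mapper():
    # Emit a tab-separated (key, 1) pair per word, analogous to emit() in
    # a MongoDB map function.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))


def run_reducer():
    # Streaming delivers lines sorted by key, so counts for consecutive
    # identical keys can be summed.
    current_key, current_count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        count = int(value)
        if key == current_key:
            current_count += count
        else:
            if current_key is not None:
                print("%s\t%d" % (current_key, current_count))
            current_key, current_count = key, count
    if current_key is not None:
        print("%s\t%d" % (current_key, current_count))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        run_reducer()
    else:
        run_mapper()
```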
