What is the difference between Hadoop and NoSQL?

I have heard about many tools and frameworks that help people process their data in a big data environment.

One is called Hadoop and the other is the NoSQL concept. How do they differ in terms of processing?

Are they complementary?



Hadoop is not a database; it is an entire ecosystem.


Most people mean MapReduce jobs when they talk about Hadoop. A MapReduce job splits a big dataset into small chunks and spreads them over a cluster of nodes to be processed. In the end, the results from each node are combined into one dataset.


Let's assume you load into Hadoop a set of <String, Integer> pairs holding the populations of neighborhoods in several cities, and you want the average neighborhood population for each city (figure 1).

figure 1

    [new york, 40394]
    [new york, 134]
    [la, 44]
    [la, 647]
    ...

Hadoop first maps the records to key-value pairs and groups the values by key (figure 2).

figure 2

    [new york, [40394, 134]]
    [la, [44, 647]]
    ...

After the mapping, it reduces the values of each key to a single new value, in this example the average of each key's value set, rounded to a whole number (figure 3).

figure 3

    [new york, [20264]]
    [la, [346]]
    ...

Now Hadoop is done. You can store the result in HDFS (the Hadoop Distributed File System), or load it into any DBMS or file.
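The three steps from the figures can be sketched in plain Python. This is only an illustration that runs on one machine, not Hadoop's actual API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this example, and a real job would run these steps spread over many nodes.

```python
from collections import defaultdict

# Sample records from figure 1: (city, neighborhood population).
records = [("new york", 40394), ("new york", 134), ("la", 44), ("la", 647)]

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for city, population in records:
        yield city, population

def shuffle_phase(pairs):
    """Group all values that share the same key (figure 2)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Collapse each list of values into its average (figure 3)."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}

grouped = shuffle_phase(map_phase(records))
averages = reduce_phase(grouped)
print(grouped)
print(averages)
```

Note that the exact average for "la" is 345.5; figure 3 shows it rounded to 346.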

That's just one very basic and simple example of what Hadoop can do. You can run much more complicated tasks in Hadoop.

As you suggested in your question, Hadoop and NoSQL are complementary. I know of a few setups where, for example, billions of sensor records are stored in HBase, processed with Hadoop, and finally stored in a DBMS.


NoSQL is a way of storing data that does not require the data to be relational. It is known for the simplicity of its design and its horizontal scalability. One common way NoSQL systems store data is the key-value pair design, which lends itself to processing similar to Hadoop's. Whether to use a NoSQL database really depends on the type of problem you are trying to solve.
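To make the key-value design and horizontal scaling a bit more concrete, here is a toy sketch in Python. It is not how any particular NoSQL product works; the class `ShardedKeyValueStore` and the sensor keys are invented for illustration, and the "shards" are plain in-memory dictionaries standing in for separate machines.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, shard_count):
        # Each dict stands in for one machine in a cluster (an assumption).
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedKeyValueStore(shard_count=3)
# Values need no fixed schema: the two records carry different fields.
store.put("sensor:17", {"temperature": 21.5, "unit": "C"})
store.put("sensor:18", {"temperature": 19.0})
print(store.get("sensor:17"))
```

Because every lookup goes through the key alone, adding capacity is mostly a matter of adding shards, which is the intuition behind scaling out horizontally.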

Wikipedia has a good overview: NoSQL

Hadoop is a system meant to store and process huge amounts of data. At its core is a distributed file system (HDFS). Central to its design is the assumption that hardware failures are common, so it makes multiple copies of each piece of information and spreads them across multiple machines and racks. If one machine goes down, that is no problem, because the other copies are still there (two more with the default of three replicas). Wikipedia also has a great article on Hadoop; you will see that it is, in my opinion, more than just storage but also processing: Hadoop
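The replication idea can be sketched as follows. This is a simplified illustration, not HDFS's real placement policy (which is rack-aware); the functions `place_replicas` and `readable_blocks` and the node and block names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
REPLICATION_FACTOR = 3  # HDFS default; configurable in a real cluster

def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]
placement = place_replicas(blocks, nodes)

# Even with two of the four nodes down, every block keeps a live copy.
survivors = readable_blocks(placement, failed_nodes={"node-b", "node-c"})
print(survivors == set(blocks))
```

With three copies on three distinct nodes, a block only becomes unreadable if all three of its holders fail at once.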
