How many people can use a single Hadoop cluster at one time?

How many people can use a single Hadoop cluster at one time? I'm asking because I need to figure out whether a single 5- or 10-node cluster would be sufficient to host a class of 12 to 24 students.

Also, I am wondering whether the per-node specifications for a high-end educational cluster should match those of a typical production cluster (i.e. 64-128 GB RAM, 24 TB of disk, 8 cores, etc.). I expect the datasets students will use to range from roughly 20 MB up to about 500 GB; we will ultimately be working on real problems and datasets, even if they aren't exactly considered big data.

Tags: data, apache-hadoop, education


The bottleneck depends on the usage pattern rather than the raw number of users. If people are running I/O-heavy workloads, you won't fit many concurrent users at all; if they are running small jobs or just using the cluster as a simple data lake, it could comfortably host the number of students you're describing.
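If you do share one cluster across a whole class, YARN's Capacity Scheduler can keep any single student from monopolizing it. Here is a minimal capacity-scheduler.xml sketch; the queue name "students" and the specific percentages are assumptions to adapt, not settings taken from the question:

```xml
<configuration>
  <!-- Route all class work through a single "students" queue (hypothetical name). -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>students</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.students.capacity</name>
    <value>100</value>
  </property>
  <!-- Guarantee each active user at least ~10% of the queue, so roughly
       10 students can hold containers at the same time. -->
  <property>
    <name>yarn.scheduler.capacity.root.students.minimum-user-limit-percent</name>
    <value>10</value>
  </property>
  <!-- Cap any single user at 1x the queue's configured capacity. -->
  <property>
    <name>yarn.scheduler.capacity.root.students.user-limit-factor</name>
    <value>1</value>
  </property>
</configuration>
```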

Have you considered using AWS?
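On AWS, a transient EMR cluster can be created before each lab session and terminated afterwards, so you only pay for class hours. A hedged AWS CLI sketch; the cluster name, release label, instance type/count, and key pair below are placeholders, not recommendations:

```sh
# Spin up a small Hadoop/Spark cluster for a lab session.
aws emr create-cluster \
  --name "hadoop-class" \
  --release-label emr-6.10.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 5 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair

# Tear it down when the session ends (use the cluster ID returned above).
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
```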


For initial learning you could very easily do proof-of-concept work on individual VMs (a 4 GB VM running a pseudo-distributed cluster is enough for the basic MapReduce examples; see the smoke test below). If you're going to use Spark, I would lean towards more memory per node if it fits the budget. I'd also look at a larger number of lower-cost nodes: a stack of Intel i7 NUCs (or similar) with 2 TB consumer SSDs and 32 GB of RAM costs less than $1k per node, and 10-15 of them could easily handle a class of 24 students.
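As a concrete starting point, the word-count example that ships with Hadoop is enough to verify a pseudo-distributed single-node install. A minimal sketch; it assumes you are working from $HADOOP_HOME, and the examples jar version will vary with your release:

```sh
# Load some sample text (the bundled config files) into HDFS.
hdfs dfs -mkdir -p /user/student/input
hdfs dfs -put etc/hadoop/*.xml /user/student/input

# Run the bundled word-count MapReduce job.
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/student/input /user/student/output

# Inspect the first few lines of the result.
hdfs dfs -cat /user/student/output/part-r-00000 | head
```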
