Output produced by the Python Dumbo API is not getting distributed to all nodes of the cluster

All of the output files end up on the node from which I run the Dumbo commands. For example, suppose there is a node named hvs on which I ran the script:

dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL

When I inspect my file system, I find that all the files produced have accumulated on the hvs node alone.
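For what it's worth, this is roughly how I check where the blocks land (hdfs fsck reports each block's replica locations; the path is the output directory from the command above):

hdfs fsck /user/train_hdfs5.mseq -files -blocks -locations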

Ideally, I'd like the files to be distributed across the cluster; as it stands, my data is not balanced across the nodes. Can anyone advise me on how to fix this?

Topic: map-reduce, python, apache-hadoop, bigdata

Category: Data Science


In the Hadoop etc configuration folder, hdfs-site.xml had the replication factor (dfs.replication) set to 1, which is why all the files were being saved on a single node. I changed it and the problem was resolved.
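For anyone hitting the same issue, the change looks roughly like this in hdfs-site.xml (a sketch; a value of 3 is the common default and assumes your cluster has at least three DataNodes):

<configuration>
  <property>
    <!-- Number of replicas HDFS keeps per block; was 1, which kept each block on a single node -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Note that dfs.replication only applies to files written after the change; files that already exist keep their old replication factor unless you raise it explicitly, for example with hdfs dfs -setrep -w 3 /user/train_hdfs5.mseq.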
