Output produced by the Dumbo Python API is not distributed across all nodes of the cluster
When I run a Dumbo command, all of the output files end up on the node from which I ran it. For example, I ran the following script on a node named hvs:
dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL
When I inspect my file system, I find that all of the output files have accumulated on the hvs node only.
Ideally, I'd like the files to be distributed throughout the cluster, but my data is not being balanced across the nodes. Can anyone advise me on how to fix this?
Topic map-reduce python apache-hadoop bigdata
Category Data Science