Output produced by the Python Dumbo API is not getting distributed to all nodes of the cluster

All of the output files end up on the node from which I run the Dumbo commands. For example, suppose there is a node named hvs on which I ran the script:

dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL

When I inspect my file system, I find that all the files produced have accumulated on the hvs node alone.
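For what it's worth, this is roughly how I check where the blocks land (hdfs fsck reports each block's replica locations; the path is the output directory from the command above):

hdfs fsck /user/train_hdfs5.mseq -files -blocks -locations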

Ideally, I'd like the files to be distributed across the cluster; as it stands, my data is not balanced across the nodes. Can anyone advise me on how to fix this?

Topic: map-reduce, python, apache-hadoop, bigdata

Category: Data Science


In the Hadoop etc configuration folder, hdfs-site.xml had the replication factor (dfs.replication) set to 1, which is why all the files were being saved on a single node. I changed it and the problem was resolved.
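For anyone hitting the same issue, the change looks roughly like this in hdfs-site.xml (a sketch; a value of 3 is the common default and assumes your cluster has at least three DataNodes):

<configuration>
  <property>
    <!-- Number of replicas HDFS keeps per block; was 1, which kept each block on a single node -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Note that dfs.replication only applies to files written after the change; files that already exist keep their old replication factor unless you raise it explicitly, for example with hdfs dfs -setrep -w 3 /user/train_hdfs5.mseq.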
