Pig is not able to load the complete dataset
I am trying to load a huge dataset of around 3.4 TB, spread over approximately 1.4 million files (roughly 2.4 MB per file on average), with Pig on Amazon EMR. The operations on the data are simple (a JOIN and a STORE), but the data never loads completely and the job terminates with a java.lang.OutOfMemoryError.
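For context, the script is little more than the LOAD shown further down, a join against a second input, and a store. Here is a minimal sketch of its shape, in which the lookup path, output path, field names, and tab delimiter are illustrative placeholders rather than my real schema:

-- names and paths below are placeholders, not my real schema
lookup = LOAD 's3://lookup-path/' USING PigStorage('\t') AS (key:chararray, value:chararray);
keyed  = FOREACH log GENERATE FLATTEN(STRSPLIT(line, '\t', 2)) AS (key:chararray, rest:chararray);
joined = JOIN keyed BY key, lookup BY key;
STORE joined INTO 's3://output-path/' USING PigStorage('\t');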
I've tried increasing the Pig heap size to 8192 MB, but that hasn't helped. However, the same code works fine if I use only 25% of the dataset.
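For reference, I set that heap through the PIG_HEAPSIZE environment variable, which the pig launcher reads in MB. Since the SpillableMemoryManager messages in the log below come from inside the task JVMs, I suspect the task-side limits matter as much as the client heap; this is a minimal sketch of the in-script overrides, assuming the standard Hadoop 2 property names, with values that are illustrative rather than tested:

-- task-JVM memory knobs (standard Hadoop 2 property names; values illustrative)
SET mapreduce.map.memory.mb 8192;
SET mapreduce.map.java.opts '-Xmx7168m';
SET mapreduce.reduce.memory.mb 8192;
SET mapreduce.reduce.java.opts '-Xmx7168m';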
These are the last lines of the complete log file:
20124550 [Service Thread] DEBUG org.apache.pig.impl.util.SpillableMemoryManager - memory handler call - Collection threshold init = 374341632(365568K) used = 698749896(682372K) committed = 698875904(682496K) max = 698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Here's my code for loading the dataset:
log = LOAD 's3://some-path/{20151101..20151130}/{00,01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23}/' USING PigStorage('\n') AS (line:chararray);
I've also tried specifying each date individually instead of the {..} range, so the path pattern is not the issue; a sketch of that variant follows.
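The individually-specified variant looked roughly like this (only the first three days are shown here; the real statement listed all thirty):

log = LOAD 's3://some-path/{20151101,20151102,20151103}/{00,01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23}/' USING PigStorage('\n') AS (line:chararray);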
I've also tried running the job on a cluster of 35 machines, but that hasn't helped either. I could add more, but scaling out the hardware is not the kind of solution I'm looking for.
Can anyone suggest a better way to code this, or some other solution?
Topic apache-pig map-reduce apache-hadoop
Category Data Science