I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns: patientid, featureid and featurevalue (all int). I group the relation by featureid and want to calculate the max feature value of each group; here's the code: grpd = GROUP features BY featureid; DUMP grpd; temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val; It's giving me an "Invalid scalar projection: grpd" exception. I read on different forums that …
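For reference, a minimal sketch of the usual GROUP/MAX pattern, assuming the loaded relation is named features with the schema described above (the input path and delimiter below are assumptions): inside the FOREACH you refer to the grouping key as "group" and to the grouped bag by the inner relation's name (or positionally as $1), not by the name of the grouped relation.
-- hypothetical load; the path and delimiter are assumptions
features = LOAD 'features_data' USING PigStorage(',') AS (patientid:int, featureid:int, featurevalue:int);
grpd = GROUP features BY featureid;
-- refer to the key as "group" and to the bag by the inner relation's name (or $1)
temp = FOREACH grpd GENERATE group AS featureid, MAX(features.featurevalue) AS val;
DUMP temp;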
I am trying to load a huge dataset of around 3.4 TB spread over approximately 1.4 million files in Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not getting loaded completely, and the program terminates with a Java out-of-memory exception. I've tried increasing the Pig heap size to 8192, but that hasn't worked; however, my code works fine if I use only 25% of the dataset. This is the last …
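As a starting point, a sketch of settings that are often tuned in this situation, assuming a YARN-based EMR cluster; the specific values below are guesses, and whether they help depends on whether the OOM occurs in the client while planning splits over ~1.4 million files or inside the map/reduce tasks.
-- per-task memory settings (values are assumptions; adjust to your instance types)
SET mapreduce.map.memory.mb 6144;
SET mapreduce.map.java.opts '-Xmx5120m';
SET mapreduce.reduce.memory.mb 8192;
SET mapreduce.reduce.java.opts '-Xmx6656m';
-- with very many small files, combining them into larger splits can relieve memory pressure
SET pig.splitCombination 'true';
SET pig.maxCombinedSplitSize 268435456;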
Imagine that I have a field called Date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this: Data_ID = FOREACH File GENERATE CONCAT((chararray)SUBSTRING(Date,0,4),(chararray)SUBSTRING(Date,6,2),(chararray)SUBSTRING(Date,9,2)); But I'm getting a list of nulls... Does anyone know what I'm doing wrong? Thanks!
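One thing to check: Pig's SUBSTRING(str, startIndex, stopIndex) takes a stop index, not a length, so SUBSTRING(Date,6,2) and SUBSTRING(Date,9,2) yield null. A sketch of two alternatives, assuming Date is a chararray in yyyy-mm-dd form (the load path, delimiter and output alias names are assumptions):
-- hypothetical load; the path and delimiter are assumptions
File = LOAD 'dates_data' USING PigStorage(',') AS (Date:chararray);
-- option 1: corrected stop indexes, with CONCAT nested (older Pig versions only accept two arguments)
Data_ID = FOREACH File GENERATE CONCAT(CONCAT(SUBSTRING(Date,0,4), SUBSTRING(Date,5,7)), SUBSTRING(Date,8,10)) AS yyyymmdd;
-- option 2: simply strip the dashes and cast the result
Data_ID2 = FOREACH File GENERATE (int)REPLACE(Date, '-', '') AS yyyymmdd_num;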
I have an XML file with this structure (not exactly a tree, though): <posthistory> <row Id="1" PostHistoryTypeId="2" PostId="1" RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88" CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a pound of microroasted, local coffee and am curious what the optimal way to store it is (what temperature, humidity, etc)" /> I am using Apache Pig to extract just the "Text" part using this code: grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using org.apache.pig.piggybank.storage.XMLLoader('posthistory') as(x:chararray); grunt> result = foreach A generate XPath(x, 'posthistory/Text'); This returns "()" (null) …
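For what it's worth, Text is an attribute of the row element rather than a child element, so an element-style path like 'posthistory/Text' matches nothing. A sketch of one workaround: load at the row level and pull the attribute out with REGEX_EXTRACT (the path and tag name come from the question; the piggybank jar location and the regex are assumptions).
-- the piggybank jar location is an assumption; adjust to your install
REGISTER /usr/lib/pig/piggybank.jar;
A = LOAD 'hdfs:///parsingdemo/PostHistory.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);
-- capture whatever sits between Text=" and the next double quote
result = FOREACH A GENERATE REGEX_EXTRACT(x, 'Text="([^"]*)"', 1) AS text;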
I have a complete Hadoop platform with HDFS, MR, Hive, Pig, HBase, etc., plus Python, R, and Java. All data sets are large. Data set A, describing the jobs of people working in a company, is composed of the following fields: Id Person: a unique alphanumeric identifier per person. Start Date: an ISO-format date for when the person started the position. End Date: an ISO-format date for when the person left the position. If the date is not given, it is the current position …
I am working on a project with two data sets: a time vs. speed data set (let's call it traffic), and a time vs. weather data set (called weather). I am looking to find a correlation between these two sets using Pig. However, the traffic data set's time field is in the format D/M/Y hr:min:sec, while the weather data set's time field is just D/M/Y. Because of this I would like to average the speed per day and put it into a …
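A sketch of the daily-averaging step followed by the join; the load paths, delimiters and field names (time, speed, day, condition) are assumptions, and the date part of the traffic timestamp is taken to be everything before the first space.
-- hypothetical loads; paths, delimiters and field names are assumptions
traffic = LOAD 'traffic_data' USING PigStorage(',') AS (time:chararray, speed:double);
weather = LOAD 'weather_data' USING PigStorage(',') AS (day:chararray, condition:chararray);
-- keep only the D/M/Y part of the traffic timestamp
traffic_days = FOREACH traffic GENERATE REGEX_EXTRACT(time, '^(\\S+)', 1) AS day, speed;
by_day = GROUP traffic_days BY day;
avg_speed = FOREACH by_day GENERATE group AS day, AVG(traffic_days.speed) AS avg_speed;
-- join the daily averages to the weather relation on the shared day key
joined = JOIN avg_speed BY day, weather BY day;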
I am facing a bizarre issue while using the Apache Pig RANK operator. I am executing the following code: email_id_ranked = rank email_id; store email_id_ranked into '/tmp/'; So, basically I am trying to get the following result: 1,email1 2,email2 3,email3 ... The issue is that sometimes Pig writes out the above result, but sometimes it writes only the emails without the rank. Also, when I dump the data on screen using the DUMP operator, Pig returns both columns. I don't know where the issue …
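One workaround worth trying, sketched below: project the rank into an explicitly named field before storing, so the stored relation always carries it as an ordinary column (the load path, alias names and output path are assumptions, not from the question).
-- hypothetical load; path and schema are assumptions
email_id = LOAD 'emails' AS (email:chararray);
email_id_ranked = RANK email_id;
-- RANK prepends the rank as the first column; name it explicitly before storing
with_rank = FOREACH email_id_ranked GENERATE $0 AS rank_id, $1 AS email;
STORE with_rank INTO '/tmp/email_id_ranked' USING PigStorage(',');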