Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns: patientid, featureid and featurevalue (all int). I group the relation by featureid and want to calculate the maximum feature value of each group. Here's the code: grpd = GROUP features BY featureid; DUMP grpd; temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val; It's giving me an "Invalid scalar projection: grpd" exception. I read on different forums that …
Category: Data Science

Pig is not able to read the complete data

I am trying to load a huge dataset of around 3.4 TB, with approximately 1.4 million files, in Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not getting loaded completely, and the program terminates with a Java out-of-memory exception. I've tried increasing the Pig heap size to 8192, but that hasn't worked. However, my code works fine if I use only 25% of the dataset. This is the last …
Category: Data Science

Convert date into number - Apache PIG

Imagine that I have a field called date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this: Data_ID = FOREACH File GENERATE CONCAT((chararray)SUBSTRING(Date,0,4),(chararray)SUBSTRING(Date,6,2),(chararray)SUBSTRING(Date,9,2)); But I'm getting a list of nulls... Does anyone know what I'm doing wrong? Thanks!
Topic: etl apache-pig
Category: Data Science
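
A likely cause of the nulls in the question above, offered as a hedged note rather than a confirmed answer: as far as I know, Pig's SUBSTRING takes (string, startIndex, stopIndex) with an exclusive stop index, not (string, start, length), and it returns null when the indices are out of range. Under that assumption, SUBSTRING(Date,6,2) and SUBSTRING(Date,9,2) both have a stop index below the start index, so they yield null and CONCAT propagates it. The sketch below models that behavior in Python; `pig_substring` and `date_to_number` are hypothetical helper names used only for illustration, not Pig functions.

```python
def pig_substring(s, start, stop):
    """Rough model of Pig's SUBSTRING(string, startIndex, stopIndex).

    Assumption: stop is exclusive and is clamped to the string length,
    and invalid index combinations give null (None here).
    """
    if s is None:
        return None
    end = min(len(s), stop)
    if start < 0 or start > end:
        return None  # models the null result for out-of-range indices
    return s[start:end]


def date_to_number(date):
    """Convert 'yyyy-mm-dd' to the integer yyyymmdd using start/stop indices."""
    parts = [
        pig_substring(date, 0, 4),   # year:  characters 0..3
        pig_substring(date, 5, 7),   # month: characters 5..6
        pig_substring(date, 8, 10),  # day:   characters 8..9
    ]
    if any(p is None for p in parts):
        return None  # a single null part makes the whole result null
    return int("".join(parts))


if __name__ == "__main__":
    print(pig_substring("2015-01-27", 6, 2))  # indices from the question
    print(date_to_number("2015-01-27"))       # indices as start/stop pairs
```

If that assumption holds, the corresponding index pairs in the Pig statement would be (0,4), (5,7) and (8,10).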

unable to parse XML in pig

I have an XML file that has this structure (not exactly a tree, though): <posthistory> <row Id="1" PostHistoryTypeId="2" PostId="1" RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88" CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a pound of microroasted, local coffee and am curious what the optimal way to store it is (what temperature, humidity, etc)" /> I am using Apache Pig to extract just the "Text" part with this code: grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using org.apache.pig.piggybank.storage.XMLLoader('posthistory') as(x:chararray); grunt> result = foreach A generate XPath(x, 'posthistory/Text'); This returns "()" (null) …
Category: Data Science

Extract company names/job titles from free text

I have a complete Hadoop platform with HDFS, MR, Hive, Pig, HBase, etc., plus Python, R and Java. All data sets are large. Data set A, describing the jobs of people working in a company, is composed of the following fields: Id Person: a unique alphanumeric identifier per person. Start Date: the date of entry into the position, in ISO format. End Date: the date of leaving the position, in ISO format. If the date is not given, it is the current position …
Category: Data Science

Hadoop/Pig Aggregate Data

I am working on a project with two data sets: a time vs. speed data set (let's call it traffic) and a time vs. weather data set (called weather). I am looking to find a correlation between these two sets using Pig. However, the traffic data set has its time field as D/M/Y hr:min:sec, while the weather data set has its time field as D/M/Y. Because of this, I would like to average the speed per day and put it into a …
Category: Data Science

Pig Rank function not generating rank in output

I am facing a bizarre issue while using the Apache Pig rank utility. I am executing the following code: email_id_ranked = rank email_id; store email_id_ranked into '/tmp/'; Basically, I am trying to get the following result: 1,email1 2,email2 3,email3 ... The issue is that Pig sometimes stores the above result but sometimes stores only the emails without the rank. Also, when I dump the data on screen using dump, Pig returns both columns. I don't know where the issue …
Category: Data Science
