apache-pig

Error when using MAX in Apache Pig (Hadoop)

Akshay Gupta

April 27, 2018, 08:42

I am trying to calculate the maximum value for each group in a relation in Pig. The relation has three columns: patientid, featureid and featurevalue (all int). I group the relation by featureid and want to calculate the max feature value of each group. Here's the code: grpd = GROUP features BY featureid; DUMP grpd; temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val; It's giving me an "Invalid scalar projection: grpd" exception. I read on different forums that …
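For reference, a per-group maximum in Pig is usually written with the group key and the name of the grouped bag rather than positional references. The sketch below assumes the relation is called features, as in the question, with the three int columns described there; the LOAD path and delimiter are hypothetical. It shows the conventional form only and is not a diagnosis of the error, since the rest of the question is cut off.

```pig
-- Hypothetical input path and delimiter; the schema follows the question.
features = LOAD 'features.csv' USING PigStorage(',')
           AS (patientid:int, featureid:int, featurevalue:int);

-- After GROUP, each tuple is (group, features), where features is a bag.
grpd = GROUP features BY featureid;

-- Refer to the key as "group" and project the bag by the original relation name.
temp = FOREACH grpd GENERATE group AS featureid,
                             MAX(features.featurevalue) AS val;

DUMP temp;
```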

Topic: apache-pig apache-hadoop

Category: Data Science

Pig is not able to read the complete data

shanky_thebearer

December 28, 2016, 10:45

I am trying to load a huge dataset of around 3.4 TB, with approximately 1.4 million files, in Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not loaded completely, and the program terminates with a Java out-of-memory exception. I've tried increasing the Pig heap size to 8192, but that hasn't worked. However, my code works fine if I use only 25% of the dataset. This is the last …

Topic: apache-pig map-reduce apache-hadoop

Category: Data Science

Convert date into number - Apache PIG

João_testeSW

September 1, 2016, 17:10

Imagine that I have a field called Date in this format: "yyyy-mm-dd", and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this: Data_ID = FOREACH File GENERATE CONCAT((chararray)SUBSTRING(Date,0,4),(chararray)SUBSTRING(Date,6,2),(chararray)SUBSTRING(Date,9,2)); But I'm getting a list of nulls... Does anyone know what I'm doing wrong? Thanks!
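One likely cause, offered as an assumption since the data isn't shown: Pig's SUBSTRING takes a start index and an exclusive stop index, not a start and a length. SUBSTRING(Date,6,2) therefore asks for a range that ends before it begins and yields null, which in turn makes the whole CONCAT null. A minimal sketch, assuming Date is a chararray in yyyy-mm-dd form and File is the relation from the question (date_str and date_num are illustrative names):

```pig
-- Assumes File has a chararray field Date such as '2016-09-01'.
-- SUBSTRING(string, startIndex, stopIndex): stopIndex is exclusive.
-- CONCAT is nested so it also works on Pig versions that accept only two arguments.
Data_ID = FOREACH File GENERATE
    CONCAT(CONCAT(SUBSTRING(Date, 0, 4), SUBSTRING(Date, 5, 7)),
           SUBSTRING(Date, 8, 10)) AS date_str;

-- Alternative: strip the dashes, then cast the result to a number.
Data_Num = FOREACH File GENERATE (int) REPLACE(Date, '-', '') AS date_num;
```

The REPLACE variant avoids index arithmetic altogether, at the cost of assuming the dash is the only separator in the field.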

Topic: etl apache-pig

Category: Data Science

unable to parse XML in pig

sc3339

June 28, 2016, 17:57

I have an XML file that has this structure (not exactly a tree, though): <posthistory> <row Id="1" PostHistoryTypeId="2" PostId="1" RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88" CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a pound of microroasted, local coffee and am curious what the optimal way to store it is (what temperature, humidity, etc)" /> I am using Apache Pig to extract just the "Text" part using this code: grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using org.apache.pig.piggybank.storage.XMLLoader('posthistory') as(x:chararray); grunt> result = foreach A generate XPath(x, 'posthistory/Text'); This returns "()" (null) …

Topic: apache-pig data-cleaning apache-hadoop

Category: Data Science

Extract company names/job titles from free text

user17241

February 11, 2015, 02:06

I have a complete Hadoop platform with HDFS, MR, Hive, Pig, HBase, etc., plus Python, R and Java. All data sets are large. Data set A, describing the jobs of people working in a company, is composed of the following fields: Id Person: a unique alphanumeric identifier per person. Start Date: the date, in ISO format, on which the person entered the position. End Date: the date, in ISO format, on which the person left the position. If the date is not given, it is the current position …

Topic: hbase apache-pig apache-hadoop

Category: Data Science

Hadoop/Pig Aggregate Data

BigDataDude

December 23, 2014, 19:46

I am working on a project with two data sets: a time vs. speed data set (let's call it traffic) and a time vs. weather data set (called weather). I am looking to find a correlation between these two sets using Pig. However, the traffic data set has a time field in the form D/M/Y hr:min:sec, while the weather data set has a time field in the form D/M/Y. Because of this, I would like to average the speed per day and put it into a …
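The per-day averaging described above can be sketched as follows. The file path, the delimiter, and the field names ts, speed and day_key are hypothetical, and the sketch assumes the date and time in the traffic field are separated by a single space, as the D/M/Y hr:min:sec format suggests.

```pig
-- Hypothetical path, delimiter and field names; adjust to the real data.
traffic = LOAD 'traffic.csv' USING PigStorage(',')
          AS (ts:chararray, speed:double);

-- Keep only the date part: everything before the first space in ts.
by_day = FOREACH traffic GENERATE
         SUBSTRING(ts, 0, INDEXOF(ts, ' ', 0)) AS day_key,
         speed;

-- One output row per day with the mean speed for that day.
grouped = GROUP by_day BY day_key;
daily   = FOREACH grouped GENERATE group AS day_key,
                                   AVG(by_day.speed) AS avg_speed;
```

The resulting day_key has the same D/M/Y shape as the weather data set's time field, so the two relations could then be joined on it.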

Topic: apache-pig correlation beginner apache-hadoop

Category: Data Science

Pig Rank function not generating rank in output

Ankit

August 8, 2014, 17:32

I am facing a bizarre issue while using the Apache Pig rank utility. I am executing the following code: email_id_ranked = rank email_id; store email_id_ranked into '/tmp/'; Basically, I am trying to get the following result: 1,email1 2,email2 3,email3 ... The issue is that sometimes Pig writes the above result, but sometimes it writes only the emails without the rank. Also, when I print the data on screen using the dump function, Pig returns both columns. I don't know where the issue …

Topic: apache-pig apache-hadoop bigdata

Category: Data Science
