Unable to parse XML in Pig
I have an XML file with this structure (not exactly a tree, though):
<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1"
RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88"
CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a
pound of microroasted, local coffee and am curious what the optimal
way to store it is (what temperature, humidity, etc)" />
I am using Apache Pig to extract just the "Text" part with this code:
grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using
org.apache.pig.piggybank.storage.XMLLoader('posthistory') as (x:chararray);
grunt> result = foreach A generate XPath(x, 'posthistory/Text');
This returns "()" (null).
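As a sanity check outside Pig, here is a minimal sketch in Python using only the standard-library xml.etree.ElementTree module. It is not a Pig solution. The sample string is the row shown above, with a closing </posthistory> tag added as an assumption so that the snippet is well-formed on its own. It suggests that "Text" is an attribute of row rather than a child element, which may be why an element path such as 'posthistory/Text' matches nothing:

```python
import xml.etree.ElementTree as ET

# Sample row copied from the question. The closing </posthistory> tag is an
# assumption added here so the string parses as a standalone document.
SAMPLE = '''<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1"
RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88"
CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a
pound of microroasted, local coffee and am curious what the optimal
way to store it is (what temperature, humidity, etc)" />
</posthistory>'''

root = ET.fromstring(SAMPLE)

# Looking for a <Text> child element of <posthistory> finds nothing,
# because the document contains no such element.
text_elements = root.findall('Text')

# The value is stored in the Text attribute of each <row> element.
texts = [row.get('Text') for row in root.findall('row')]

print(len(text_elements))
print(texts[0])
```

If the same holds in Pig, the query would presumably need to address the attribute on row instead of an element named Text, but I have not confirmed how the piggybank XPath function handles attributes.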
Upon examining the XML file, I learnt that my XML file should be in this format:
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
But my XML data file (actually a Stack Overflow data dump) is not in this format. Is there a way to impose the tree structure? What is wrong with my Pig query?
Topic apache-pig data-cleaning apache-hadoop
Category Data Science