Unable to parse XML in Pig

I have an XML file with this structure (not exactly a tree, though):

<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1"
RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88"
CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a
pound of microroasted, local coffee and am curious what the optimal
way to store it is (what temperature, humidity, etc)" />

I am using Apache Pig to extract just the "Text" part with this code:

grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using
org.apache.pig.piggybank.storage.XMLLoader('posthistory') as (x:chararray);

grunt> result = foreach A generate XPath(x, 'posthistory/Text');

This returns "()" (null).

Upon examining the XML file, I learnt that my XML file should be in this format:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

But my XML data file (actually a Stack Overflow data dump) is not in this format. Is there a way to impose the tree structure? What is wrong with my Pig query?



This XPath will look for a tag called <Text> inside a tag called <posthistory>:

XPath(x, 'posthistory/Text');

What you want is the Text attribute of the row tag inside the posthistory tag.

An XPath like this will do that: /posthistory/row/@Text

See the example here: http://www.xpathtester.com/xpath/bac9874ec344f9d8ebcfb250633aaf65 and click "Test" to see the result set.

Read up on XPath notation for more.
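
For reference, here is a minimal sketch of how the query from the question might look with the attribute path. It assumes piggybank.jar is reachable under that name, that the XPath UDF lives at org.apache.pig.piggybank.evaluation.xml.XPath, and that the UDF in your Pig version evaluates attribute steps such as @Text. None of that is confirmed in this thread, so treat it as something to try rather than a guaranteed fix.

```pig
-- Assumption: piggybank.jar is in the working directory; adjust the path as needed.
REGISTER piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Load each <posthistory> element as one chararray, as in the question.
A = LOAD 'hdfs:///parsingdemo/PostHistory.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('posthistory')
    AS (x:chararray);

-- Select the Text attribute of <row>, not a <Text> child element.
result = FOREACH A GENERATE XPath(x, '/posthistory/row/@Text');
DUMP result;
```

If this still returns an empty tuple, checking the same path against a small sample of the file in an XPath tester helps separate a wrong path from a limitation of the UDF.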


Use a regular expression. The following is a generic format:

 foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<child>\\s*<subchild1>(.*)</subchild1>\\s*<subchild2>(.*)</subchild2>\\s*</child>'));
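
Applied to the data in the question, where Text is an attribute of a self-closing row tag rather than a child element, the pattern would need to target the attribute instead. The sketch below is one way to do that. It assumes each row element sits on a single physical line of the file, which is why it reads the file with the built-in TextLoader instead of XMLLoader; the relation names lines, texts and non_empty are illustrative only.

```pig
-- Assumption: every <row ... /> element occupies exactly one line of the file.
lines = LOAD 'hdfs:///parsingdemo/PostHistory.xml'
        USING TextLoader() AS (line:chararray);

-- Capture everything between the quotes of the Text attribute.
-- The leading space keeps the match on an attribute named exactly Text.
texts = FOREACH lines GENERATE
        REGEX_EXTRACT(line, ' Text="([^"]*)"', 1) AS text;

-- Lines without a Text attribute (the XML declaration, the wrapper tags) yield null.
non_empty = FILTER texts BY text IS NOT NULL;
DUMP non_empty;
```

Note that a regular expression returns the raw attribute value, so XML entities such as &quot; or &amp; stay escaped and would need a separate cleanup step.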
