How do I discern document structure from differently-tagged XML documents?

I have a body of PDF documents of differing vintage.

Our group had exported the documents as text to feed them into a natural-language parser (I think) to pull out subject-verb-predicate triples.

This hasn't performed as well as hoped, so I exported the documents as XML using Acrobat Pro, hoping to capture the semantic document structure in order to pass it in as a hint to the text parser.

One document looked pretty good (something like this):

<TaggedPDF-doc>
  <bookmark-tree>...</bookmark-tree>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
  <Sect>...</Sect>
</TaggedPDF-doc>

Another one wasn't quite so nice semantically:

<TaggedPDF-doc>
  <bookmark-tree>...</bookmark-tree>
  <Part>
    <H1>(name of document)</H1>
    <P>(title of document)</P>
    <Sect>
      <H1>22 October 2013</H1>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Figure>...</Figure>
      <P id="LinkTarget_1388">PREFACE</P>
      <P>1. Scope</P>
      <P>...</P>
      <P>2. Purpose</P>
      <P>...</P>
      <P>3. Application</P>
      <L>...</L>
      <P>Intentionally Blank</P>
      <P id="LinkTarget_1389">SUMMARY OF CHANGES</P>
      <P>...</P>
      <L>...</L>
      <P>Intentionally Blank</P>
      <P id="LinkTarget_1390">TABLE OF CONTENTS</P>
      <P>...</P>
      <P>...</P>    <!-- Chapter 1 started here -->
      <P>...</P>
      <P>...</P>    <!-- Chapter 2 started here -->
      <P>...</P>
      <P>...</P>    <!-- Chapter 3 started here -->
      <P>...</P>
      <P>CHAPTER IV</P>
      <P>...</P>
      <P>(part of chapter 4 title)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Link>...</Link>
      <Link>...</Link>
      <P>...</P>
      <Link>...</Link>
      <P>Intentionally Blank</P>
      <P id="LinkTarget_1391">EXECUTIVE SUMMARY</P>
      <P>(section title)</P>
      <P>(a bullet item inside only a paragraph element)</P>
      <P>(a bullet item inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <Table>...</Table>
      <P>(some text inside only a paragraph element)</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <P>...</P>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
      <Sect>...</Sect>
    </Sect>
  </Part>
</TaggedPDF-doc>

I'm relatively new to data science, but is handling the "normalization" of this kind of data set (XML documents largely in a chapter-section-subsection format) a fairly tractable problem? Or maybe even a solved one?

These are the only two documents I've looked at so far, but I suspect each document will be enough of a snowflake that the corpus could benefit from a machine learning algorithm to bring some consistency to the tags. I picture something very basic, like nested <section title=""> tags, using the PDF XML output structure as leverage wherever I can.
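Before reaching for machine learning, a rule-based pass may get surprisingly far on documents like the second example. The sketch below (a minimal illustration, not a known library; the all-caps/"CHAPTER" heading heuristic is my own assumption about what headings look like in this corpus) walks a flat run of <P> children and wraps everything after each heading-like <P> into a nested <section title=""> element:

```python
# Hypothetical sketch: promote heading-like <P> elements in a flattened
# Acrobat export (e.g. <P>PREFACE</P>, <P>CHAPTER IV</P>) into nested
# <section title="..."> wrappers. The heading regex is an assumption
# about this corpus, not a general rule.
import re
import xml.etree.ElementTree as ET

HEADING_RE = re.compile(r"^(CHAPTER\s+[IVXLC\d]+|[A-Z][A-Z &]{3,})\s*$")

def looks_like_heading(elem):
    """True when a <P> element's text matches the all-caps heading pattern."""
    return (elem.tag == "P" and elem.text is not None
            and HEADING_RE.match(elem.text.strip()) is not None)

def sectionize(flat_parent):
    """Group a flat run of children into <section title="..."> elements."""
    root = ET.Element(flat_parent.tag, flat_parent.attrib)
    current = root  # content before the first heading stays at the top level
    for child in list(flat_parent):
        if looks_like_heading(child):
            # start a new section; the heading text becomes its title
            current = ET.SubElement(root, "section", title=child.text.strip())
        else:
            current.append(child)
    return root

# Tiny demo on a structure shaped like the second example above.
flat = ET.fromstring(
    "<Sect><P>intro text</P><P>PREFACE</P><P>body</P>"
    "<P>CHAPTER IV</P><P>more body</P></Sect>")
nested = sectionize(flat)
# "PREFACE" and "CHAPTER IV" each become a <section title="..."> wrapper
```

A learned approach could then replace `looks_like_heading` with a classifier trained on features like capitalization, length, and position, while `sectionize` stays the same.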

Clarification: I'd like to gain some insight on algorithms (possibly existing ones) that can "recognize" the semantic sections of a document that's structured as inconsistently and loosely as the second example I posted above. It would be straightforward for me to do this manually, but if there's an algorithm that does this (or even part of this) then it would be an improvement over doing it manually. I'd like the particular tag structure for the target document (I guess that's what it's called) to be truer to what kind of data it is. A subset of the HTML5 tags would work: section, article, caption, figure. OpenDocument is probably too much. But something that is a step better than calling everything a P.
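The vocabulary-translation half of this is mechanical once the structure is inferred. A minimal sketch, assuming a hand-chosen mapping from Acrobat's tagged-PDF element names to the small HTML5-ish subset mentioned above (the mapping table itself is illustrative, not a standard):

```python
# Hypothetical mapping from Acrobat tagged-PDF element names to a small
# HTML5-like vocabulary; the table entries are my assumptions, not a spec.
import xml.etree.ElementTree as ET

TAG_MAP = {
    "TaggedPDF-doc": "article",
    "Part": "section",
    "Sect": "section",
    "Figure": "figure",
    "Caption": "caption",
    "Table": "table",
    "L": "ul",   # Acrobat's list element
    "LI": "li",
    "P": "p",
}

def remap(elem):
    """Recursively rewrite tag names, keeping attributes and text intact."""
    out = ET.Element(TAG_MAP.get(elem.tag, elem.tag), elem.attrib)
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        out.append(remap(child))
    return out

src = ET.fromstring(
    '<TaggedPDF-doc><Sect><P id="LinkTarget_1388">PREFACE</P>'
    '<Figure>...</Figure></Sect></TaggedPDF-doc>')
clean = remap(src)
# unknown tags pass through unchanged; ids survive for cross-references
```

Keeping `id` attributes intact matters here, since the `LinkTarget_*` anchors in the Acrobat export are what the internal links point at.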

Any insights / pointing in the right direction?

Topic structured-data normalization text-mining machine-learning

Category Data Science
