How do I discern document structure from differently-tagged XML documents?
I have a body of PDF documents of differing vintage.
Our group had exported the documents as text to feed them into a natural-language parser (I think) to pull out subject-verb-predicate triples.
This hasn't performed as well as hoped, so I exported the documents as XML using Acrobat Pro, hoping to capture the semantic document structure so I can pass it to the text parser as a hint.
One document looked pretty good (something like this):
<TaggedPDF-doc>
    <bookmark-tree>...</bookmark-tree>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
    <Sect>...</Sect>
</TaggedPDF-doc>
Another one wasn't quite so nice semantically:
<TaggedPDF-doc>
    <bookmark-tree>...</bookmark-tree>
    <Part>
        <H1>(name of document)</H1>
        <P>(title of document)</P>
        <Sect>
            <H1>22 October 2013 </H1>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <Figure>...</Figure>
            <P id="LinkTarget_1388">PREFACE </P>
            <P>1. Scope </P>
            <P>...</P>
            <P>2. Purpose </P>
            <P>...</P>
            <P>3. Application </P>
            <L>...</L>
            <P>Intentionally Blank </P>
            <P id="LinkTarget_1389">SUMMARY OF CHANGES </P>
            <P>...</P>
            <L>...</L>
            <P>Intentionally Blank </P>
            <P id="LinkTarget_1390">TABLE OF CONTENTS </P>
            <P>...</P>
            <P>...</P> <!-- Chapter 1 started here -->
            <P>...</P>
            <P>...</P> <!-- Chapter 2 started here -->
            <P>...</P>
            <P>...</P> <!-- Chapter 3 started here -->
            <P>...</P>
            <P>CHAPTER IV </P>
            <P>...</P>
            <P>(part of chapter 4 title)</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <Link>...</Link>
            <Link>...</Link>
            <P>...</P>
            <Link>...</Link>
            <P>Intentionally Blank </P>
            <P id="LinkTarget_1391">EXECUTIVE SUMMARY </P>
            <P>(section title)</P>
            <P>(a bullet item inside only a paragraph element)</P>
            <P>(a bullet item inside only a paragraph element)</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>(some text inside only a paragraph element)</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>(some text inside only a paragraph element)</P>
            <P>...</P>
            <Table>...</Table>
            <P>(some text inside only a paragraph element)</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <P>...</P>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
            <Sect>...</Sect>
        </Sect>
    </Part>
</TaggedPDF-doc>
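To illustrate what a first pass over that flat run of <P> elements might look like, here is a sketch of a heuristic heading scan using Python's standard-library ElementTree. The regex and the "short all-caps line is probably a heading" rule are my own assumptions, meant to be tuned against the real corpus, not a known algorithm:

```python
import re
import xml.etree.ElementTree as ET

# Assumed heuristic: a <P> whose text is either "CHAPTER <roman/arabic>"
# or a short all-caps run (e.g. "PREFACE", "EXECUTIVE SUMMARY") is a
# heading candidate. Pattern and length bounds are guesses to tune.
HEADING_RE = re.compile(r"^(CHAPTER\s+[IVXLC\d]+|[A-Z][A-Z .]{3,60})\s*$")

def find_headings(root):
    """Yield (index, text) for <P> elements that look like headings."""
    for i, el in enumerate(root.iter("P")):
        text = "".join(el.itertext()).strip()
        if text and HEADING_RE.match(text):
            yield i, text

# Toy fragment mimicking the Acrobat export above.
xml = """<Sect>
  <P>CHAPTER IV</P>
  <P>Some body text that should not match.</P>
  <P>EXECUTIVE SUMMARY</P>
</Sect>"""
root = ET.fromstring(xml)
print(list(find_headings(root)))  # [(0, 'CHAPTER IV'), (2, 'EXECUTIVE SUMMARY')]
```

A scan like this would at least recover the chapter boundaries that the export left as anonymous paragraphs; font-size or position attributes, if Acrobat emits them, would make the rule much more robust.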
I'm relatively new to data science, but is handling the "normalization" of this kind of data set (XML documents largely in a chapter-section-subsection format) a fairly tractable problem? Or maybe even a solved one?
These are the only two documents I've looked at so far, but I suspect each document will be enough of a snowflake that the collection could benefit from a machine learning algorithm to bring some consistency to the tags. I picture something very basic, like nested <section title=""> tags, using the PDF XML output structure as leverage wherever I can.
Clarification: I'd like some insight into algorithms (ideally existing ones) that can "recognize" the semantic sections of a document that is structured loosely and inconsistently, like the second example I posted above. It would be straightforward for me to do this manually, but an algorithm that does this (or even part of this) would be an improvement over doing it by hand. I'd also like the tag structure of the target document (I guess that's what it's called) to be truer to the kind of data it contains. A subset of the HTML5 tags would work: <section>, <article>, <caption>, <figure>. OpenDocument is probably too much. But something that is a step better than calling everything a <P>.
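As a concrete sketch of that target shape, the flat paragraph stream could be folded into nested <section title=""> elements with a small stack-based pass. Everything here is an assumption for illustration: the `heading_level` callback is a stand-in for whatever classifier or heuristic identifies headings, and the one-level logic would need real level inference for subsections:

```python
import xml.etree.ElementTree as ET

def nest_sections(flat_root, heading_level):
    """Fold a flat run of elements into nested <section> tags.

    `heading_level` maps an element to a heading level (1, 2, ...) or
    None for body content. Uses a stack of open sections: a heading at
    level N closes any open section at level >= N, then opens a new one.
    """
    out = ET.Element("article")
    stack = [(0, out)]  # (level, container); level 0 is the root
    for el in list(flat_root):
        level = heading_level(el)
        if level is None:
            stack[-1][1].append(el)  # body content goes in current section
            continue
        while stack[-1][0] >= level:
            stack.pop()  # close sections at the same or deeper level
        sect = ET.SubElement(stack[-1][1], "section",
                             title="".join(el.itertext()).strip())
        stack.append((level, sect))
    return out

# Toy input mimicking the Acrobat export: headings are just <P>s.
flat = ET.fromstring(
    "<Sect><P>CHAPTER IV</P><P>body</P><P>CHAPTER V</P><P>more</P></Sect>")

def heading_level(el):
    # Hypothetical rule standing in for a real heading classifier.
    return 1 if "".join(el.itertext()).startswith("CHAPTER") else None

tree = nest_sections(flat, heading_level)
print(ET.tostring(tree, encoding="unicode"))
```

This produces two sibling `<section title="CHAPTER ...">` elements each holding its body paragraphs, which is roughly the chapter-section-subsection normalization described above; a learned heading classifier could simply be dropped in as `heading_level`.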
Any insights, or a pointer in the right direction?
Topic: structured-data, normalization, text-mining, machine-learning
Category: Data Science