What is the best approach to extract keys/values from documents?
I am thinking of training a model to automatically extract information from more or less structured documents like invoices.
Here are the main challenges regarding this task:
1. Even though invoices are often called structured documents, their layouts vary a lot depending on the industry, company, and other factors, which makes it almost impossible to achieve good results with pattern matching alone.
2. The most straightforward way to get textual information from documents is to run an OCR engine like Tesseract or EasyOCR (or, probably, a commercial solution from Google, Amazon, or Microsoft). Unfortunately, more often than not, the output of OCR libraries is rather noisy, largely because of poor scan quality.
3. Some fields can appear in a document multiple times. This is especially true for line items.
4. Some values might belong to several classes (entities) at the same time. For example, extracting addresses and company names is definitely not a one-to-one mapping, because a document can mention several companies, each with its own address and other details.
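As a small illustration of the OCR-noise challenge: exact string matching breaks as soon as the engine confuses a character or two, so some kind of approximate matching seems unavoidable. Below is a minimal sketch using `difflib` from the Python standard library. The phrase list, the `match_key` helper, and the 0.8 threshold are my own assumptions for illustration, not something taken from an existing tool.

```python
from difflib import SequenceMatcher

# Hypothetical key phrases; a real list would depend on the invoices at hand.
KEY_PHRASES = ["total amount", "invoice date", "invoice number"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_key(ocr_text: str, threshold: float = 0.8):
    """Return the best-matching key phrase, or None if nothing is close enough."""
    best = max(KEY_PHRASES, key=lambda k: similarity(ocr_text, k))
    return best if similarity(ocr_text, best) >= threshold else None


# Typical OCR confusions: "l" read as "1", "o" read as "0".
print(match_key("Tota1 am0unt"))
print(match_key("Customer"))
```

The threshold is a guess and would need tuning on real scans: too low and unrelated text starts matching, too high and noisy keys are missed.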
That said, what would you recommend as a possible approach to this problem? I understand that, given (1)-(4), I should not expect a perfect solution, but I am looking for something that would at least make sense to try.
My current thoughts:
This problem has a lot in common with traditional NER tasks, but NER models are most often trained on sentences that carry a lot of semantic context, not on bare key/value pairs.
I am almost 100% certain that without some rules, even some hardcoded ones, it will be extremely difficult to get something useful. One such idea is to use a pre-trained NER model, extract standard entities like MONEY, DATE, COMPANY, etc., and then just look in the neighborhood of these entities (in terms of coordinates, I mean) to check whether phrases like "Total amount" or "Invoice date" are nearby.
Probably, it makes sense to use both bounding boxes and text values for labeling and further model building. On the other hand, it would be quite challenging to do that labeling because some entities might change their location from document to document, span over several lines, etc.
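To make the neighborhood idea more concrete, here is a rough sketch of what I have in mind. It assumes the OCR step already gives me text with bounding boxes and that a NER model has already tagged the value tokens. The `Token` class, the `nearest_key` helper, the 200-pixel cutoff, and the rule that a label is never below or to the right of its value are all my own assumptions.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple


@dataclass
class Token:
    """OCR text plus its bounding box in pixels (origin at the top-left corner)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def nearest_key(value: Token, keys: Sequence[Token],
                max_dist: float = 200.0) -> Optional[Token]:
    """Return the closest key phrase that is not below or to the right of `value`."""
    vx, vy = value.center
    best, best_dist = None, max_dist
    for key in keys:
        kx, ky = key.center
        # Assumption: a label is never printed below or to the right of its value.
        if ky > value.y1 or key.x0 > value.x1:
            continue
        dist = hypot(vx - kx, vy - ky)
        if dist <= best_dist:
            best, best_dist = key, dist
    return best


# Made-up coordinates, purely for illustration.
keys = [Token("Invoice date", 50, 100, 150, 120),
        Token("Total amount", 50, 400, 160, 420)]
date_value = Token("2021-03-14", 170, 100, 250, 120)  # e.g. tagged DATE by NER
line_item = Token("300.00", 500, 250, 560, 270)       # MONEY, but far from any key

match = nearest_key(date_value, keys)
print(match.text if match else None)
print(nearest_key(line_item, keys))
```

A MONEY value with no key phrase within the cutoff stays unlabeled, which is how I would hope to tell a line-item price apart from the invoice total. Whether a fixed pixel distance survives different scan resolutions and layouts is exactly what I am unsure about.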
What should I read/try to have a better understanding of how to approach this problem?
Thanks in advance!