Extracting structure and content from invoices

Question

Don Draper

March 31, 2022, 14:01

Lately I have been inspired by Rossum (https://rossum.ai/), which can extract text from invoice documents.

Do you have any ideas on how this could be implemented? It's clear that they did a lot of research to reach this performance level, but in my case I am interested in the overall approach to such problems.

If I understand correctly, the first part of the pipeline is to extract the different blocks from the document. In that case, is object detection the right approach to get bounding boxes around the blocks? I suspect it would not work well for tabular data.

If not object detection, what is the correct way to tackle the problem?

Thanks.

Topic object-detection ocr text

Category Data Science

Peter · Accepted Answer · May 22, 2019, 12:36

I think commercial applications that extract relevant details from invoices rely on fairly sophisticated algorithms. You may be right that they identify the relevant parts first and extract the details afterwards.

However, my starting point would be to get all the text from the invoice (e.g. via Tesseract). If you have a decent photo, Tesseract will be able to OCR the content. The next step would be to identify the relevant content, such as the payment amount, names, and bank account numbers. To some extent this may be possible with hardcoded rules. Alternatively, one could use NLP-style models to detect certain sequences. With some effort this should work out well, since invoices are relatively structured documents.

https://pypi.org/project/pytesseract/

https://github.com/tesseract-ocr/tesseract/wiki
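
As a rough sketch of the two steps described above (OCR first, then hardcoded rules), something like the following could work. The regular expressions, the field names, and the sample text are illustrative assumptions, not a tested invoice parser; real invoices will need patterns tuned to their layout and locale. `ocr_image` uses pytesseract's `image_to_string` and assumes Tesseract and Pillow are installed.

```python
import re
from decimal import Decimal

# Hypothetical patterns for illustration; tune them to your own invoices.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no|number|#)\.?\s*:?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
# A line that starts with "Total" or "Amount due", followed by an amount
# such as 1,234.50 (assumes "." as the decimal separator).
TOTAL_RE = re.compile(
    r"^\s*(?:total|amount due)\b[^\d\n]*(\d[\d,]*\.\d{2})",
    re.IGNORECASE | re.MULTILINE,
)
# Loose IBAN shape: country code, two check digits, then groups of up to
# four characters. This does not validate the checksum.
IBAN_RE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,4})?\b"
)


def ocr_image(path):
    """Run Tesseract on an image file and return the recognised text."""
    # Imported lazily so the rule-based part runs without Tesseract.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def extract_fields(text):
    """Apply the hardcoded rules to OCR text; missing fields are None."""
    invoice_no = INVOICE_NO_RE.search(text)
    total = TOTAL_RE.search(text)
    iban = IBAN_RE.search(text)
    return {
        "invoice_number": invoice_no.group(1) if invoice_no else None,
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
        "iban": iban.group(0).replace(" ", "") if iban else None,
    }


# Made-up text standing in for the output of ocr_image("invoice.png").
SAMPLE_TEXT = """ACME Ltd
Invoice No: INV-2022-0042
Subtotal: 1,000.00
Total: EUR 1,234.50
IBAN: DE89 3704 0044 0532 0130 00
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Rules like these break as soon as the layout or wording changes, which is where a learned model that tags sequences in the OCR output would likely pay off.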
