How to start with receipt OCR text detection

I'm thinking about building an OCR system for digitizing receipts. As input, the system would take a picture of a receipt and return classified data (total_sum = Y, date = X, etc.). My question is how to start. My initial thought was to first detect the classes (name of the shop, receipt id, etc.) on the image, split it into regions, and then send each region to OCR. My second idea is more NLP-based: pass the whole image to an OCR system and then classify the resulting text. Which approach would be better?


Have you tried Tesseract or pytesseract yet? I have never worked on receipts, but I had a different use case where I needed to segment images before applying Tesseract for text recognition. For simple use cases, especially in English, Tesseract is quite powerful, so I would start with the second option and see whether it is good enough. This Stack Overflow answer may give you some ideas: https://stackoverflow.com/questions/55140090/pytesseract-reading-receipt
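To make the second option concrete, here is a rough sketch of "OCR the whole image first, then pick fields out of the text". The regular expressions, the field names, the day-first date order, and the helper names `ocr_image` and `parse_receipt` are my own assumptions for illustration, not something taken from a real receipt format. It assumes pytesseract and Pillow are installed; the parsing part runs without them.

```python
import re
from datetime import datetime

# Hypothetical patterns: real receipts vary a lot by shop and locale.
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*(\d+[.,]\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def ocr_image(path):
    """Run Tesseract on an image file and return the raw text."""
    # Imported lazily so the parsing code works even without OCR installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def parse_receipt(text):
    """Extract a few fields from OCR text; missing fields stay None."""
    fields = {"total_sum": None, "date": None}

    # Use the last "total" line, assuming the grand total is printed last.
    totals = TOTAL_RE.findall(text)
    if totals:
        fields["total_sum"] = float(totals[-1].replace(",", "."))

    # Assumes day-first dates such as 14/03/2019.
    match = DATE_RE.search(text)
    if match:
        day, month, year = map(int, match.groups())
        try:
            fields["date"] = datetime(year, month, day).date().isoformat()
        except ValueError:
            pass  # OCR noise produced an impossible date; leave it as None

    return fields
```

You would call it as `parse_receipt(ocr_image("receipt.jpg"))`, where the file name is just a placeholder. If rules like these turn out to be too brittle on your receipts, that is the point where segmenting the image first or training a text classifier becomes worth the extra effort.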
