How to start with receipt OCR text detection

Question

noobiedatascientist

May 4, 2022, 21:01

I'm thinking about building an OCR system for digitizing receipts. The system would take a picture of a receipt as input and return classified data (total_sum = Y, date = X, etc.). My question is about how to start. My initial thought was to first detect the classes (name of the shop, receipt ID, etc.) on the image, split it accordingly, and then send each part of the image to OCR. My second idea is more NLP-based: pass the whole image to an OCR system and then do some classification on the resulting text. Which approach would be better?

Topic ocr

Category Data Science

porra · Accepted Answer · March 9, 2021, 09:28

Have you tried Tesseract or pytesseract yet? I have never worked on receipts, but I had a different use case where I needed to segment images before applying Tesseract for text recognition. For simple use cases, especially in English, Tesseract is quite powerful, so I would start with the second option and see if it's good enough. This Stack Overflow answer may give you some ideas: https://stackoverflow.com/questions/55140090/pytesseract-reading-receipt
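To make the second option concrete, here is a rough sketch of what "OCR first, then classify the text" could look like. It assumes pytesseract and Pillow are installed and that the Tesseract binary is on your PATH. The function names and the two regular expressions are just illustrative guesses on my side, not something taken from the linked answer, and real receipts will almost certainly need more patterns.

```python
import re


def ocr_receipt(image_path: str) -> str:
    """Run Tesseract on the whole receipt image and return the raw text."""
    # Imported here so the parsing code below works even without OCR installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


# Hypothetical patterns, chosen only for illustration.
# A line that starts with "total" followed by an amount such as 4.49 or 4,49.
TOTAL_RE = re.compile(
    r"^\s*total\b[^\d\n]*(\d+[.,]\d{2})", re.IGNORECASE | re.MULTILINE
)
# Dates such as 04/05/2022, 4.5.22 or 2022-05-04.
DATE_RE = re.compile(r"\b(\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")


def parse_receipt(text: str) -> dict:
    """Pull a few labeled fields out of raw OCR text; missing fields are None."""
    total = TOTAL_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "total_sum": float(total.group(1).replace(",", ".")) if total else None,
        "date": date.group(1) if date else None,
    }


if __name__ == "__main__":
    # Text typed by hand to stand in for OCR output.
    sample = "ACME STORE\n04/05/2022 21:01\nMilk 1.99\nSubtotal 4.49\nTOTAL 4.49\n"
    print(parse_receipt(sample))
```

With a real image you would call parse_receipt(ocr_receipt("receipt.jpg")), where the file name is a placeholder. If the rules turn out to be too brittle, that would be the point to look at segmenting the image first or training a text classifier.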
