Named entity recognition on misspelled words produced by OCR

Question


user702846

February 10, 2022, 12:54

I need to do entity recognition on a set of text data. There are two important aspects here.

The text data is produced by OCR, which in fact contains tons of misspelled words. For example,

it produces

Stabhylooocjs lve vit Salnomela can not lve on cober surfcs

chikens gut i ful of Strebt0cus but not if hey get fd wih Aectat

Nucopactirun is he seond bet berklorabe producer

instead of

Staphylococcus live with Salmonella can not live on copper surfaces

Chickens gut is full of Streptococcus but not if they get fed with Acetate

Mycobacterium is the second best Perchlorate producer

When it comes to entity recognition, these are specific entities (species, molecules, bacteria, ...), meaning I would like the NLU/NLP model to tag the bolded words into these categories.

As you can imagine, correcting the misspelled words is not that important to me; what matters is tagging those highlighted words with the right categories. But since I have a dictionary of those words, I thought I could use it.
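To make the dictionary idea concrete, this is roughly the kind of post-processing I have in mind. It is only a minimal sketch using fuzzy string matching from Python's standard library (`difflib.get_close_matches`), not a trained NER model. The dictionary contents, the label names, the function name `tag_entities`, and the `cutoff` and `min_len` values are all placeholder assumptions of mine.

```python
import re
from difflib import get_close_matches

# Hypothetical dictionary: canonical term -> entity label.
# The real dictionary would be much larger; the labels are placeholders.
ENTITY_DICT = {
    "Staphylococcus": "BACTERIA",
    "Salmonella": "BACTERIA",
    "Streptococcus": "BACTERIA",
    "Mycobacterium": "BACTERIA",
    "Acetate": "MOLECULE",
    "Perchlorate": "MOLECULE",
}

# Lowercased lookup so that matching ignores case.
_LOOKUP = {term.lower(): term for term in ENTITY_DICT}


def tag_entities(text, cutoff=0.6, min_len=5):
    """Return (ocr_token, canonical_term, label) for each token that is
    close enough to a dictionary term.

    cutoff is the minimum difflib similarity ratio (0..1); min_len skips
    short tokens, which tend to match by accident. Both values are guesses
    that would need tuning on real data.
    """
    tags = []
    # Keep digits inside tokens, since OCR often swaps letters for digits.
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if len(token) < min_len:
            continue
        matches = get_close_matches(
            token.lower(), list(_LOOKUP), n=1, cutoff=cutoff
        )
        if matches:
            term = _LOOKUP[matches[0]]
            tags.append((token, term, ENTITY_DICT[term]))
    return tags


if __name__ == "__main__":
    line = "Stabhylooocjs lve vit Salnomela can not lve on cober surfcs"
    for ocr_token, term, label in tag_entities(line):
        print(ocr_token, "->", term, label)
```

The misspelled token is kept in the output next to the canonical term, since I care about the tag rather than the correction. A single similarity threshold is a blunt tool: very badly garbled tokens may fall below it, and lowering it increases the risk of ordinary words being tagged by mistake.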

Where should I start? Which model or package do you recommend for this task, if one exists? Are there any OCR models that I can force to use a certain dictionary, or should I rather do this as a post-processing step? How should I train the NLP/NLU module?

Topic elastic-search ocr named-entity-recognition nlp

Category Data Science
