Named entity recognition on misspelled words produced by OCR

I need to do entity recognition on a set of text data. There are two important aspects here:

  1. The text data is produced by OCR, which in fact yields tons of misspelled words. For example,

it produces

  • Stabhylooocjs lve vit Salnomela can not lve on cober surfcs
  • chikens gut i ful of Strebt0cus but not if hey get fd wih Aectat
  • Nucopactirun is he seond bet berklorabe producer

instead of

  • Staphylococcus live with Salmonella can not live on copper surfaces
  • Chickens gut is full of Streptococcus but not if they get fed with Acetate
  • Mycobacterium is the second best Perchlorate producer
  2. When it comes to entity recognition, these are specific entities (species, molecules, bacteria, ...), meaning I would like the NLU/NLP model to tag the bolded words into these categories.

As you can imagine, correcting the misspelled words is not that important to me; what is important is to tag those highlighted words with those categories. But since I have a dictionary of those words, I thought I could use it.
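To make the dictionary idea concrete, here is a minimal sketch of fuzzy dictionary lookup as a post-OCR step. It only uses `difflib` from the Python standard library. The dictionary contents, the category labels, the function name `tag_entities`, and the `cutoff` value of 0.6 are all assumptions for illustration, not a recommended configuration.

```python
import difflib

# Hypothetical dictionary: canonical entity name -> category (illustrative only).
ENTITY_DICT = {
    "Staphylococcus": "bacteria",
    "Salmonella": "bacteria",
    "Streptococcus": "bacteria",
    "Mycobacterium": "bacteria",
    "Acetate": "molecule",
    "Perchlorate": "molecule",
}

# Lowercased lookup so that matching ignores case.
_LOOKUP = {name.lower(): name for name in ENTITY_DICT}


def tag_entities(text, cutoff=0.6):
    """Return (token, canonical_name, category) for tokens that fuzzily match the dictionary."""
    tags = []
    for token in text.split():
        word = token.strip(".,;:()").lower()
        # Best dictionary candidate whose similarity ratio is at least `cutoff`, if any.
        match = difflib.get_close_matches(word, list(_LOOKUP), n=1, cutoff=cutoff)
        if match:
            name = _LOOKUP[match[0]]
            tags.append((token, name, ENTITY_DICT[name]))
    return tags


if __name__ == "__main__":
    line = "Stabhylooocjs lve vit Salnomela can not lve on cober surfcs"
    for token, name, category in tag_entities(line):
        print(token, "->", name, "(" + category + ")")
```

The cutoff is a trade-off: a lower value catches more heavily garbled tokens (something like "Nucopactirun" may need it) but also starts tagging ordinary words, so it would have to be tuned against real OCR output.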

Where should I start? Which model or package do you recommend for this task, if one exists? Is there an OCR model that I can force to use a certain dictionary, or should I rather do it as a post-analysis step? How should I train the NLP/NLU module?

Topic elastic-search ocr named-entity-recognition nlp

Category Data Science
