Named entity recognition on misspelled words produced by OCR
I need to do entity recognition on a set of text data. There are two important aspects here:
- The text data is produced by OCR, which in fact yields tons of misspelled words. For example,
it produces
- Stabhylooocjs lve vit Salnomela can not lve on cober surfcs
- chikens gut i ful of Strebt0cus but not if hey get fd wih Aectat
- Nucopactirun is he seond bet berklorabe producer
instead of
- Staphylococcus live with Salmonella can not live on copper surfaces
- Chickens gut is full of Streptococcus but not if they get fed with Acetate
- Mycobacterium is the second best Perchlorate producer
- When it comes to entity recognition, these are domain-specific entities (species, molecules, bacteria, ...), meaning I would like the NLU/NLP model to tag the bolded words into these categories.
As you can imagine, correcting the misspelled words is not that important to me; what matters is tagging those highlighted words with the right categories. But since I have a dictionary of those words, I thought I could use it.
Where should I start? Which model or package do you recommend for this task, if one exists? Are there any OCR models that I can force to use a certain dictionary, or should I rather do this as a post-processing step? How should I train the NLP/NLU module?
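To make the dictionary idea concrete, here is a minimal sketch of the post-processing variant: each OCR token is fuzzy-matched against the dictionary and tagged with the category of the closest entry. It uses only Python's standard `difflib`. The dictionary contents, the category labels, the helper name `tag_entities`, the minimum token length, and the similarity cutoff are all assumptions made for illustration, not a recommended configuration.

```python
import re
from difflib import get_close_matches

# Hypothetical entity dictionary: canonical (lowercase) name -> category.
# The entries and category labels are assumptions for this sketch.
ENTITY_DICT = {
    "staphylococcus": "bacteria",
    "salmonella": "bacteria",
    "streptococcus": "bacteria",
    "mycobacterium": "bacteria",
    "acetate": "molecule",
    "perchlorate": "molecule",
}

MIN_TOKEN_LEN = 4  # assumed: skip very short tokens to limit false positives
CUTOFF = 0.7       # assumed: difflib similarity threshold in [0, 1]


def tag_entities(text, cutoff=CUTOFF):
    """Return (ocr_token, dictionary_entry, category) for fuzzy matches."""
    tags = []
    # Keep digits inside tokens, since OCR often swaps letters for digits.
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if len(token) < MIN_TOKEN_LEN:
            continue
        # Best single dictionary entry whose similarity is >= cutoff, if any.
        match = get_close_matches(token.lower(), list(ENTITY_DICT), n=1, cutoff=cutoff)
        if match:
            tags.append((token, match[0], ENTITY_DICT[match[0]]))
    return tags


if __name__ == "__main__":
    line = "Stabhylooocjs lve vit Salnomela can not lve on cober surfcs"
    for ocr_token, entry, category in tag_entities(line):
        print(ocr_token, "->", entry, f"({category})")
```

A fixed cutoff is a trade-off: heavily garbled tokens such as "Nucopactirun" may fall below it and go untagged, while lowering it risks tagging ordinary words. A sketch like this could at most serve as a baseline or as a way to produce noisy training labels for a proper NER model.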
Topic elastic-search ocr named-entity-recognition nlp
Category Data Science