Best way to vectorise names and addresses for similarity searching?

I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to get the data, it is highly likely that a person is in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified. My initial thought on an approach is to vectorise each name and address as a …
Category: Data Science
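
To make the vectorising idea in this question concrete, here is a minimal sketch, assuming character trigram counts as the vector and cosine similarity as the confidence score. The asker's own approach is cut off in the excerpt, so this is only one possible reading of it. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Turn a record string into a sparse vector of character n-gram counts."""
    # Lower-case and collapse whitespace so trivial formatting differences vanish.
    normalised = " ".join(text.lower().split())
    # Pad with spaces so the first and last characters also form n-grams.
    padded = f" {normalised} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def record_similarity(first, second, n=3):
    """Score in [0, 1]; higher means the two records look more alike."""
    return cosine_similarity(char_ngrams(first, n), char_ngrams(second, n))


# Hypothetical records, invented for illustration only.
original = "John Smith, 12 High Street, Leeds"
variant = "Jon Smith, 12 High St, Leeds"
unrelated = "Mary Jones, 4 Park Road, York"

print(record_similarity(original, variant))
print(record_similarity(original, unrelated))
```

The sketch covers only the scoring step. Comparing every pair among 9 million records this way would not be practical, so some way of narrowing down the candidate pairs would be needed first.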

Named entity recognition on misspelled words produced by OCR

I need to do entity recognition on a set of text data. There are two important aspects here: the text data is produced by OCR, which in fact has tons of misspelled words. For example, it produces: Stabhylooocjs lve vit Salnomela can not lve on cober surfcs chikens gut i ful of Strebt0cus but not if hey get fd wih Aectat Nucopactirun is he seond bet berklorabe producer. The intended text is: Staphylococcus live with Salmonella can not live on copper surfaces Chickens …
Category: Data Science
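
One simple way to approach the OCR problem in this question is to match each noisy token against a list of known entity names. The sketch below uses `difflib.get_close_matches` from the Python standard library. It is a dictionary lookup with fuzzy matching, not a trained entity recognition model, and it assumes such a list of names is available. The `KNOWN_TERMS` list, the function names, and the `0.6` cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical vocabulary of entity names; a real one would be much larger.
KNOWN_TERMS = ["staphylococcus", "salmonella", "streptococcus"]


def correct_token(token, vocabulary=KNOWN_TERMS, cutoff=0.6):
    """Return the closest known term for a noisy token, or None if nothing is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def find_entities(text, vocabulary=KNOWN_TERMS, cutoff=0.6):
    """Return (noisy token, matched term) pairs for tokens that resemble a known term."""
    found = []
    for token in text.split():
        match = correct_token(token, vocabulary, cutoff)
        if match is not None:
            found.append((token, match))
    return found


# Tokens taken from the OCR sample quoted in the question.
print(find_entities("Stabhylooocjs lve vit Salnomela"))
```

A lower cutoff catches more badly damaged words but also produces more false matches, so the value would need tuning on real OCR output.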
