Best way to vectorise names and addresses for similarity searching?
I have a large dataset of around 9 million people with names and addresses. Because of quirks in the process used to collect the data, it is highly likely that a person appears in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified. My initial thought on an approach is to vectorise each name and address as a …
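To make the idea concrete, here is a minimal sketch of one possible vectorisation. It is an assumption for illustration, not necessarily the representation the question goes on to describe: each record is turned into a bag of character trigrams, and two records are compared by cosine similarity, which can act as a rough confidence score between 0 and 1. The function names (`char_ngrams`, `cosine`, `record_similarity`) and the sample records are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of character n-grams for a normalised string."""
    # Lowercase and collapse whitespace so trivial formatting differences vanish.
    cleaned = " ".join(text.lower().split())
    # Pad with spaces so the start and end of the string form their own n-grams.
    padded = f" {cleaned} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def record_similarity(rec_a, rec_b):
    """Similarity score for two 'name, address' strings."""
    return cosine(char_ngrams(rec_a), char_ngrams(rec_b))


# Made-up records for illustration only (not from the real dataset).
records = [
    "John Smith, 12 High Street, Leeds",
    "Jon Smith, 12 High St, Leeds",
    "Mary Jones, 4 Park Road, York",
]

query = records[0]
for other in records[1:]:
    print(f"{record_similarity(query, other):.2f}  {other}")
```

Comparing every pair this way would not scale to 9 million records, so in practice a sketch like this would probably need to be combined with some way of narrowing down candidate pairs first (for example, only comparing records that share a postcode), which is outside what this snippet shows.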
Category: Data Science