Best way to vectorise names and addresses for similarity searching?
I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to get the data, it is highly likely that a person appears in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified.
My initial thought on an approach is to vectorise each name and address as a concatenated string using word embeddings, load them all into Elasticsearch, and then use its KNN search functionality to 'cluster' similar records, using the Euclidean distance between each point in the cluster as a similarity metric.
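Something like this sketch is what I had in mind (the index name, field names and the all-MiniLM-L6-v2 model are just placeholders; assumes the elasticsearch 8.x Python client and the sentence-transformers package):

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

# dense_vector mapping so Elasticsearch can do approximate KNN,
# with Euclidean (l2_norm) as the similarity
es.indices.create(
    index="people",
    mappings={
        "properties": {
            "record": {"type": "text"},
            "vec": {"type": "dense_vector", "dims": 384,
                    "index": True, "similarity": "l2_norm"},
        }
    },
)

record = "John Smith, 42 Acacia Avenue, London"
es.index(index="people", document={
    "record": record,
    "vec": model.encode(record).tolist(),
})
es.indices.refresh(index="people")

# Nearest neighbours of a query record = candidate duplicate personas
hits = es.search(index="people", knn={
    "field": "vec",
    "query_vector": model.encode("Jon Smith, 42 Acacia Ave, London").tolist(),
    "k": 10,
    "num_candidates": 100,
})["hits"]["hits"]
```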
Now that I think about it, I don't think this would work: word embeddings pick up on semantic relationships, and names and addresses are by definition semantically neutral. There are other vectorising approaches like bag-of-words, n-grams and TF-IDF, but these produce lots of high-dimensional sparse vectors that won't work well with KNN. Besides, Elasticsearch already scores matches with TF-IDF-style relevance (BM25) out of the box, so why mess about with vectors at all?
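For what it's worth, this is the kind of character n-gram TF-IDF similarity I mean — a minimal scikit-learn sketch with brute-force cosine KNN, which would only be feasible on a sample rather than all 9 million records (the example strings are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

records = [
    "John Smith, 42 Acacia Avenue, London",
    "Jon Smyth, 42 Acacia Ave, London",
    "Jane Doe, 7 High Street, Leeds",
]

# char_wb n-grams tolerate typos and abbreviations better than words
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(records)  # sparse matrix, n_records x n_ngrams

nn = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)
dist, idx = nn.kneighbors(X[0], n_neighbors=2)
print(idx, 1 - dist)  # nearest records and their cosine similarities
```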
My questions are:
- Does this approach sound overly engineered?
- If not, are there vectorising approaches that would work better, such as hashing (see the sketch after this list)?
- If it does sound over-engineered, am I at least on the right lines for a valid approach?
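By hashing I mean something like MinHash plus LSH over character 3-grams, which sidesteps dense vectors entirely — a rough sketch using the datasketch package (the threshold and record keys here are arbitrary):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 2):        # character 3-grams
        m.update(text[i:i + 3].encode("utf8"))
    return m

records = {
    "r1": "john smith, 42 acacia avenue, london",
    "r2": "jon smith, 42 acacia ave, london",
    "r3": "jane doe, 7 high street, leeds",
}

# LSH buckets records so near-duplicates land together without
# pairwise comparison of all 9M records
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in records.items():
    lsh.insert(key, minhash(text))

# Candidate duplicates for r1; estimated Jaccard similarity could
# serve as the confidence metric
query = minhash(records["r1"])
for key in lsh.query(query):
    print(key, query.jaccard(minhash(records[key])))
```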
This is more of a sounding-board post, but any opinions would be really helpful. Thanks!
Topic elastic-search k-nn search-engine word-embeddings nlp
Category Data Science