Best way to vectorise names and addresses for similarity searching?

I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to get the data it is highly likely that a person is in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas with some sort of confidence metric for the alternative records identified.

My initial thought on an approach is to vectorise each name and address as a concatenated string using word embeddings, load them all into Elasticsearch, and then use the KNN search functionality to 'cluster' similar records, using the Euclidean distance between each point in the cluster as a similarity metric.

Now that I think about it, I don't think this would work, as word embeddings pick up on semantic relationships, and names and addresses are by definition semantically neutral. There are other vectorising approaches like bag-of-words, n-grams and TF-IDF, but these will produce lots of high-dimensional sparse vectors that won't work well with KNN, and Elasticsearch uses TF-IDF to search out of the box, so why mess about with vectors at all?

My questions are:

  1. Does this approach sound overly engineered?
  2. If not, are there vectorising approaches that would work better (such as hashing)?
  3. If yes to the above, am I at least on the right lines for a valid approach?

This is more of a sounding-board post, but any opinions would be really helpful. Thanks!

Topic elastic-search k-nn search-engine word-embeddings nlp

Category Data Science


I think you are right when you say that word embeddings etc. will likely deliver "bad" results, since names and addresses are highly individual.

With your type of problem you will probably need to compare each record against every other ($1:n$), which is expensive. Since each pair of entries only needs to be compared once, you can reduce the effort by using itertools.combinations(lst, 2) to generate unique pairs. However, with 9 million entries this still gives roughly 40,500,000,000,000 pairs to check.
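As a quick illustration of the scale (the record strings below are just placeholders), the pair count and lazy pair generation look like this:

from itertools import combinations
from math import comb

# Number of unique unordered pairs among 9 million records (self-pairs excluded)
print(comb(9_000_000, 2))  # 40_499_995_500_000, i.e. roughly 4.05e13

# For a manageable subset, pairs can be generated lazily without materialising them all
records = ["Mister Example, 10 Liversidge St", "Mr Example, Liversidge St 10", "Ms Catlady, 11 Liversidge St"]
for left, right in combinations(records, 2):
    print(left, "<->", right)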

In case you don't want to use string distances, one possible solution could be to compare a single vectorized entry to all others. This could be done with a count vectorizer using $n$-grams at the character level. By doing so you may be able to match "similar" names, addresses, etc. Redundant text does not seem to be a big issue here.

from sklearn.feature_extraction.text import CountVectorizer

true = ["Mister Example 10 Liversidge St Acton ACT 2601 Australia"] 
alt1 = ["Mr Example Liversidge St 10 2601 Acton ACT Australia"]
alt2 = ["Mister Example 10 Liversidge St Acton ACT 2601 Australia and some redundant text here"]
alt3 = ["Ms Catlady 11 Liversidge St Acton ACT 2601 Australia"]
alt4 = ["Mr Entirely Different Other Street 123 9031 Anywhere USA"]

vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vec.fit(true)  # fit only on the reference, so the vocabulary contains only its character trigrams

print(vec.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names()

print(vec.transform(true).todense().sum()) # "truth" is the reference here
print(vec.transform(alt1).todense().sum())
print(vec.transform(alt2).todense().sum())
print(vec.transform(alt3).todense().sum())
print(vec.transform(alt4).todense().sum())
# The single arguments true, alt1 etc. can also be packed into one list and transformed in a single step

This will give:

48 # truth as reference value
42 # minor variations to truth
48 # truth with redundant text
33 # similar address
3  # different address

Your initial thought to use embeddings is the right way to go. Token-level similarity metrics such as edit distance, TF-IDF etc. will only take you so far. You will end up chasing variations in spelling, whether they are the acceptable kind (Road vs RD.) or just misspellings (Mississippi vs Missippi). You can try to build dictionaries, use lemmatizers etc., but it will be quite the battle just to get them all correct enough.

Try using SentenceBert (https://www.sbert.net/). It can use different models, all trained on different datasets and with differing vector sizes. So some experimentation will be needed.
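A minimal sketch of how such a comparison could be run with the sentence-transformers package; the model name below is just one common pretrained choice, not necessarily the one used for the numbers that follow:

from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is only an example model; try a few and compare
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "1 First Street, NE Washington, DC 20543",
    "1600 Pennsylvania Avenue NW, Washington, DC 20500",
    "The White House",
    "20 W 34th St, New York, NY 10001",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings, embeddings))  # 4x4 matrix of pairwise cosine similarities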

Example

For example, I ran a small sample that compared embeddings for the following "sentences"

Address                                             Landmark
1 First Street, NE Washington, DC 20543             US Supreme Court
1600 Pennsylvania Avenue NW, Washington, DC 20500   The White House (address)
The White House                                     The White House (name)
20 W 34th St, New York, NY 10001                    Empire State Building

and saw the following Cosine Similarities

            SCOTUS     WH (addr)   WH         ESB
SCOTUS      100.00%    83.06%      42.30%     68.60%
WH (addr)   83.06%     100.00%     43.95%     69.73%
WH          42.30%     43.95%      100.00%    36.20%
ESB         68.60%     69.73%      36.20%     100.00%

Results

As you can see, the name "The White House" and its address are a 100% match. No token-based approach will give us this level of match. We also have a good match between the address for the Supreme Court and the WH address, but lower on the name.

Elasticsearch, KNN, etc.

YMMV with Elasticsearch, depending on what you are doing with the results; the same goes for KNN, clustering, etc. Clustering will bring in a completely different dimension, where you will have to understand the meaning of each cluster before you can use the outcome.


The problem you are describing is commonly called record linkage, in particular probabilistic record linkage.

Clustering the embeddings would work if the different personas for the same entity frequently co-occur. Each item would have to be tagged with metadata so that clustering only returns the same type of information (e.g., only person names).
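As a rough sketch of the record-linkage idea without a dedicated library, one could block candidate pairs on a cheap key (postcode here) and score only the pairs within each block; the field names, the scoring function and the threshold below are purely illustrative:

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Mister Example", "street": "10 Liversidge St", "postcode": "2601"},
    {"id": 2, "name": "Mr Example", "street": "Liversidge St 10", "postcode": "2601"},
    {"id": 3, "name": "Ms Catlady", "street": "11 Liversidge St", "postcode": "2601"},
    {"id": 4, "name": "Mr Different", "street": "Other Street 123", "postcode": "9031"},
]

# Blocking: only records sharing a postcode are compared, avoiding the full n*(n-1)/2 comparisons
blocks = defaultdict(list)
for rec in records:
    blocks[rec["postcode"]].append(rec)

def score(a, b):
    # Crude match score: average string similarity of name and street
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    street_sim = SequenceMatcher(None, a["street"], b["street"]).ratio()
    return (name_sim + street_sim) / 2

for block in blocks.values():
    for a, b in combinations(block, 2):
        s = score(a, b)
        if s > 0.6:  # illustrative threshold, not tuned
            print(a["id"], b["id"], round(s, 2))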
