Which string distance equation for fuzzy-matching person names is reliable?

Question

Canovice

May 17, 2022, 12:29

A reproducible example with a small amount of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzyjoin library in R supports the following string distance methods: c(osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, soundex). Our use case is matching (left-joining) basketball player names from two different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names:

The left join shouldn't get mixed up by long / short names. Michael Gadson is clearly Mike Gadson, not one of the other Mike names in the dataset with a different last name.
The left join shouldn't get mixed up by reversed names. Ricky Smith is Rick Smith; he is not Smith Rickie.
The left join shouldn't get mixed up by suffixes such as III or Jr., or by extra spaces or symbols (e.g. De Andre' vs DeAndre).
Certain players (e.g. Johnny Williams) in the left-hand-side dataframe have no match in the right-side table. To catch this, we'll need to rely on a properly selected max_dist value.
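
To make the third concern concrete, here is a minimal sketch of the kind of pre-cleaning that could be applied to both name columns before any distance is computed. The helper name normalize_name and the list of suffixes are assumptions for illustration, not part of the linked code, and it uses base R only.

```r
# Hypothetical helper: normalize a person name before string matching.
normalize_name <- function(x) {
  x <- tolower(x)
  # Drop a trailing generational suffix (assumed list: jr, sr, ii, iii, iv).
  x <- gsub("[ ,]+(jr|sr|ii|iii|iv)\\.?$", "", x)
  # Remove apostrophes, periods and any other non-letter symbols.
  x <- gsub("[^a-z ]", "", x)
  # Collapse runs of whitespace and trim the ends.
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- normalize_name(c("De Andre' Jones III", "DeAndre Jones", "Rick Smith Jr."))
print(cleaned)
# "de andre jones" "deandre jones" "rick smith"
```

Note that "De Andre" and "DeAndre" still differ by one space after cleaning, so this step only reduces the distance between such names; the chosen method and max_dist still have to absorb the rest.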

A fifth concern is avoiding duplicates in the result (we want only one row for each person in the left-hand-side dataframe); however, this is already handled by the group_by(fullName) %>% filter(dist == min(dist) | is.na(dist)) step in the code.
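
For context, the join and de-duplication step can be sketched roughly as follows. This is not the exact code from the linked post: the toy data, the right-hand column names (name, player_id), the choice of method = "jw", and max_dist = 0.25 are all placeholder assumptions, and picking good values for the last two is exactly what the question is about.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy data; only fullName is a column name taken from the post.
left <- tibble(fullName = c("Michael Gadson", "Ricky Smith", "Johnny Williams"))
right <- tibble(
  name = c("Mike Gadson", "Mike Smith", "Rick Smith", "Smith Rickie"),
  player_id = 1:4
)

matched <- left %>%
  # Fuzzy left join; unmatched left rows are kept with dist = NA.
  stringdist_left_join(right, by = c("fullName" = "name"),
                       method = "jw",     # placeholder method
                       max_dist = 0.25,   # placeholder threshold
                       distance_col = "dist") %>%
  # Keep the closest candidate per left-hand name, or the NA row if none matched.
  group_by(fullName) %>%
  filter(dist == min(dist) | is.na(dist)) %>%
  ungroup()

print(matched)
```

One caveat with this filter: if two candidates tie on the minimum distance, both rows are kept, so it does not strictly guarantee one row per person.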

Our question is then: given these concerns, what is a good method and max distance to use for this left join?

Topic jaccard-coefficient similarity r

Category Data Science
