Which string distance equation for fuzzy-matching person names is reliable?
A reproducible example with a short bit of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The `fuzzyjoin` library in R offers the following string distance methods: `c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")`. Our use case is matching (left-joining) basketball player names from two different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names:
- The left join shouldn't get mixed up by long / short names. `Michael Gadson` is clearly `Mike Gadson`, not one of the other Mike names in the dataset with a different last name.
- The left join shouldn't get mixed up by reversed names. `Ricky Smith` is `Rick Smith`; he is not `Smith Rickie`.
- The left join shouldn't get mixed up by `III`, `Jr.`, etc. suffixes on names, or by extra spaces or symbols (e.g. `De Andre'` vs `DeAndre`).
- Certain players (e.g. `Johnny Williams`) in the left-hand-side dataframe have no match in the right-side table. To catch this, we'll need to rely on a properly selected `max_dist` value.
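To make the suffix / spacing / symbol concern concrete, here is a minimal base-R sketch of cleaning names before any distance is computed. `normalize_name` is a hypothetical helper (it is not part of fuzzyjoin or stringdist), the suffix list is an assumption, and base R's `adist` (generalized Levenshtein distance) stands in for a Levenshtein-type method such as `lv`.

```r
# Hypothetical helper: clean a name before computing string distances.
normalize_name <- function(x) {
  x <- trimws(tolower(x))
  # Drop a trailing generational suffix (assumed list: III, II, IV, Jr, Sr).
  x <- gsub("[ ,]+(iii|ii|iv|jr|sr)\\.?$", "", x)
  # Drop anything that is not a lowercase letter or a space (apostrophes, periods, ...).
  x <- gsub("[^a-z ]", "", x)
  # Collapse repeated spaces and trim the ends.
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Edit distance on the raw strings vs. on the cleaned strings.
d_raw  <- drop(adist("De Andre'", "DeAndre"))
d_norm <- drop(adist(normalize_name("De Andre'"), normalize_name("DeAndre")))

# Short vs. long first name, and a reversed name, after cleaning.
d_nick <- drop(adist(normalize_name("Michael Gadson"), normalize_name("Mike Gadson")))
d_rev  <- drop(adist(normalize_name("Ricky Smith"), normalize_name("Smith Rickie")))
```

Cleaning shrinks the `De Andre'` / `DeAndre` gap, but `Michael Gadson` and `Mike Gadson` are still 4 edits apart, so a pure edit distance needs a fairly loose `max_dist` to catch nickname pairs.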
A fifth concern is avoiding duplicates (we want only one row for each person in the left-hand-side dataframe); however, this is already handled by `group_by(fullName) %>% filter(dist == min(dist) | is.na(dist))` in the code.
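For reference, the "closest candidate only, or `NA` if nothing is close enough" logic can be sketched in base R as follows. This is not the code from the linked post: the two tables are made-up examples, `max_dist <- 4` is an assumed cutoff rather than a recommendation, and `adist` (Levenshtein) is used in place of fuzzyjoin's methods.

```r
# Made-up example tables; the column name fullName mirrors the pipeline above.
left  <- data.frame(fullName = c("Ricky Smith", "Michael Gadson", "Johnny Williams"),
                    stringsAsFactors = FALSE)
right <- data.frame(fullName = c("Rick Smith", "Smith Rickie", "Mike Gadson", "Mike Green"),
                    stringsAsFactors = FALSE)

max_dist <- 4  # assumed cutoff; would need tuning on the real data

# Distance matrix: one row per left-hand name, one column per right-hand name.
d <- adist(tolower(left$fullName), tolower(right$fullName))

# For each left-hand name, the index and distance of the closest right-hand name.
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Keep the closest candidate only if it is within max_dist, otherwise NA.
left$match <- ifelse(best_dist <= max_dist, right$fullName[best], NA)
left$dist  <- ifelse(best_dist <= max_dist, best_dist, NA)
```

Because `which.min` returns a single index, each left-hand name ends up with exactly one row; ties are broken by taking the first candidate, which differs from the `filter` above, where tied minima would all be kept.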
Our question is then: given these concerns, what is a good method and max distance to use for this left join?