Which string distance equation for fuzzy-matching person names is reliable?
A reproducible example with a small amount of R code is available in this Stack Overflow post (linked so I don't need to retype the code). The fuzzytext library in R offers the following string distance methods: `c(osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, soundex)`. Our use case is matching (left-joining) basketball player names from two different sources. From the Stack Overflow post, we have the following concerns to account for when string-matching names:
- The left join shouldn't get mixed up by long / short names. `Michael Gadson` is clearly `Mike Gadson`, not one of the other Mike names in the dataset with a different last name.
- The left join shouldn't get mixed up by reversed names. `Ricky Smith` is `Rick Smith`, he is not `Smith Rickie`.
- The left join shouldn't get mixed up by `III`, `Jr.`, etc. suffixes to names, or by extra spaces or symbols (e.g. `De Andre'` vs `DeAndre`).
- Certain players (e.g. `Johnny Williams`) in the left-hand-side dataframe have no match in the right-side table. To catch this, we'll need to rely on a properly selected `max_dist` value.
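To make the suffix, spacing, and symbol concern concrete, here is a rough sketch of a cleaning step that could run on both name columns before any distance is computed. `normalize_name` is a hypothetical helper, not something from the linked post, and the suffix list is an assumption that would need extending for real data.

```r
# Hypothetical helper (not from the linked post): clean a name before matching.
normalize_name <- function(x) {
  x <- tolower(x)
  # Drop generational suffixes such as "Jr.", "Sr.", "II", "III", "IV".
  x <- gsub("\\b(jr|sr|ii|iii|iv)\\b\\.?", "", x, perl = TRUE)
  # Remove apostrophes, periods, and any other character that is not a-z or a space.
  # Caveat: this also drops accented letters, so transliterate first if names have them.
  x <- gsub("[^a-z ]", "", x)
  # Collapse runs of spaces and trim both ends.
  trimws(gsub(" +", " ", x))
}

normalize_name(c("De Andre'", "Mike  Gadson III", "Rick Smith Jr."))
```

After this step `De Andre'` and `DeAndre` still differ by one space, so the distance method has to absorb that remaining difference. Word order is deliberately left alone, because `Smith Rickie` should stay distinguishable from `Ricky Smith`.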
A fifth concern is avoiding duplicates in the output (we want only one row for each person in the left-hand-side dataframe); however, this is handled by the `group_by(fullName) %>% filter(dist == min(dist) | is.na(dist))` step in the code.
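For readers without the linked code at hand, a minimal sketch of that de-duplication step on a toy join result follows. The `candidate` column, the extra name `Mike Gordon`, and the `dist` values are made up for illustration; only `fullName` and `dist` are names taken from the snippet above.

```r
library(dplyr)

# Toy stand-in for the output of a fuzzy left join (distances are illustrative,
# not computed by any particular method).
joined <- data.frame(
  fullName  = c("Michael Gadson", "Michael Gadson", "Ricky Smith", "Johnny Williams"),
  candidate = c("Mike Gadson", "Mike Gordon", "Rick Smith", NA),
  dist      = c(0.08, 0.21, 0.03, NA)
)

# Keep the closest candidate per left-hand name; unmatched rows (NA dist) survive.
deduped <- joined %>%
  group_by(fullName) %>%
  filter(dist == min(dist) | is.na(dist)) %>%
  ungroup()

print(deduped)
```

Note that this filter keeps every row that ties for the minimum distance, so exact ties would still produce more than one row per name.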
Our question is then: given these concerns, what is a good method and max distance to use for this left join?
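One way to explore this empirically is to print the full distance matrix for a handful of tricky names under several methods and look for a threshold that separates true matches from near misses. The sketch below assumes the `stringdist` package, whose `method` names match the list above. The name `mike gordon`, the choice of methods, and the settings `q = 2` and `p = 0.1` are assumptions for illustration, not recommendations.

```r
library(stringdist)

# Lowercased example names; "mike gordon" is a hypothetical distractor.
left  <- c("michael gadson", "ricky smith", "johnny williams")
right <- c("mike gadson", "mike gordon", "rick smith", "smith rickie")

for (m in c("osa", "lv", "jw", "jaccard")) {
  # q is only used by the q-gram based methods, p only by Jaro-Winkler.
  d <- stringdistmatrix(left, right, method = m, q = 2, p = 0.1)
  dimnames(d) <- list(left, right)
  cat("\nmethod:", m, "\n")
  print(round(d, 3))
}
```

A usable `max_dist` would have to sit above the distances of the true pairs and below every distance in the `johnny williams` row, which has no true match. Edit-based methods return raw edit counts while `jw` and `jaccard` return values between 0 and 1, so the threshold has to be chosen per method.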