Similar values cleaning

Question

miro_muras

December 3, 2020 10:31

Does anyone know an algorithm for identifying account names that are similar enough to potentially be merged and imported as one?

Duplicates with different values:

Geico  val1  NaN
Geico  NaN   val2
=====
Geico  val1  val2

Similar or almost exact: Geico, Gaico

Topic data-wrangling data pandas

Category Data Science

n1k31t4 · Accepted Answer · December 3, 2020 10:31

You specifically talk about account names, and so I assume they can be treated as strings.

One way to compare closeness of strings is the Levenshtein distance, defined as:

the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

It just so happens there is a nice Python library that implements this kind of fuzzy matching: fuzzywuzzy. Its homepage has some usage examples.
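To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming approach. The function name `levenshtein` is just an illustrative choice, and this is not how any particular library implements it.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty string) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Geico", "Gaico"))  # 1: a single substitution, e -> a
```

A distance of 1 on a five-letter name is a strong hint that "Gaico" is a typo of "Geico". Note that the raw distance is not normalised by string length.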

Ideas for processing the data

In your case, if you know the correct account names, you could compute the similarity of just those correct ones to each of the actual entries, and use a threshold value to turn all close-matches into the correct account name.
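A rough sketch of that first idea follows. It uses `difflib.get_close_matches` from the Python standard library as a stand-in for a Levenshtein-based score, so the numbers will differ from what fuzzywuzzy reports. The helper name `normalize_names`, the sample data, and the cutoff of 0.75 are all assumptions for illustration.

```python
from difflib import get_close_matches


def normalize_names(entries, known_names, cutoff=0.75):
    """Map each entry to its closest known name; keep it unchanged if none is close enough."""
    # Compare case-insensitively, but return the known name with its original spelling.
    lowered = {name.lower(): name for name in known_names}
    cleaned = []
    for entry in entries:
        matches = get_close_matches(entry.lower(), list(lowered), n=1, cutoff=cutoff)
        cleaned.append(lowered[matches[0]] if matches else entry)
    return cleaned


known = ["Geico", "Allstate"]                      # hypothetical list of correct names
raw = ["Geico", "Gaico", "geico", "Allstat", "Progressive"]
print(normalize_names(raw, known))
```

Entries with no sufficiently close match, such as "Progressive" here, are left as they are so they can be reviewed by hand.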

Alternatively, you could compute pairwise similarities, pair up the names with the highest scores, and reduce each pair to a single name. Repeat this until no pair of names has a similarity above a given threshold.
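The iterative pairwise approach could be sketched like this. The answer does not say which name of a pair should survive, so this sketch assumes the more frequent spelling wins. It again uses `difflib.SequenceMatcher` from the standard library as the similarity measure, and `merge_similar` and the threshold of 0.75 are illustrative choices, not recommendations.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar(names, threshold=0.75):
    """Return a dict mapping each distinct name to the name it ends up merged into."""
    counts = Counter(names)
    mapping = {name: name for name in counts}
    while True:
        current = sorted(set(mapping.values()))
        # Find the most similar pair among the names that are still distinct.
        best = max(combinations(current, 2),
                   key=lambda pair: similarity(*pair),
                   default=None)
        if best is None or similarity(*best) < threshold:
            return mapping
        # Assumption: keep the spelling that occurs more often in the data.
        keep, drop = sorted(best, key=lambda name: counts[name], reverse=True)
        counts[keep] += counts[drop]
        for original, target in mapping.items():
            if target == drop:
                mapping[original] = keep


names = ["Geico", "Geico", "Gaico", "Allstate", "Allstat", "Allstate", "Progressive"]
print(merge_similar(names))
```

Each pass compares every remaining pair, so this is only practical for a modest number of distinct names. For larger data sets some form of blocking, for example only comparing names that share a first letter, would be needed.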

For the thresholds, in either case, you would probably have to pick a heuristic value.
