Similar values cleaning

Does anyone know an algorithm to identify account names that are similar enough to potentially be merged and imported as one?

Duplicates with different values:

Geico  val1  NaN   =====  Geico  val1  val2
Geico  NaN   val2

Similar or almost exact names: Geico, Gaico

Topic data-wrangling data pandas

Category Data Science


You specifically talk about account names, and so I assume they can be treated as strings.

One way to compare closeness of strings is the Levenshtein distance, defined as:

the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
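To make the definition concrete, here is a minimal pure-Python sketch of the distance using the standard dynamic-programming approach. The function name `levenshtein` is just an illustrative choice, and a dedicated library will be much faster on large inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("Geico", "Gaico"))  # 1: substitute "e" with "a"
```

A distance of 1 on five-letter names is a strong hint that the two spellings refer to the same account.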

It just so happens there is a nice library that implements this kind of fuzzy matching: fuzzywuzzy (since renamed to thefuzz). They have some usage examples on the homepage.


Ideas for processing the data

In your case, if you know the correct account names, you could compute the similarity of just those correct ones to each of the actual entries, and use a threshold value to turn all close-matches into the correct account name.
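As a rough sketch of this first idea, the snippet below maps each entry to its closest known name when the similarity clears a threshold. To stay dependency-free it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure (not Levenshtein itself). The helper names `similarity` and `normalize_to_known`, the sample names, and the threshold of 0.75 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_to_known(entries, known_names, threshold=0.75):
    """Replace each entry by its closest known name if it is similar enough."""
    cleaned = []
    for entry in entries:
        best = max(known_names, key=lambda name: similarity(entry, name))
        # Entries below the (heuristic) threshold are left untouched.
        cleaned.append(best if similarity(entry, best) >= threshold else entry)
    return cleaned


known = ["Geico", "Allstate"]  # hypothetical list of correct names
raw = ["Gaico", "geico", "Allstate", "Progressive"]
print(normalize_to_known(raw, known))
```

Names with no close match, such as "Progressive" here, pass through unchanged, so they can be reviewed by hand afterwards.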

Alternatively, you could compute pairwise similarities and pair up the highest scores, reducing each pair to a single name. Iterate on this approach until you have no name-pairs with a similarity above a given threshold.
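A minimal sketch of this iterative approach might look as follows. It again uses `difflib.SequenceMatcher` as a stand-in similarity measure. The rule for deciding which name of a pair survives (the more frequent spelling) is an assumption, since that choice is left open above, and `merge_similar_names` and the threshold of 0.75 are likewise hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar_names(names, threshold=0.75):
    """Merge the most similar pair repeatedly; return {original: surviving name}."""
    counts = Counter(names)
    mapping = {name: name for name in counts}
    active = sorted(counts)  # distinct names still in play
    while len(active) > 1:
        score, first, second = max(
            (similarity(left, right), left, right)
            for left, right in combinations(active, 2)
        )
        if score < threshold:
            break  # no remaining pair is similar enough
        # Assumed rule: the more frequent spelling survives the merge.
        keep, drop = (first, second) if counts[first] >= counts[second] else (second, first)
        counts[keep] += counts[drop]
        active.remove(drop)
        for original, representative in mapping.items():
            if representative == drop:
                mapping[original] = keep
    return mapping


print(merge_similar_names(["Geico", "Gaico", "Geico", "Allstate"]))
```

Every pass recomputes all pairwise scores, so this is only practical for a modest number of distinct names. For larger data, caching the scores or grouping names first (for example by first letter) would be worth considering.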

In either case, you would probably have to choose the threshold heuristically.
