How to process word similarity and categorize a group of words to a single word

I am new to this area and have been searching for some time, only to find several different possible approaches but nothing concrete.

Suppose I have a word list such as email_addr, email, email_address, address, or a more dissimilar one such as first, first_name, firstName, christianName, christian_name, name. What would be the most suitable approach to map each of those lists to a single word, like email or givenName respectively?

I've seen some articles proposing Levenshtein distance, fuzzy matching, difference algorithms, and support vector machines, but I don't think any of them quite satisfies the requirement, unless I am missing something.

I would appreciate any links or directions for further research.

Essentially, the objective is to categorize all column names in a data set so I can map them to a method for each type of column to generate mock data.

Topic fuzzy-classification classification machine-learning

Category Data Science


Some ideas:

  1. "Cluster" the words in a single list to find the "closest" matches. This could be useful because in email_addr, email, email_address, address the word address could be seen as an "outlier". You can use affinity propagation to cluster the words if needed. However, I think this step is only needed if there is a lot of "variance" in the words.
  2. Once you have a reasonably clean list of words such as email_addr, email, email_address, you can compute the Levenshtein distance for every pair of words and pick the $n$ "closest" pairs. With three words (as above), keeping the single closest pair would likely yield: email_addr, email_address.
  3. Keep as the "truth" the part that the $n$ top matches have in common, which could be email_addr or simply email in this case.
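
Steps 2 and 3 can be sketched in plain Python as follows. This is only a minimal illustration under some assumptions: the helper names (`levenshtein`, `closest_pairs`, `common_part`, `label`) are made up for this example, and the "common part" is taken to be the longest common substring, which is just one possible reading of step 3.

```python
from difflib import SequenceMatcher
from functools import reduce
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_pairs(words, n=1):
    """Step 2: return the n word pairs with the smallest edit distance."""
    pairs = combinations(words, 2)
    return sorted(pairs, key=lambda pair: levenshtein(*pair))[:n]


def common_part(a: str, b: str) -> str:
    """Longest common substring of a and b (assumed meaning of 'common part')."""
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def label(words, n=1):
    """Step 3: keep what the n closest pairs share. Needs at least two words."""
    parts = [common_part(a, b) for a, b in closest_pairs(words, n)]
    return reduce(common_part, parts)


words = ["email_addr", "email", "email_address"]
print(closest_pairs(words, n=1))  # [('email_addr', 'email_address')]
print(label(words, n=1))          # email_addr
print(label(words, n=2))          # email
```

Note that edit distance only measures spelling similarity, so this sketch would not bring together names such as first_name and christian_name; grouping those needs something that captures meaning rather than characters.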

I have a similar problem at the moment and would appreciate any insights from your experience.
