Optimal clusters for K-means not clear - any ideas?

I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process, it is highly likely there are duplicate people, differing only by typos, and I am trying to cluster them using K-means. I know there are easier ways of doing this, but I am doing it this way out of curiosity.

In order to vectorize each person, I concatenate the strings as [name][address][birthday] and then run the result through the following function to tokenize and clean it:

import re
import nltk
from nltk.corpus import stopwords

def preprocess_text(text, remove_stopwords=True):
    # remove links
    text = re.sub(r"http\S+", "", text)
    # remove special chars and numbers
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # remove stopwords
    if remove_stopwords:
        # 1. tokenize
        tokens = nltk.word_tokenize(text)
        # 2. check if stopword
        tokens = [w for w in tokens if w.lower() not in stopwords.words("english")]
        # 3. join back together
        text = " ".join(tokens)
    # return text in lower case and stripped of whitespace
    text = text.lower().strip()
    return text
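
For context, a minimal sketch of how the corpus might be built with this function, assuming the records sit in a pandas DataFrame with hypothetical columns name, address and birthday:

import pandas as pd

# hypothetical toy records; the real dataset has 10,000 rows
df = pd.DataFrame({
    "name": ["John Smith", "Jon Smith"],
    "address": ["Belgrave Road Birmingham England", "Belgrave Road Birmingham England"],
    "birthday": ["1 Jan 1980", "1 Jan 1980"],
})

# concatenate [name][address][birthday] and clean each record
corpus = (df["name"] + " " + df["address"] + " " + df["birthday"]).apply(preprocess_text)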

This outputs to a dataframe containing the corpus, where each 'person' is a string that looks something like:

john smith belgrave road birmingham england

The entire corpus is then run through the TF-IDF vectorizer in sklearn.
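
For reference, the vectorization step is roughly the following (a sketch assuming the corpus built above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix with one row per person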

When I use the Elbow method and the Silhouette method to try to find the optimal cluster number K for K-means, I get two graphs, neither of which shows a clear value of K.
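
For concreteness, both diagnostics can be computed along these lines (a sketch assuming the TF-IDF matrix X from above; the range of K is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
for k in range(2, 30):  # illustrative range of candidate K values
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # plotted against k for the Elbow method
    silhouettes.append(silhouette_score(X, km.labels_))  # Silhouette method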

I've not really seen anything like this before. Does anyone have ideas for an optimal value of K based on these charts? Or do these charts show that K-means really isn't a good fit for this problem? Or perhaps a TF-IDF vectorizer isn't a good approach, and a Bag-of-Words vectorizer would be better, since names and addresses should be semantically neutral? Any insight would be really appreciated. Thank you!

Topic: tfidf, scikit-learn, nlp, k-means, clustering

Category: Data Science


First, if you are interested in identifying duplicate records via clustering, I would suggest using an agglomerative hierarchical algorithm (e.g. single-linkage). It will produce a complete dendrogram, and somewhere near the bottom of this dendrogram you will find your duplicate records.
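
As a rough sketch of that idea with scipy (assuming the TF-IDF matrix X from the question; scipy's linkage needs a dense array, so on 10,000 records you may want to work on a sample):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# single-linkage clustering on cosine distances between TF-IDF rows
Z = linkage(X.toarray(), method="single", metric="cosine")
dendrogram(Z)
plt.show()

# records that merge at a very small distance are duplicate candidates
labels = fcluster(Z, t=0.1, criterion="distance")  # threshold is illustrative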

Second, regarding the TF-IDF representation, I am not sure it will yield good results, since TF-IDF relies on word frequency, so different spellings of the same name (e.g. Jim and James) or address will produce very different vectors. For a somewhat simpler solution, I would run your data through any pre-trained language model (e.g. word2vec from gensim, https://radimrehurek.com/gensim/models/word2vec.html), sum up all embeddings for the same person, and then run either a k-NN search or a hierarchical clustering.
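
A sketch of that embedding route using gensim's downloader API (the model name, the summing and the clustering parameters are all illustrative, not a definitive recipe):

import numpy as np
import gensim.downloader as api
from sklearn.cluster import AgglomerativeClustering

wv = api.load("word2vec-google-news-300")  # pre-trained word2vec vectors (large download)

def embed(text):
    # sum the embeddings of all in-vocabulary tokens for one person
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

E = np.vstack([embed(t) for t in corpus])

# single-linkage agglomerative clustering on cosine distances
# (scikit-learn >= 1.2 calls this parameter metric; older versions use affinity)
labels = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                 metric="cosine", linkage="single").fit_predict(E)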
