Optimal number of clusters for K-means not clear - any ideas?
I have a toy dataset of 10,000 strings containing people's names, addresses and birthdays. As a quirk of the data collection process, it is highly likely that there are duplicate people caused by typos, and I am trying to cluster them using K-means. I know there are easier ways of doing this, but I am doing it this way out of curiosity.
To vectorize each person, I concatenate the strings as [name][address][birthday] and then run the result through the following function to tokenise and clean it:
import re
import nltk
from nltk.corpus import stopwords

def preprocess_text(text, remove_stopwords=True):
    # remove links
    text = re.sub(r"http\S+", "", text)
    # remove special chars and numbers
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # remove stopwords
    if remove_stopwords:
        # 1. tokenize
        tokens = nltk.word_tokenize(text)
        # 2. check if stopword
        tokens = [w for w in tokens if not w.lower() in stopwords.words("english")]
        # 3. join back together
        text = " ".join(tokens)
    # return text in lower case and stripped of whitespaces
    text = text.lower().strip()
    return text
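For example, calling this on a concatenated record looks like the following (the record itself is made up, and it assumes nltk.download("punkt") and nltk.download("stopwords") have been run):

record = "John Smith 12 Belgrave Road Birmingham England 01/02/1990"  # hypothetical record
print(preprocess_text(record))
# -> 'john smith belgrave road birmingham england'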
This outputs to a dataframe containing the corpus, where each 'person' is a string something like:
john smith belgrave road birmingham england
The entire corpus is then run through the TF-IDF vectorizer in sklearn.
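For concreteness, the vectorization step is roughly this (a sketch; df["clean"] is my assumed column of preprocessed strings):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean"])  # sparse matrix of shape (10000, vocabulary size)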
When I use the Elbow Method and the Silhouette Method to try to find the optimal cluster number K for K-means, I get two graphs that don't show a clear value of K:
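For reference, this is roughly how I generate the two charts (a sketch; the K range and random_state are assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 50)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # elbow method: within-cluster SSE
    silhouettes.append(silhouette_score(X, km.labels_))   # silhouette method

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(ks, inertias)
ax1.set_xlabel("K"); ax1.set_ylabel("Inertia")
ax2.plot(ks, silhouettes)
ax2.set_xlabel("K"); ax2.set_ylabel("Silhouette score")
plt.show()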
I've not really seen anything like this before, and I was wondering if anyone has ideas for an optimal value of K based on these charts? Or do these charts show that K-means really isn't a good way of doing this? Or perhaps a TF-IDF vectorizer isn't a good approach, and a Bag-of-Words vectorizer would be better, since names and addresses should be semantically neutral? Any insight would be really appreciated. Thank you!
Topic: tfidf, scikit-learn, nlp, k-means, clustering
Category: Data Science