KMeans clustering on documents
I am in my early days in Data Science, so I cannot judge whether my approach is correct.
I have applied KMeans to a corpus to which some random documents (very short sentences) have been added. The documents have been vectorized so that they are suitable for clustering.
With the clustering results at hand, I was expecting each vector (keyword) to fall in exactly one cluster. This is not the case.
In some cases a vector falls in two clusters, and I wonder why.
- Is this because KMeans is not appropriate for vectors built from documents?
- Is this normal given the way KMeans works (moving the centroids, but de facto assigning each object to the nearest cluster by distance)?
- Is this overlap due to the fact that, when analysing my results, I look at all the items in a cluster and not just (say) the top X nearest to the center?
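For reference, the assignment step mentioned in the second point can be sketched as below. This is a toy illustration in plain Python, not scikit-learn's implementation; the function name `assign_to_nearest` and the 2-D points and centroids are made up for the example. It only shows that picking the nearest centroid gives each vector exactly one label.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        # The index of the smallest distance is the point's single label
        labels.append(distances.index(min(distances)))
    return labels

# Made-up 2-D vectors and centroids, for illustration only
points = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0)]
centroids = [(0.0, 0.0), (1.0, 1.0)]

labels = assign_to_nearest(points, centroids)
print(labels)  # one label per point
```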
-- Example:
corpus = [
'The car is driven on the road.',
'The truck is driven on the highway.',
'The train run on the tracks.',
'The bycicle is run on the pavement.',
'The flight is conducted in the air.',
'The baloon is conducted in the air.',
'The bird is flying in the air.',
'The man is walking in the street.',
'The pedestrian is crossing the zebra.',
'The pilot flights the plane].',
'On the route, the car is driven.',
'On the road, the truck is moved.',
'The train is running on the tracks.',
'The bike is running on the pavement.',
'The flight takes place in the sky.',
'Birds don''t fly when is dark',
'The baloon is in the water.',
'The bird flies in the sky.',
'In the road, the guy walks.',
'The pedestrian is passing through the zebra.',
'The pilot is flying the plane.',
'This is a Japanese doll.',
'I really want to go to work, but I am too sick to drive.',
'Christmas is coming.',
'With the daylight saving time turned off it''s getting dark soon.',
'The body fat may compensates for the loss of nutrients.',
'Mary plays the piano.',
'She always speaks to him in a loud voice.',
'Wow, does that work?',
'I don''t like walking when it is dark',
'Last Friday in three week’s time I saw a spotted striped blue worm shake hands with a legless lizard.',
'My Mum tries to be cool by saying that she likes all the same things that I do.',
'Mummy is saying that she loves me being a pilot when in reality she is scared all the time I take off.',
'Where do random thoughts come from?',
'A glittering gem is not enough.',
'We need to rent a room for our party.',
'A purple pig and a green donkey flew a kite in the middle of the night and ended up sunburnt.',
'If I don’t like something, I’ll stay away from it.',
'The body may perhaps compensates for the loss of a true metaphysics.',
'Don''t step on the broken glass.',
'It was getting dark, and we weren''t there yet.',
'Playing an instrument like the guitar takes out the stress from my day.']
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each sentence into a TF-IDF vector (one row of X per document)
vectorizer = TfidfVectorizer(analyzer='word',
                             max_df=0.8,
                             max_features=50000,
                             lowercase=True
                             )
X = vectorizer.fit_transform(corpus)

from sklearn.cluster import KMeans

num_clusters = 11
kmean = KMeans(n_clusters=num_clusters, random_state=1021)
# fit_predict returns one cluster label per row of X
clusters = kmean.fit_predict(X)
--
If you explore the clusters variable, you will notice the overlaps I am talking about.
For instance, the keyword baloon appears in both cluster 10 and cluster 0.
There are 12 overlaps, which on a dataset of 33 unique keywords is roughly 1/3, so not a result I can be happy with.
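One way such overlaps could be counted is sketched below. This is an assumption about the procedure, not necessarily the exact one used for the numbers above. The helper names `keywords_per_cluster` and `overlapping_keywords` are hypothetical, the three sentences and their labels are made-up stand-ins for `corpus` and `clusters`, and the tokenizer is a naive one rather than the TfidfVectorizer tokenizer.

```python
from collections import defaultdict

def keywords_per_cluster(docs, labels):
    """Map each cluster label to the set of words found in its documents."""
    words = defaultdict(set)
    for doc, label in zip(docs, labels):
        # Naive tokenizer (assumption): lowercase, drop '.' and ',', split on spaces
        tokens = doc.lower().replace('.', ' ').replace(',', ' ').split()
        words[label].update(tokens)
    return words

def overlapping_keywords(docs, labels):
    """Return {word: sorted cluster labels} for words seen in two or more clusters."""
    seen = defaultdict(set)
    for label, words in keywords_per_cluster(docs, labels).items():
        for word in words:
            seen[word].add(label)
    return {w: sorted(c) for w, c in seen.items() if len(c) > 1}

# Made-up stand-ins for `corpus` and `clusters`
docs = ['The baloon is conducted in the air.',
        'The baloon is in the water.',
        'Mary plays the piano.']
labels = [10, 0, 3]

overlaps = overlapping_keywords(docs, labels)
print(overlaps['baloon'])
```

In this sketch every document still carries exactly one label; a word is reported as overlapping only because it occurs in documents that carry different labels.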
Any advice is appreciated. Thanks