Clustering Tweet Data using DBSCAN Algorithm

I am clustering tweets with the DBSCAN algorithm. I apply the usual preprocessing steps and convert the sentences to vectors before running the algorithm. However, it always puts a large number of tweets into class '0'. The following is the plot showing eps against the number of clusters.

The following are the parameters that I pass.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x)

The following are the resulting clusters.

label
-1     1221
 0     1349
 1        2
 2        2
 3        4
       ... 
 67       3
 68       3
 69       2
 70       2
 71       2

Why does class 0 get so many more tweets than any other class?

Topic python-3.x text dbscan scikit-learn clustering

Category Data Science


Two things: eps and the quantitative representation of the text.

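Your plot shows a large number of clusters only around eps=0.15; other values give far fewer. eps is a hyperparameter that needs to be tuned, together with min_samples. With min_samples=2, any two tweets within eps of each other start a cluster, and clusters that touch get chained together, so a single cluster can end up absorbing most of the dense region.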
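The tuning described above can be sketched as a simple sweep over eps that records, for each value, the number of clusters, the number of noise points and the size of the largest cluster. This is only a minimal illustration: the toy `tweets` list, the TF-IDF vectorization and the helper name `sweep_eps` are assumptions made for the example, not the asker's actual data or pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the preprocessed tweets.
tweets = [
    "the match was great tonight",
    "great match tonight what a game",
    "what a great game tonight",
    "new phone release next week",
    "phone release date is next week",
    "the new phone is out next week",
    "rain expected all weekend",
]

# Assumed representation: plain TF-IDF vectors (sparse matrix).
x = TfidfVectorizer().fit_transform(tweets)


def sweep_eps(x, eps_values, min_samples=2):
    """Return one (eps, n_clusters, n_noise, largest_cluster) row per eps."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(
            eps=float(eps), min_samples=min_samples, metric="cosine"
        ).fit_predict(x)
        clustered = labels[labels != -1]          # drop noise points
        n_clusters = len(set(clustered.tolist()))
        n_noise = int(np.sum(labels == -1))
        largest = int(np.bincount(clustered).max()) if clustered.size else 0
        rows.append((float(eps), n_clusters, n_noise, largest))
    return rows


results = sweep_eps(x, np.linspace(0.1, 0.9, 9))
for eps, n_clusters, n_noise, largest in results:
    print(f"eps={eps:.1f} clusters={n_clusters} noise={n_noise} largest={largest}")
```

Looking at the largest cluster size and the noise count next to the cluster count makes it easier to spot the eps range where one cluster starts swallowing everything. The same loop can be repeated for a few values of min_samples.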

The other, more important thing is which quantitative representation of the text you use. You mention Bag of Words, TF-IDF, spaCy vectors and Word2Vec, but did you tune them? Did you try other embeddings, and so on? There is a lot of room for improvement here, and once the representation is good, DBSCAN will work much better.
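One way to check how much the representation matters is to run the same DBSCAN settings on two different vectorizations and compare the cluster size distributions. The sketch below assumes a toy `tweets` list and compares plain TF-IDF with a tuned TF-IDF followed by `TruncatedSVD` and L2 normalization; the parameter values (`n_components=3`, `eps=0.5`) and the helper `cluster_sizes` are illustrative assumptions, not recommended settings.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for the preprocessed tweets.
tweets = [
    "the match was great tonight",
    "great match tonight what a game",
    "what a great game tonight",
    "new phone release next week",
    "phone release date is next week",
    "the new phone is out next week",
    "rain expected all weekend",
]


def cluster_sizes(vectors, eps, min_samples=2):
    """Run DBSCAN with cosine distance and count points per label (-1 = noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)
    return Counter(labels.tolist())


# Representation 1: default TF-IDF.
plain_tfidf = TfidfVectorizer()

# Representation 2 (assumed settings): TF-IDF without English stop words and
# with sublinear term frequency, reduced to a few dense dimensions and
# re-normalized so cosine distance stays meaningful.
tuned_lsa = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)

sizes_by_repr = {
    "tfidf": cluster_sizes(plain_tfidf.fit_transform(tweets), eps=0.5),
    "tfidf+svd": cluster_sizes(tuned_lsa.fit_transform(tweets), eps=0.5),
}

for name, sizes in sizes_by_repr.items():
    print(name, dict(sorted(sizes.items())))
```

A good eps for one representation is usually not a good eps for another, so each representation should get its own eps sweep before the results are compared.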
