Clustering Tweet Data using DBSCAN Algorithm

Question

Clustering Tweet Data using DBSCAN Algorithm

Nilani Algiriyage

2022年4月29日 20:22

I am doing a tweet clustering using DBSCAN algorithm. I use all the preprocessing steps and convert sentences to vector format before applying the algorithm. However, It always puts a lot of tweets in to the '0' class. The following is the plot showing eps with number of clusters.

The following are the parameters that I pass.

dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x)

The following are the resulting clusters.

label
-1     1221
 0     1349
 1        2
 2        2
 3        4
       ... 
 67       3
 68       3
 69       2
 70       2
 71       2

What is the reason that class 0 getting a high number of tweets than any other classes?

Topic python-3.x text dbscan scikit-learn clustering

Category Data Science

Noah Weber · Accepted Answer · 2020年11月8日 13:49

Two things: eps and quantitative representation of text.

You see that there is only for eps=0.15 a lot of clusters. But for others a lot less. This is hyper parameter that needs to be optimised (and min_samples)

And the other thing thats more important is what you use quantitative representation of text. You said Bag of Words, TFIDF, Spacy Vectors and also, Word2Vec, but did you tune them? DId you tree embeddings etc etc. There is a lot of improvement here, and when its good dbscan will function a lot better.

Clustering Tweet Data using DBSCAN Algorithm

About