Clustering Tweet Data using DBSCAN Algorithm
I am doing a tweet clustering using DBSCAN
algorithm. I use all the preprocessing steps and convert sentences to vector format before applying the algorithm. However, It always puts a lot of tweets in to the '0' class. The following is the plot showing eps
with number of clusters.
The following are the parameters that I pass.
dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x)
The following are the resulting clusters.
label
-1 1221
0 1349
1 2
2 2
3 4
...
67 3
68 3
69 2
70 2
71 2
What is the reason that class 0 getting a high number of tweets than any other classes?
Topic python-3.x text dbscan scikit-learn clustering
Category Data Science