KDE on TF-IDF - sensitive bandwidth

I am clustering text using TF-IDF features and DBSCAN (density-based), and I am trying to rank points by how strongly they 'belong' to their cluster. Since the clustering is density-based and the points can be spread very irregularly, Kernel Density Estimation seemed relevant.

However, KDE scores are very sensitive to the choice of the bandwidth hyper-parameter, which I could not pre-estimate. Most bandwidth values end up with either effectively infinite scores for points outside the cluster, or zero scores for the points in the cluster. I need a way to choose the bandwidth 'automatically', so that the resulting scores make sense: larger values for points in the cluster, smaller values for points outside it.

I tried:

  • Both the Silverman and Scott factor methods, which estimate the bandwidth from the number of points and features; both were far from relevant in my case
  • GridSearchCV, which just returns the minimal bandwidth in the grid
  • Different kernel types (all the relevant ones are similarly sensitive)
  • Reducing the dimensionality, but as expected this severely hurt the KDE results without making the bandwidth noticeably less sensitive
import numpy as np
from sklearn.neighbors import KernelDensity

# indexes of the points in cluster 3
docs = np.where(y_pred == 3)[0]

# fit the KDE on the dense TF-IDF rows of that cluster
kde = KernelDensity(kernel='gaussian', bandwidth=0.399).fit(X_tfidf[docs].toarray())

# evaluate density scores on all points
scores = np.exp(kde.score_samples(X_tfidf.toarray()))

Note that there are ~2200 TF-IDF features, a few dozen points (40-120) in each cluster the KDE was fitted to, and about 4000 points in total.
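For reference, the Scott and Silverman rule-of-thumb factors mentioned above can be computed directly from the number of points n and features d (these are the standard formulas, as used e.g. by scipy.stats.gaussian_kde; the n and d values below are illustrative, matching the scales described in the question):

```python
import numpy as np

# Rule-of-thumb bandwidth factors for an n x d dataset.
n, d = 80, 2200
scott = n ** (-1.0 / (d + 4))
silverman = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
```

With d around 2200, the exponent -1/(d+4) is tiny, so both factors come out very close to 1 regardless of n, which may help explain why these rules are far from relevant in this high-dimensional setting.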

Any ideas on anything (even beyond KDE) are welcome, thank you.



"GridSearchCV returns the minimal bandwidth in the grid"

Then lower your grid's lower bound. If the cross-validated optimum sits at the edge of the grid, the grid is too narrow: extend it past that edge and search again until the best bandwidth falls strictly inside the range.
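A minimal sketch of that search, assuming a dense array of the cluster's TF-IDF rows (the toy data and grid range below are illustrative); GridSearchCV here maximizes the held-out log-likelihood that KernelDensity.score reports:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Toy stand-in for one cluster's dense TF-IDF rows; shapes are illustrative.
rng = np.random.default_rng(0)
X_cluster = rng.random((60, 20))

# Log-spaced grid reaching well below the previous lower bound.
grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    {'bandwidth': np.logspace(-3, 1, 40)},
    cv=5,
)
grid.fit(X_cluster)
best_bw = grid.best_params_['bandwidth']
```

If best_bw still equals the smallest (or largest) value in the grid, widen the grid in that direction and repeat.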
