KDE on TF-IDF - sensitive bandwidth

I am clustering text using TF-IDF features and DBSCAN (density-based), and I am trying to rank points by how strongly they 'belong' to their cluster. Since the clustering is density-based and the points can be spread very irregularly, Kernel Density Estimation seemed relevant.

However, KDE scores are very sensitive to the choice of the bandwidth hyper-parameter, which I could not pre-estimate. Most bandwidth values end up with either effectively infinite scores for points outside the cluster, or zero scores for the points in the cluster. I need a way to choose the bandwidth 'automatically', so that the resulting scores make sense: larger values for points in the cluster, smaller values for points outside it.

I tried:

  • Both the Silverman and Scott factor methods, which estimate the bandwidth from the number of points and features; both were far from relevant in my case
  • GridSearchCV, which just returns the minimal bandwidth in the grid
  • Different kernel types (all the relevant ones are similarly sensitive)
  • Reducing the dimensionality, but as expected this severely hurt the KDE results without making the bandwidth noticeably less sensitive
import numpy as np
from sklearn.neighbors import KernelDensity

# indexes of the points in cluster 3
docs = np.where(y_pred == 3)[0]

# fit the KDE on the dense TF-IDF rows of that cluster
kde = KernelDensity(kernel='gaussian', bandwidth=0.399).fit(X_tfidf[docs].toarray())

# evaluate density scores on all points
scores = np.exp(kde.score_samples(X_tfidf.toarray()))

Note that there are ~2200 TF-IDF features, a few dozen points (40-120) in each cluster the KDE was fitted to, and about 4000 points in total.
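For reference, the Scott and Silverman rule-of-thumb factors mentioned above can be computed directly from the number of points n and features d (these are the standard formulas, as used e.g. by scipy.stats.gaussian_kde; the n and d values below are illustrative, matching the scales described in the question):

```python
import numpy as np

# Rule-of-thumb bandwidth factors for an n x d dataset.
n, d = 80, 2200
scott = n ** (-1.0 / (d + 4))
silverman = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
```

With d around 2200, the exponent -1/(d+4) is tiny, so both factors come out very close to 1 regardless of n, which may help explain why these rules are far from relevant in this high-dimensional setting.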

Any ideas on anything (even beyond KDE) are welcome, thank you.



"GridSearchCV returns the minimal bandwidth in the grid"

Then lower your grid's lower bound. If the cross-validated optimum sits at the edge of the grid, the grid is too narrow: extend it past that edge and search again until the best bandwidth falls strictly inside the range.
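A minimal sketch of that search, assuming a dense array of the cluster's TF-IDF rows (the toy data and grid range below are illustrative); GridSearchCV here maximizes the held-out log-likelihood that KernelDensity.score reports:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Toy stand-in for one cluster's dense TF-IDF rows; shapes are illustrative.
rng = np.random.default_rng(0)
X_cluster = rng.random((60, 20))

# Log-spaced grid reaching well below the previous lower bound.
grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    {'bandwidth': np.logspace(-3, 1, 40)},
    cv=5,
)
grid.fit(X_cluster)
best_bw = grid.best_params_['bandwidth']
```

If best_bw still equals the smallest (or largest) value in the grid, widen the grid in that direction and repeat.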
