sklearn & Meanshift for NLP only returns 1 cluster
I am using sklearn.cluster to work with some text data and the MeanShift algorithm. I have:
- Done all standard NLP data prep like lemmatizing, removing stop words, etc.
- Used the TfidfVectorizer to create my word vectors on 80k-plus records
- The vectorizer gives me a sparse array, so I converted it using a standard .toarray() command
- I made a call to sklearn MeanShift and accepted all of the default parameters. The call looks like
meanshift = MeanShift().fit(fitted_vector_data.toarray())
and results in the following output when I call the model:
MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
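For reference, the steps above can be sketched as a minimal, self-contained example. The toy `docs` list is hypothetical and only stands in for the real 80k-plus records (it is assumed to be already lemmatized and stop-word filtered), so it is not expected to reproduce the single-cluster result by itself. The variable names follow the call shown above.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real, preprocessed records.
docs = [
    "cat sit mat",
    "dog chase cat",
    "dog bark loud",
    "kitten play yarn",
    "stock market fall",
    "market rally stock",
    "investor buy share",
    "bank raise rate",
    "rain fall today",
    "sunny weather today",
    "storm hit coast",
    "snow cover road",
]

# TfidfVectorizer returns a SciPy sparse matrix (one row per document).
vectorizer = TfidfVectorizer()
fitted_vector_data = vectorizer.fit_transform(docs)

# MeanShift is given a dense array here, hence the .toarray() call.
# All parameters are left at their defaults, as in the question.
meanshift = MeanShift().fit(fitted_vector_data.toarray())

# Count the distinct cluster labels assigned to the documents.
n_clusters = len(np.unique(meanshift.labels_))
print("number of clusters:", n_clusters)
```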
The problem is that no matter what data I pass in (whether it's 10 records or 10k records), it always gives me just 1 cluster when I should be getting hundreds of clusters.
This is my first time using MeanShift, so I'm guessing there is a problem with how I'm setting up my data and/or parameters? I should also point out that I have used other models like k-means and affinity propagation, with the same data prep, and those models gave multiple clusters.
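One parameter that may be worth inspecting is the bandwidth. With bandwidth=None, MeanShift estimates it via sklearn.cluster.estimate_bandwidth (default quantile of 0.3). TfidfVectorizer L2-normalizes each row by default, so no two rows can be farther apart than sqrt(2). The sketch below, again on a hypothetical toy corpus rather than the real data, compares the estimated bandwidth with the pairwise distances. It is a diagnostic under those assumptions, not a confirmed explanation.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Hypothetical toy corpus standing in for the real, preprocessed records.
docs = [
    "cat sit mat",
    "dog chase cat",
    "dog bark loud",
    "kitten play yarn",
    "stock market fall",
    "market rally stock",
    "investor buy share",
    "bank raise rate",
    "rain fall today",
    "sunny weather today",
    "storm hit coast",
    "snow cover road",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# The value MeanShift falls back to when bandwidth=None.
bandwidth = estimate_bandwidth(X)

# Euclidean distances between all pairs of TF-IDF rows.
dists = pairwise_distances(X)
max_dist = dists.max()

# Share of (row, row) pairs that lie within one bandwidth of each other,
# including each row paired with itself.
frac_within = (dists <= bandwidth).mean()

print("estimated bandwidth:", bandwidth)
print("largest pairwise distance:", max_dist)
print("fraction of pairs within bandwidth:", frac_within)
```

If, on the real data, the estimated bandwidth turns out to be close to the largest pairwise distance, nearly every point would fall inside every kernel window, which would be consistent with getting a single cluster. In that case an explicit, smaller value could be tried through the `bandwidth` argument of MeanShift.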
Topic mean-shift unsupervised-learning scikit-learn nlp clustering
Category Data Science