sklearn MeanShift for NLP only returns 1 cluster

I am using sklearn.cluster to work with some text data and the MeanShift algorithm. I have:

  1. Done all standard NLP data prep like lemmatizing, removing stop words, etc.
  2. Used the TfidfVectorizer to create my word vectors on 80k-plus records
  3. The vectorizer gives me a sparse matrix, so I converted it to a dense array with .toarray()
  4. Called sklearn's MeanShift and accepted all of the default parameters. The call looks like meanshift = MeanShift().fit(fitted_vector_data.toarray()), and printing the model gives: MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
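The steps above can be sketched end to end as follows. This is a minimal reproduction under stated assumptions, not the original pipeline: the tiny `docs` corpus is hypothetical, lemmatization is skipped, and stop-word removal is delegated to `TfidfVectorizer(stop_words="english")`.

```python
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the real data has 80k-plus records.
docs = [
    "the cat sat on the warm mat",
    "a cat chased the small mouse",
    "dogs bark loudly at strangers",
    "the dog fetched a red ball",
    "stock prices fell sharply today",
    "investors sold shares as markets dropped",
    "the recipe needs flour and sugar",
    "bake the cake with butter and eggs",
]

# Step 2: TF-IDF word vectors (returned as a SciPy sparse matrix).
vectorizer = TfidfVectorizer(stop_words="english")
fitted_vector_data = vectorizer.fit_transform(docs)

# Steps 3 and 4: densify, then fit MeanShift with all default parameters.
meanshift = MeanShift().fit(fitted_vector_data.toarray())

# One label per record; the number of distinct labels is the cluster count.
labels = meanshift.labels_
n_clusters = len(set(labels))
print("records:", len(docs), "clusters:", n_clusters)
```

Counting the distinct values in `labels_` is how the "1 cluster" result shows up in practice.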

The problem is that no matter what data I pass in (whether it's 10 records or 10k records), it always gives me just 1 cluster when I should be getting hundreds of clusters.

This is my first time using MeanShift, so I'm guessing the problem is in how I'm setting up my data and/or parameters. I should also point out that I have used other models, like k-means and affinity propagation, with the same data prep, and those models gave multiple clusters.
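Since the model was fit with bandwidth=None, scikit-learn picks the bandwidth itself via `estimate_bandwidth`, so one check that may help narrow this down is to see how the estimated bandwidth and the resulting cluster count change with its `quantile` argument. The sketch below is only a diagnostic under assumptions: it uses synthetic 2-D blobs from `make_blobs` as a stand-in for the TF-IDF vectors, and the quantile values are illustrative.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for quantile in (0.1, 0.3, 0.9):
    # Bandwidth is the mean distance to the farthest of the
    # int(n_samples * quantile) nearest neighbours of each point.
    bandwidth = estimate_bandwidth(X, quantile=quantile)
    if bandwidth <= 0:
        # MeanShift needs a strictly positive bandwidth; skip otherwise.
        continue
    model = MeanShift(bandwidth=bandwidth).fit(X)
    results[quantile] = (bandwidth, len(set(model.labels_)))
    print(f"quantile={quantile} bandwidth={bandwidth:.3f} "
          f"clusters={results[quantile][1]}")
```

Running the same loop on the dense TF-IDF array in place of `X` would show whether the cluster count responds to the bandwidth at all for this data.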
