Clustering of sparse matrix with many co-variates

I have a 2M x 2000 sparse matrix where each row represents an item and each column represents a dimension. I want to understand whether there are meaningful clusters in the data, and I have started exploring the dimensions in order to transform and normalise the data.

Of the 2000 attributes of an item, many are correlated with each other (rho > 0.5). Are there clustering techniques that handle correlated attributes well automatically, without having to remove them manually?

Topic sparsity clustering



You can apply PCA to reduce your data to fewer dimensions and then apply a classic clustering technique (e.g. k-means or DBSCAN); which one works depends on how your samples are distributed. Because principal components are uncorrelated by construction, the correlated attributes are combined automatically and you do not have to remove them by hand. I strongly recommend visualizing the data in 2 or 3 dimensions for a brief visual inspection, as it gives you some insight into what is going on. However, the final number of dimensions you keep from PCA may be more than 2 or 3 (it usually is).
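To illustrate why PCA helps with correlated attributes, here is a minimal sketch using NumPy only. The toy data (1000 items, 3 attributes, two of them strongly correlated) and the helper `max_abs_offdiag_corr` are assumptions made for illustration and are not part of the original question. It compares the largest pairwise correlation before and after projecting onto the principal axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real data (assumption): 1000 items, 3 attributes,
# where the second attribute is strongly correlated with the first.
a = rng.normal(size=1000)
X = np.column_stack([a, a + 0.5 * rng.normal(size=1000), rng.normal(size=1000)])


def max_abs_offdiag_corr(M):
    """Largest absolute correlation between two different columns of M."""
    corr = np.corrcoef(M, rowvar=False)
    off_diag = corr - np.diag(np.diag(corr))
    return float(np.abs(off_diag).max())


# PCA via SVD of the centered data: the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # coordinates of each item along the principal axes

corr_before = max_abs_offdiag_corr(X)
corr_after = max_abs_offdiag_corr(scores)
print(f"max |rho| before PCA: {corr_before:.3f}")
print(f"max |rho| after PCA:  {corr_after:.3f}")
```

The projected columns are uncorrelated up to floating-point error, so a clustering algorithm run on them no longer sees the redundant attributes as separate dimensions.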

Steps

  1. Normalize features if needed
  2. Apply PCA and keep the smallest number of principal components that together explain about 85% of the variance
  3. Optional: Visualize the embedded data in 2 or 3 dimensions to get a feeling for the distribution of the samples (this gives nothing more than an intuition, and even that intuition cannot be validated)
  4. Apply a clustering algorithm to the result of step 2
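The steps above can be sketched roughly as follows. This is a small illustration under stated assumptions, not a definitive implementation: the toy data (600 items, 10 attributes, 3 groups), the choice of k = 3, and the hand-written `kmeans` helper are all hypothetical, and only NumPy is used. A dense, in-memory approach like this would likely be impractical for a 2M x 2000 sparse matrix, where a library implementation designed for sparse input would be needed instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 groups of 200 items with 10 attributes each.
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(200, 10)) for c in centers])

# Step 1: normalize each feature to zero mean and unit variance.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: PCA via SVD; keep the fewest components reaching 85% of the variance.
U, S, Vt = np.linalg.svd(Xn, full_matrices=False)
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)
n_components = int(np.searchsorted(cumulative, 0.85) + 1)
Z = Xn @ Vt[:n_components].T

# Step 3 (optional): the first 2 or 3 columns of Z can be plotted for a visual check.


# Step 4: plain k-means (Lloyd's algorithm) on the reduced data.
def kmeans(data, k, n_iter=100, seed=0):
    local_rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = data[local_rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


labels, centroids = kmeans(Z, k=3)
print("components kept:", n_components)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

The 85% threshold is a rule of thumb from the steps above rather than a fixed requirement, and k-means can be swapped for DBSCAN or another algorithm depending on how the reduced samples are distributed.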
