Clustering of sparse matrix with many co-variates

Question

Strabonio

17 February 2020 08:11

I have a 2M x 2000 sparse matrix where each row represents an item and each column a dimension. I want to understand whether there are meaningful clusters in the data, and I have started exploring the dimensions in order to transform and normalise the data.

Of the 2000 attributes of an item, many are correlated (rho > 0.5). Are there clustering techniques that handle correlated attributes well automatically, without having to remove them manually?

Topic sparsity clustering

Category Data Science

Kasra Manshaei · Accepted Answer · 17 February 2020 08:11

You need to apply PCA to reduce your data to fewer dimensions, then apply a classic clustering technique (e.g. k-means or DBSCAN); which one works depends on how your samples are distributed. PCA produces uncorrelated components, so the correlated attributes are combined automatically rather than removed by hand. I strongly recommend visualizing the data in 2 or 3 dimensions for a brief visual inspection, which gives you some insight into what is going on there. However, the final number of dimensions you keep from PCA can be, and usually is, more than 2 or 3.

Steps

1. Normalize the features if needed.
2. Apply PCA and keep the number of PCs that explains 85% of the variance.
3. Optional: visualize the embedded data in 2 or 3 dimensions to get a feeling for the distribution of the samples (this gives nothing more than an intuition, and even that intuition cannot be validated).
4. Apply a clustering algorithm to the result of step 2.
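The steps above can be sketched roughly as follows. This is only a minimal illustration using NumPy on a small dense toy array, not on the 2M x 2000 sparse matrix from the question. The helper names (`standardize`, `pca_by_variance`, `kmeans`), the toy data and the choice of two clusters are all assumptions made for the example. Note that mean-centering turns a sparse matrix dense, so at the real scale a sparse-aware decomposition would probably be needed instead of the plain SVD used here.

```python
import numpy as np

def standardize(X):
    """Step 1: scale every column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (X - mu) / sd

def pca_by_variance(X, target=0.85):
    """Step 2: keep the fewest PCs whose cumulative explained variance >= target."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)          # explained variance ratio per PC
    k = int(np.searchsorted(np.cumsum(ratio), target)) + 1
    k = min(k, len(s))
    return Xc @ Vt[:k].T, k              # projected data and number of PCs kept

def kmeans(Z, n_clusters, n_iter=50, seed=0):
    """Step 4: plain Lloyd's k-means (a hypothetical stand-in for a library routine)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), n_clusters, replace=False)]
    for _ in range(n_iter):
        # squared distance of every sample to every center
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (assumed): 2 groups in a 3-d latent space, spread over 20 correlated columns.
rng = np.random.default_rng(0)
latent = np.vstack([rng.normal(-3, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))

Z, k = pca_by_variance(standardize(X), target=0.85)
labels, _ = kmeans(Z, n_clusters=2)
print("PCs kept:", k, "| cluster sizes:", np.bincount(labels))
```

Because the 20 columns are built from only 3 latent variables, they are strongly correlated, and PCA compresses them into a handful of components before clustering. The 85% threshold and the number of clusters are knobs to adjust for the real data.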
