Best metric and hyperparameters in dimension reduction with UMAP for binary sparse data

I am playing with a dimensionality reduction step prior to clustering for a pretty large sparse binary matrix of almost 3000 columns and 50k rows.

My idea is to embed the 3000 dimensions into a two-dimensional space with UMAP and then cluster the resulting 50,000 two-dimensional points with HDBSCAN.

I've found that UMAP accepts a number of options, such as metric, n_neighbors, min_dist and spread, but I cannot figure out which combination would give me the most distinct clusters. Is there any advice or best practice on dimension reduction with UMAP for binary data that works well in most cases, or am I expected to play with the parameters until I find a decent combination?

Topic sparse binary dimensionality-reduction

Category Data Science


The first step would be to take a random sample of ~3000 rows, so that you can try several options and find good ones quickly, before taking into account the whole dataset.

Note: in most cases where the data follow a natural distribution, a random sample is quite representative of the whole dataset, even when it covers only about 5% of the rows.
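As a minimal sketch of the sampling step, assuming the rows can be selected by integer index (true for a NumPy array or a SciPy CSR matrix), something like the following could be used. The helper name sample_row_indices and the variable X are hypothetical, not part of any library.

```python
import random

def sample_row_indices(n_rows, sample_size=3000, seed=0):
    """Pick `sample_size` distinct row indices uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    size = min(sample_size, n_rows)  # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), size))

idx = sample_row_indices(n_rows=50_000)
print(len(idx), idx[:5])

# Assumption: X is your 50k x 3000 binary matrix.
# X_sample = X[idx]
```

Keeping the seed fixed means every parameter combination is tried on the same rows, so differences between runs come from the parameters rather than from the sample.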

The most influential option is then "n_neighbors", which sets how many nearby points UMAP looks at when it estimates the local structure around each point: a very low value (e.g. 2) focuses on very local structure and tends to give many small, tightly packed clusters, whereas a very high value (e.g. 200) favours global structure and gives broader, more diffuse clusters. The best value is usually somewhere in between (maybe 50 or 100), and a few attempts on your sample should narrow it down.

"min_dist" is the minimum distance allowed between points in the low-dimensional embedding. Generally speaking, it should be 0.0 if you want tight, clearly separated clusters.

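A rough sketch of how such a parameter sweep on the sample could look is given here. It assumes the umap-learn and hdbscan packages are installed. The helper names, the candidate values, the "hamming" default and min_cluster_size=50 are illustrative assumptions, not recommendations from either library.

```python
from itertools import product

def parameter_grid(n_neighbors_values=(15, 50, 100), min_dist_values=(0.0, 0.1),
                   metric="hamming"):
    """Return one dict of UMAP keyword arguments per combination to try."""
    return [
        {"n_neighbors": n, "min_dist": d, "metric": metric, "n_components": 2}
        for n, d in product(n_neighbors_values, min_dist_values)
    ]

def embed_and_cluster(X_sample, umap_kwargs, min_cluster_size=50, seed=42):
    """Embed the sample with UMAP, cluster it with HDBSCAN, summarise the result."""
    # Imported here so the grid helper above works without these packages.
    import umap      # pip install umap-learn
    import hdbscan   # pip install hdbscan

    embedding = umap.UMAP(random_state=seed, **umap_kwargs).fit_transform(X_sample)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    noise_fraction = float((labels == -1).mean())
    return embedding, labels, n_clusters, noise_fraction

grid = parameter_grid()
for kwargs in grid:
    print(kwargs)
    # Assumption: X_sample is the sampled binary matrix.
    # _, _, n_clusters, noise = embed_and_cluster(X_sample, kwargs)
```

The cluster count and noise fraction are only crude summaries; plotting each embedding is still the most reliable way to judge whether the clusters look distinct.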
For binary data, the "metric" option can be set to any of the following:

  • hamming
  • jaccard
  • dice
  • russellrao
  • kulsinski
  • rogerstanimoto
  • sokalmichener
  • sokalsneath
  • yule

The best choice depends on your application domain. Hamming works well in general.

You can also experiment with the different parameters on this website: https://pair-code.github.io/understanding-umap/
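To see why the choice can matter for sparse data, here is a small pure-Python illustration of two of these metrics, written from their standard definitions rather than taken from UMAP's code. The two example rows are made up. Hamming counts shared zeros as agreement, while Jaccard only looks at positions where at least one row has a 1.

```python
def hamming_distance(a, b):
    """Fraction of positions where the two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the positions set to 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

# Two sparse rows of 20 features that have no 1s in common.
a = [1, 1] + [0] * 18
b = [0, 0, 1, 1] + [0] * 16

print(hamming_distance(a, b))  # 4 differing positions out of 20
print(jaccard_distance(a, b))  # no shared 1s, so maximal distance
```

With mostly-zero rows, Hamming distances tend to be small for every pair, so it may be worth trying both on the sample before committing to one.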
