Best metric and hyperparameters in dimension reduction with UMAP for binary sparse data

Question

linello

July 27, 2021, 14:47

I am experimenting with a dimensionality reduction step prior to clustering on a fairly large sparse binary matrix with almost 3,000 columns and 50k rows.

My idea is to embed the 3,000 dimensions into a two-dimensional space with UMAP and then cluster the resulting 50,000 two-dimensional points with HDBSCAN.

I've found that UMAP accepts a number of options, such as metric, n_neighbors, min_dist and spread, but I cannot figure out which combination is most likely to give me distinct clusters. Is there any advice or best practice for dimension reduction with UMAP on binary data that works well in most cases, or am I expected to play with the parameters until I find a decent combination?

Topic sparse binary dimensionality-reduction

Category Data Science

Nicolas Martin · Accepted Answer · July 27, 2021, 14:47

The first step would be to take a random sample of ~3000 rows, so that you can try several options and find good ones quickly, before taking into account the whole dataset.

Note: in most cases where the data follow a natural distribution, a random sample is quite representative of the whole dataset, even if it covers only about 5% of it.
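As a minimal sketch of the sampling step, the snippet below draws row indices uniformly at random without replacement. The helper name `sample_row_indices`, the seed, and the matrix name `X` in the comment are illustrative assumptions, not part of the original setup.

```python
import random


def sample_row_indices(n_rows, sample_size, seed=0):
    """Return sorted indices of a uniform random sample of rows (no repeats)."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = min(sample_size, n_rows)       # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), k))


# About 3000 of the 50k rows, as suggested above.
idx = sample_row_indices(n_rows=50_000, sample_size=3_000)

# With a NumPy array or SciPy CSR matrix called X (hypothetical name),
# the subset to experiment on would then be X[idx].
```

Keeping the seed fixed means every parameter combination is tried on the same subset, so differences between embeddings come from the parameters rather than from the sample.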

The most influential option is then "n_neighbors", which sets how many nearby points UMAP considers when it builds its picture of the structure around each point: a very low value (e.g. 2) gives very concentrated clusters, whereas a very high value (e.g. 200) gives very sparse, spread-out clusters. The best value is usually somewhere in between (maybe 50 or 100) and can be found after a few attempts on your sample.

"min_dist" is the minimum distance allowed between points in the low-dimensional embedding. Generally speaking, it should be 0.0 if you want clear clusters.
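A rough sketch of how these two parameters could be tried on the sample is shown below. It assumes the `umap-learn` and `hdbscan` packages are installed; the helper names, the particular n_neighbors values, `min_cluster_size=50` and the seed are illustrative guesses rather than recommended settings.

```python
def candidate_settings(neighbors=(15, 50, 100), min_dist=0.0, metric="hamming"):
    """Build one UMAP keyword dict per n_neighbors value to try."""
    return [
        {"n_neighbors": n, "min_dist": min_dist, "metric": metric, "n_components": 2}
        for n in neighbors
    ]


def embed_and_cluster(X_sample, settings, min_cluster_size=50, seed=42):
    """Embed the sample in 2D with UMAP, then cluster the embedding with HDBSCAN."""
    # Imported here so that building the settings list does not require
    # either package to be installed.
    import umap      # provided by the umap-learn package
    import hdbscan

    embedding = umap.UMAP(random_state=seed, **settings).fit_transform(X_sample)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    return embedding, labels


settings_grid = candidate_settings()

# Hypothetical usage, with X_sample being the sampled binary matrix:
# for settings in settings_grid:
#     embedding, labels = embed_and_cluster(X_sample, settings)
```

Plotting each embedding colored by its HDBSCAN labels is a quick way to judge which n_neighbors value separates the clusters best before running on all 50k rows.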

The "metric" options intended for binary data are the following:

hamming
jaccard
dice
russellrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule

The best choice depends on your application domain. Hamming works well in general.
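To get a feel for how the choice of metric can matter, here is a small hand-rolled comparison of two of the listed metrics on a pair of sparse binary vectors. The functions are written from the standard definitions for illustration only (they are not UMAP's own implementations), and the vectors are made up.

```python
def hamming_distance(a, b):
    """Fraction of positions where the two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the positions set to 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero vectors are at distance 0.
    return 0.0 if either == 0 else 1.0 - both / either


# Two sparse rows that share no active column at all.
a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

print(hamming_distance(a, b))  # 0.4
print(jaccard_distance(a, b))  # 1.0
```

Hamming counts the shared zeros as agreement, while Jaccard ignores them. Which behavior is appropriate depends on whether a shared absence is meaningful in your data, so it can be worth comparing a couple of metrics on the sample.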

You can also play with the different parameters with this website: https://pair-code.github.io/understanding-umap/
