Best metric and hyperparameters in dimension reduction with UMAP for binary sparse data
I am playing with a dimensionality reduction step prior to clustering for a pretty large sparse binary matrix of almost 3000 columns and 50k rows.
My idea is to embed the 3000 dimensions into a two-dimensional space with UMAP and then cluster the resulting 50,000 two-dimensional points with HDBSCAN.
I've found that UMAP accepts a number of options, such as metric, n_neighbors, min_dist and spread, but I cannot figure out which combination would give me distinct clusters.
Is there any advice or best practice for dimension reduction with UMAP on binary data that works well in most cases, or am I expected to play with the parameters until I find a decent combination?
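For concreteness, this is roughly the kind of parameter sweep I have in mind. It is only a minimal sketch, assuming the umap-learn and hdbscan packages are installed. The grid values, the choice of jaccard and hamming as candidate metrics, min_cluster_size, and the helper names are all placeholders of mine rather than recommendations, and the random toy matrix stands in for the real data.

```python
from itertools import product

import numpy as np
from scipy import sparse

# Candidate values are arbitrary placeholders, not recommendations.
PARAM_GRID = {
    "metric": ["jaccard", "hamming"],
    "n_neighbors": [15, 50],
    "min_dist": [0.0, 0.1],
}


def iter_param_combinations(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def embed_and_cluster(X, params, min_cluster_size=50, seed=42):
    """Embed X into 2-D with UMAP, then label the points with HDBSCAN."""
    # Imported here so the grid helpers can be used without the heavy dependencies.
    import hdbscan
    import umap

    reducer = umap.UMAP(n_components=2, random_state=seed, **params)
    embedding = reducer.fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    return embedding, labels


def make_toy_data(n_rows=500, n_cols=3000, density=0.02, seed=0):
    """Small random sparse binary matrix standing in for the real 50k x 3000 one."""
    rng = np.random.default_rng(seed)
    return sparse.csr_matrix(rng.random((n_rows, n_cols)) < density)


if __name__ == "__main__":
    X = make_toy_data()
    for params in iter_param_combinations(PARAM_GRID):
        _, labels = embed_and_cluster(X, params, min_cluster_size=10)
        # HDBSCAN marks noise points with the label -1, so exclude it from the count.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(params, "->", n_clusters, "clusters")
```

Counting clusters is only a crude way to compare runs; the toy data is uniformly random, so it should not be expected to show any real structure.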
Topic sparse binary dimensionality-reduction
Category Data Science