Correlation/distance between sparse vectors

Question

Roger Vadim

January 20, 2021, 13:52

I am looking for a metric for comparing gene count tables. These are long columns of data (a few million genes by a few dozen samples) with non-negative entries, about 90% of which are zeros. The goal is to compare the performance of the several tools/algorithms that produce these tables, by comparing the resulting tables with each other or with the expected counts (in the case of simulated data). In principle, one compares on a sample-by-sample basis, but comparing different samples might also be of interest, e.g., to filter out spurious correlations.

What I am using now is the Spearman rank correlation coefficient, taking into account the fact that some entries have identical ranks (certainly the zeros). I am looking for an approach better adapted to this setting (and preferably robust to outliers), and would appreciate suggestions.
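For reference, a minimal sketch of the tie-aware Spearman computation described above, in plain Python. It assumes the usual convention of giving tied entries (such as the zeros) the average of the ranks they span, and then taking the Pearson correlation of the ranks. The function names are illustrative only, not taken from any particular library.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied entries share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend 'end' over the run of entries tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation with average ranks for ties."""
    return pearson(average_ranks(x), average_ranks(y))
```

With 90% zeros, most entries end up sharing a single average rank, so the coefficient is driven largely by the non-zero minority.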

Topic sparse spearmans-rank-correlation distance correlation

Category Data Science

Erwan · Accepted Answer · January 20, 2021, 12:19

The first idea that comes to mind is a similarity measure such as cosine similarity. It's often used with sparse vectors (e.g., text represented as vectors over the vocabulary). There are many other options for distance/similarity measures:

Basic set measures such as the overlap coefficient or Jaccard.
Entropy-based measures such as KL divergence.
...
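As a rough illustration of the first few options, here is a minimal plain-Python sketch. It assumes each sample is stored as a dict mapping gene index to non-zero count (the `to_sparse` helper and all function names are hypothetical, chosen for this example). The set measures look only at which genes are non-zero and ignore the count values. KL divergence is left out because it is not finite when one vector has a zero where the other does not, so it would need some form of smoothing first.

```python
from math import sqrt


def to_sparse(counts):
    """Keep only the non-zero entries of a dense count vector, keyed by index."""
    return {i: c for i, c in enumerate(counts) if c != 0}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; only shared keys contribute to the dot product."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dict
    dot = sum(val * v[key] for key, val in u.items() if key in v)
    norm_u = sqrt(sum(val * val for val in u.values()))
    norm_v = sqrt(sum(val * val for val in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard(u, v):
    """Jaccard index on the sets of non-zero genes: |A and B| / |A or B|."""
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_coefficient(u, v):
    """Overlap coefficient on the sets of non-zero genes: |A and B| / min(|A|, |B|)."""
    a, b = set(u), set(v)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

Because only non-zero entries are stored and visited, the cost scales with the number of non-zero counts rather than with the full number of genes.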
