Correlation/distance between sparse vectors

I am looking for a metric for comparing gene count tables. These are long columns of data (a few million genes by a few dozen samples), with all non-negative entries, about 90% of which are zeros. The goal is to compare the performance of several tools/algorithms that produce these tables, by comparing the resulting tables with each other or with the expected counts (in the case of simulated data). In principle, one compares on a sample-by-sample basis, but comparing different samples might also be of interest, e.g., to filter out spurious correlations.

What I am using now is the Spearman rank correlation coefficient, accounting for the fact that some entries have identical ranks (certainly the zeros). I am looking for an approach better adapted to this setting (and preferably robust to outliers) and would appreciate suggestions.
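
For reference, the tie-aware Spearman coefficient described above can be sketched as follows. This is only an illustration in plain Python, assuming that ties (including the block of zeros) receive the average of the ranks they span and that the coefficient is then the Pearson correlation of those ranks. The function names and the toy counts are made up for the example; in practice a library routine such as `scipy.stats.spearmanr` would likely be used instead.

```python
from math import sqrt

def average_ranks(values):
    """Rank values 1..n; tied entries all get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_with_ties(x, y):
    """Spearman coefficient computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy count columns (hypothetical numbers), mostly zeros.
tool_a = [0, 0, 0, 5, 0, 12, 0, 0, 3, 0]
tool_b = [0, 0, 0, 4, 0, 15, 0, 1, 2, 0]
rho = spearman_with_ties(tool_a, tool_b)
```

With about 90% zeros, most entries share a single tied rank, so the coefficient is driven largely by whether the two tables agree on which genes are non-zero.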



The first idea that comes to mind is a similarity measure such as cosine similarity. It is often used with sparse vectors (for example, text represented as vectors over the vocabulary). There are many other options for distance/similarity measures:

  • Basic set measures such as the overlap coefficient or the Jaccard index
  • Entropy-based measures such as KL divergence
  • ...
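
The measures named above could be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: it assumes each sample is stored as a dict mapping gene id to a non-zero count (a hypothetical representation chosen to exploit sparsity), that the set measures are applied to the sets of genes with non-zero counts, and that KL divergence is computed over the union of observed genes after adding a pseudocount (the value 0.5 is an arbitrary assumption), since plain KL divergence is undefined when one vector has a zero where the other does not.

```python
from math import log, sqrt

def cosine_similarity(a, b):
    """Cosine similarity; only genes present in both vectors contribute to the dot product."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Jaccard index of the sets of genes with non-zero counts."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def overlap_coefficient(a, b):
    """Overlap coefficient: shared genes divided by the size of the smaller set."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb))

def kl_divergence(a, b, pseudocount=0.5):
    """KL(a || b) on smoothed, normalized counts over the union of observed genes.

    Assumption: genes that are zero in both vectors are ignored, and the
    pseudocount is an arbitrary smoothing choice to avoid log(0).
    """
    keys = set(a) | set(b)
    pa = {k: a.get(k, 0) + pseudocount for k in keys}
    pb = {k: b.get(k, 0) + pseudocount for k in keys}
    za, zb = sum(pa.values()), sum(pb.values())
    return sum((pa[k] / za) * log((pa[k] / za) / (pb[k] / zb)) for k in keys)

# Toy samples (hypothetical gene ids and counts); absent genes are zeros.
sample_a = {"g1": 5, "g2": 12, "g3": 3}
sample_b = {"g1": 4, "g2": 15, "g4": 1}

cos = cosine_similarity(sample_a, sample_b)
jac = jaccard(sample_a, sample_b)
ovl = overlap_coefficient(sample_a, sample_b)
kl = kl_divergence(sample_a, sample_b)
```

Note that cosine similarity works on raw counts and is therefore sensitive to a few very large values, while the set measures ignore magnitudes entirely; KL divergence is asymmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ.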
