What are some strategies to deal with label sparsity when training a protein function prediction model?

Protein function prediction takes a sequence of amino acids (think of words in a sentence, but with a vocabulary of only 20 words) and outputs the functions that the protein can perform. There are around 30 thousand function labels, and they are not mutually exclusive, so the task is essentially a very large multi-label problem with one binary prediction per label. The catch is that some labels are very common while others are very rare, and any given protein carries only a small number of the 30 thousand labels that biologists have defined. If you represent each protein's functions as a binary vector, where each 1 entry marks a function the protein has, then every vector is extremely sparse, and most labels have only a few positive examples in the training set, which leads to extreme label imbalance. What strategies can I use to deal with these problems?
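To make the setup concrete, here is a minimal sketch of the label representation I am describing. The protein names, label indices, and the `to_multi_hot` helper are all made up for illustration, and I use 8 labels instead of roughly 30 thousand so the output stays readable.

```python
# Toy stand-in for the real label space (hypothetical; the real one has ~30 thousand labels).
NUM_LABELS = 8

# Hypothetical annotations: protein name -> indices of the function labels it carries.
protein_labels = {
    "protein_a": [0, 1],
    "protein_b": [0],
    "protein_c": [0, 3],
    "protein_d": [0, 1, 5],
}


def to_multi_hot(label_indices, num_labels):
    """Return a 0/1 list of length num_labels with a 1 at each given index."""
    vector = [0] * num_labels
    for index in label_indices:
        vector[index] = 1
    return vector


# One binary row per protein, one column per label.
Y = [to_multi_hot(labels, NUM_LABELS) for labels in protein_labels.values()]

# Fraction of entries that are 1: how sparse the label matrix is overall.
density = sum(map(sum, Y)) / (len(Y) * NUM_LABELS)

# Number of positive examples per label: how imbalanced each binary task is.
positives_per_label = [sum(column) for column in zip(*Y)]

# Labels with at most one positive example in this toy training set.
rare_labels = [i for i, count in enumerate(positives_per_label) if count <= 1]

print(f"density: {density:.3f}")
print("positives per label:", positives_per_label)
print("labels with <= 1 positive:", rare_labels)
```

Even in this tiny example only a quarter of the entries are 1, one label is positive for every protein, and most labels have one positive example or none. The real data has the same shape, just far more extreme.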

Topic bioinformatics class-imbalance

Category Data Science
