How can I turn a classification problem into a regression problem?
I have data describing genes, each of which gets 1 of 4 labels. I use this data to train models that predict labels for other, unlabelled genes. I have a huge class imbalance, with 10k genes in 1 label and 50-100 genes in the other 3 labels. Because of this imbalance, I'm trying to convert my labels into numeric values so that a model predicts a score rather than a label, in the hope of reducing bias.
Currently I convert my 4 labels (most likely, likely, possible, and least likely to affect a disease) into scores between 0 and 1: most likely: 0.9, likely: 0.7, possible: 0.4, and least likely: 0.1. I chose these values based on how similar the previous label definitions were in their data. I've been using scatter plots with a linear model to try to understand which model would best fit my data and to reduce overfitting, but I'm not sure whether there's more I can infer from them, except that the data shows homoskedasticity (I think? I have a biology background, so I'm learning as I go).
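For concreteness, the label-to-score conversion described above amounts to something like the following minimal Python sketch. Only the four score values come from my setup; the names `LABEL_SCORES` and `labels_to_scores` and the example labels are made up for illustration.

```python
# Scores as stated above; the dictionary and function names are hypothetical.
LABEL_SCORES = {
    "most likely": 0.9,
    "likely": 0.7,
    "possible": 0.4,
    "least likely": 0.1,
}


def labels_to_scores(labels):
    """Convert a list of label strings into numeric regression targets."""
    return [LABEL_SCORES[label] for label in labels]


# Made-up labels for four genes, for illustration only.
example_labels = ["least likely", "possible", "most likely", "likely"]
example_scores = labels_to_scores(example_labels)

# The regression target still takes only as many distinct values as there are labels.
distinct_scores = sorted(set(example_scores))

print(example_scores)
print(distinct_scores)
```

Note that after this conversion the target is numeric but still takes only four distinct values, one per original label.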
I'm not sure whether there is a more standard way I should be approaching this, or whether this regression conversion is problematic in ways I don't realise. Should I be trying to develop more scores in my training data, or is there something else I should be doing?
Edit for more information: I create the current 4 labels based on drug knowledge of the genes and the drug data I currently have for each gene. I think I could incorporate other biological knowledge I have to make further labels. For example, the 'most likely' genes are labelled as such because they are drug targets for my disease, and the 'likely' genes because they interact with drugs to cause a side effect which leads to the disease. The other 2 labels go down in relevance, ending with the 'least likely' genes, which have no drug-related or statistical evidence of causing the disease.
Topic bioinformatics regression classification machine-learning
Category Data Science