How to make a classification problem into a regression problem?

I have data describing genes, each of which gets 1 of 4 labels, and I use this to train models to predict/label other unlabelled genes. I have a huge class imbalance: ~10k genes carry one label while the other 3 labels have only 50-100 genes each. Because of this imbalance, I'm trying to change my labels into numeric values so a model can predict a score rather than a label, hoping to reduce bias.

Currently, from my 4 labels (most likely, likely, possible, and least likely to affect a disease), I convert the labels into scores between 0 and 1: most likely: 0.9, likely: 0.7, possible: 0.4, and least likely: 0.1 (chosen based on how similar the original label definitions were in their data). I've been using scatter plots with a linear model to try to understand which model would best fit my data and to reduce overfitting, but I'm not sure there's more I can infer from this except that the data shows homoskedasticity (I think? I have a biology background, so I'm learning as I go).

I'm not sure if there is a more established way I should be approaching this, or if this regression conversion is problematic in ways I don't realise. Should I be trying to develop more scores in my training data, or is there something else I should be doing?

Edit for more information: I create the current 4 labels based on drug knowledge of the genes and the drug data I currently have for each gene; I think I could incorporate other biological knowledge to make further labels. For example, the 'most likely' genes are labelled as such because they are drug targets for my disease, and the 'likely' genes because they interact with drugs to cause a side effect which leads to the disease. The other 2 labels go down in relevance until the 'least likely' genes, which have no drug-related or statistical evidence of causing the disease.

Topic bioinformatics regression classification machine-learning

Category Data Science


So, the direct answer here is clearly no.

The answer comes from the definitions of classification and regression. In a classification task, the model predicts the probability that an instance belongs to a class (e.g. 'image with clouds' vs 'image without clouds'), while in regression you are trying to predict continuous values (e.g. the level of 'cloudiness' of an image).

Sometimes you can turn a regression problem into a classification one. For example, if I have a dataset of images labelled with cloudiness levels from 0 to 5, I could define a threshold, e.g. 2.5, and use it to turn the continuous values into discrete ones, then use those discrete values as classes (cloudiness level < 2.5 means 'image without clouds'). The opposite direction, however, is definitely not possible.
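As a minimal sketch of that discretisation step (the cloudiness values here are made up for illustration):

```python
# Hypothetical cloudiness scores on a 0-5 scale for five images.
cloudiness = [0.3, 4.1, 2.6, 1.9, 3.7]

# Discretise with the threshold described above: > 2.5 means clouds.
threshold = 2.5
labels = ["with clouds" if c > threshold else "without clouds" for c in cloudiness]
print(labels)
```

Going the other way would require inventing continuous values that the class labels simply do not contain.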

Here's also a link to a similar question: Could I turn a classification problem into regression problem by encoding the classes?

There are many ways to deal with imbalanced classes, not just oversampling: you can generate artificial data, add class weights to the loss function, use active learning to gather new data, or use models that return an uncertainty score for each prediction (like Bayesian networks). I'm sure there are plenty of answers and strategies you can try.
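The class-weights option is probably the cheapest to try first. Here is a sketch with scikit-learn on a toy imbalanced dataset (the data is synthetic; `class_weight="balanced"` reweights each class inversely to its frequency):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced dataset: 100 samples of class 0, only 5 of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (5, 3))])
y = np.array([0] * 100 + [1] * 5)

# "balanced" weights are n_samples / (n_classes * count_per_class),
# so the rare class contributes as much to the loss as the common one.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # the rare class gets a much larger weight

clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Most scikit-learn classifiers (random forests, SVMs, etc.) accept the same `class_weight` argument, so this extends directly to the 4-class gene problem.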


Yes, you can go this route, using regression rather than classification, but you should one-hot encode your classes. This means that your model will have 4 outputs (alternatively, you can think of it as having 4 models). The first output will be the certainty that label 1 applies, the second that label 2 applies, etc.

For example, if you have 10 data points with labels 1, 2, 3, 4, 2, 4, 3, 1, 1, 2, your one-hot encoded labels look like this:

1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
1 0 0 0
0 1 0 0

And a prediction for one data point could look like this:

0.445 0.129 1.234 -0.231

This data point scores highest for label 3, with a little weight on label 1 as well. (Note that these raw regression outputs are scores rather than probabilities, since they can fall outside the 0-1 range, as the -0.231 above shows.)
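The whole scheme can be sketched in a few lines with scikit-learn, whose `LinearRegression` handles multi-output targets natively (the features here are random placeholders for whatever gene features you actually have):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The 10 labels from the example above, one-hot encoded into a (10, 4) matrix.
labels = np.array([1, 2, 3, 4, 2, 4, 3, 1, 1, 2])
onehot = np.eye(4)[labels - 1]

# Made-up features standing in for real gene features.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))

# One regression output per class; the largest score wins at prediction time.
model = LinearRegression().fit(X, onehot)
scores = model.predict(X)
predicted = scores.argmax(axis=1) + 1  # map column index back to labels 1..4
```

Taking the argmax over the 4 output columns recovers a hard label when you need one, while the raw scores preserve the "how certain" information the asker was after.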
