How to interpret a specific feature importance?

Apologies for a very case-specific question. I have a dataset of genes and am using machine learning to predict whether a gene causes a disease. One of my features is a beta value (the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use this feature.

I condense the beta values from the variant level to the gene level, so a gene is left with multiple beta values like this:

Gene     Beta
ACE      -0.7, 0.1, 0.6
NOS       0.2, 0.4, 0.5
BRCA     -0.1, 0.1, 0.2

Currently I am trying two options for selecting a single beta value per gene: one where I take the value with the largest absolute magnitude and discard its sign, and another where I take the same value but restore its original sign. I am trying both because, for beta values, the sign indicates the direction of a gene's effect on the disease and the magnitude indicates the size of that effect, so I would think it's important to retain the sign information (as I understand it).
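A minimal sketch of the two options in pandas (toy values taken from the table above; the column names are my own):

```python
import pandas as pd

# Toy data mirroring the table above (one row per variant-level beta).
df = pd.DataFrame({
    "Gene": ["ACE", "ACE", "ACE", "NOS", "NOS", "NOS", "BRCA", "BRCA", "BRCA"],
    "Beta": [-0.7, 0.1, 0.6, 0.2, 0.4, 0.5, -0.1, 0.1, 0.2],
})

# Option 1: largest absolute beta per gene, sign discarded.
abs_beta = df.groupby("Gene")["Beta"].apply(lambda s: s.abs().max())

# Option 2: the beta with the largest magnitude, original sign retained.
signed_beta = df.groupby("Gene")["Beta"].apply(lambda s: s.loc[s.abs().idxmax()])

print(abs_beta)     # ACE 0.7, BRCA 0.2, NOS 0.5
print(signed_beta)  # ACE -0.7, BRCA 0.2, NOS 0.5
```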

However, I've been advised to use just the absolute values without retaining the sign, and I'm not sure if there's a way to know whether one option is better than the other from a machine learning perspective. In either case I also have a problem where my model rates this feature as far more important than any other feature in my dataset: for example, gradient boosting gives it an importance of 0.01, while the next most important feature is at 0.001.

So my question is: how best can I interpret a highly important feature like this? If it is much more important than the rest, is that actually a bias, likely caused by my own handling/preprocessing of the feature, or is it acceptable that it is simply very important? Would it be possible to re-weight the importance of this particular feature in my model? I have a biology background, so I'm not sure what the normal or least biased approach is.
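For context, this is roughly how I'm reading the importances, sketched here on synthetic data (not my real gene dataset; the feature names are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
beta = rng.normal(size=n)
other = rng.normal(size=(n, 2))

# Synthetic label driven mostly by the beta feature, as in my data.
y = (beta + 0.2 * other[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([beta, other])

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, summing to 1 across features.
for name, imp in zip(["beta", "feat2", "feat3"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```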

Topic bioinformatics feature-selection machine-learning

Category Data Science


You can use one of two approaches.

The first is unsupervised:

Use PCA to extract the components that best represent the dataset's variance. PCA builds new features, each a linear combination of the original features (independent of the label); the first component it extracts is the most important (it explains the most variance) and the last is the least important. You can then retrieve the weight of the beta value within the most important component. Here is an example of that: https://stackoverflow.com/a/34692511/6677037
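A minimal sketch of retrieving those weights with scikit-learn's PCA (random data standing in for your gene-level feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: rows are genes, columns are features
# (e.g. the condensed beta value plus other gene-level features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2)
pca.fit(X)

# pca.components_[0] holds each feature's weight (loading) in the first
# principal component; a large |weight| means that feature contributes
# strongly to the direction of greatest variance.
print(pca.components_[0])
print(pca.explained_variance_ratio_)
```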

The second approach is supervised:

It uses the labels, so apply it carefully and do not choose the features based on the test set. With these methods you can rank features by scores such as chi-squared or mutual information, then remove the least important ones. Here is a straightforward walkthrough: https://hub.packtpub.com/4-ways-implement-feature-selection-python-machine-learning/
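For example, with scikit-learn (synthetic data in place of your gene features; note that chi-squared scoring requires non-negative inputs, so mutual information is the safer choice if you keep signed betas):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the gene feature matrix.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

# Fit the selector on training data only, so the test set never
# influences which features are kept.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
selector = SelectKBest(mutual_info_classif, k=3).fit(X_tr, y_tr)

X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)
print(selector.get_support())  # boolean mask of the kept features
```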

Good luck.
