What is the Purpose of Feature Selection?

I have a small medical dataset (200 samples) that contains only 6 cases of the condition I am trying to predict using machine learning. So far, the dataset is not proving useful for predicting the target variable and is resulting in models with 0% recall and precision, probably due to how small the dataset is.

However, in order to learn something from the dataset, I applied Feature Selection techniques to deduce which features are useful in predicting the target variable, and to see whether this supports or contradicts previous literature on the matter.

When I reran my models using the reduced dataset, this still resulted in 0% recall and precision, so the prediction performance has not improved. But the features returned by applying Feature Selection have given me more insight into the data.
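For concreteness, here is a stripped-down sketch of the workflow described above, assuming scikit-learn; the data, the chosen model, and the value of k are placeholders rather than the actual ones from my project:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Placeholder data shaped like the real problem: 200 samples, 6 positive cases.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = np.zeros(200, dtype=int)
y[rng.choice(200, size=6, replace=False)] = 1

# Step 1: keep the k features with the strongest univariate association with y.
# (Caveat: selecting on the full dataset before cross-validation leaks label
# information; it is written this way only to mirror the workflow described.)
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Step 2: rerun the model on the reduced dataset and score it.
# Stratified folds ensure every fold contains at least one positive case.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X_reduced, y, cv=cv)

print("precision:", precision_score(y, y_pred, zero_division=0))
print("recall:   ", recall_score(y, y_pred, zero_division=0))
```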

So my question is, is the purpose of Feature Selection:

  • to improve prediction performance
  • or can the purpose be to identify the features relevant to the prediction and to learn more about the dataset

So in other words, is Feature Selection just a tool for improved performance, or can it be an end in itself?

Also, if using the subset of features returned by Feature Selection methods does not improve the accuracy or recall of the model, how can I demonstrate that these features are indeed relevant to my prediction?

If you can link some resources about this issue, that would be very useful.

Thank you.

You partially answered your own question. Feature selection is for gaining insight into your problem, regardless of whether the selected features are actually used in a model. This is particularly important when working with a small number of features, as in your case, since you would expect the important ones to surface during modeling. And if what surfaces is contrary to what you expect, that is important too, since it might indicate problems with sample size, measurement, etc.
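One way to make that insight concrete, and to address the follow-up question about demonstrating relevance, is a label-permutation test on the feature scores themselves: shuffle the target many times, re-score the features, and check how often the shuffled scores match or beat the real ones. A minimal sketch, assuming scikit-learn and placeholder data standing in for the real dataset:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Placeholder data mirroring the question: 200 samples, 6 positive cases.
X = rng.normal(size=(200, 10))
y = np.zeros(200, dtype=int)
y[:6] = 1

# Univariate F-scores of each feature against the real labels.
real_scores, _ = f_classif(X, y)

# Null distribution: the same scores recomputed on shuffled labels.
n_perm = 1000
null_scores = np.empty((n_perm, X.shape[1]))
for i in range(n_perm):
    null_scores[i], _ = f_classif(X, rng.permutation(y))

# Empirical p-value: how often a shuffled score matches or beats the real one.
p_values = (null_scores >= real_scores).mean(axis=0)
for j, (score, p) in enumerate(zip(real_scores, p_values)):
    print(f"feature {j}: F={score:.2f}, permutation p={p:.3f}")
```

A feature whose real score is rarely beaten by chance (small permutation p-value) is evidence of relevance that does not depend on any downstream model's recall or precision.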

Feature selection can also be used to improve performance, provided you are willing to downplay interpretability, monitor the model, and re-optimize it when it degrades.
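If performance is the goal, the selection step should live inside the cross-validation loop rather than before it, and the number of features kept becomes a hyperparameter to tune. A sketch of that pattern, again assuming scikit-learn with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: ~3% positives, roughly matching 6 cases in 200.
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           weights=[0.97], random_state=0)

# Selection happens inside the pipeline, so each CV fold selects features
# on its training split only -- no information leaks from the test split.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Treat the number of kept features as a hyperparameter and tune it.
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [2, 5, 10, 20]},
    scoring="recall",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print("best k:", search.best_params_["select__k"])
print("cross-validated recall:", round(search.best_score_, 3))
```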

The difference between the two is that if you choose the second approach and your model degrades, I think you will need to explain what is happening in terms of interpretability, or just re-optimize it and 'hope for the best' (not recommended). Often companies don't care while your model is performing well, but will begin to ask questions when it is not.

In the first case, you will always have an interpretable model with (hopefully) acceptable performance. There are also techniques such as Lasso regression, which lets you optimize for performance while shrinking coefficients (some of them exactly to zero) until the model reaches an acceptable level of interpretability.
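A small illustration of that idea on synthetic data: as the Lasso regularization strength `alpha` grows, more coefficients are driven exactly to zero, which performs feature selection and simplifies interpretation in one step.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Larger alpha = stronger shrinkage = fewer surviving (non-zero) coefficients.
for alpha in (0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    kept = np.flatnonzero(model.coef_)
    print(f"alpha={alpha}: {len(kept)} features kept, indices {kept.tolist()}")
```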

So both explainability AND performance are served by feature selection nowadays. The choice often depends on the specific type of problem: modeling for social and health issues requires interpretation, while 'big data' problems often call for performance-enhancing feature selection.
