Using cross-validation score to perform feature selection

To perform feature selection, I ran cross-validation over and over again, each time trying a different subset of my attributes, and repeated this until I got the best cross-validation score I could. Is this alright to do, or am I creating a major bias? I suspect this could cause bias and possibly data leakage, because I am probably learning something about my test set by doing this, but how bad would this bias be? My data set is too small to create another validation set.

The method itself is sound: it is an optimization search over the possible feature subsets. This is often done with an exhaustive search or a genetic search, as in the sketch below.
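
For illustration, here is a minimal sketch of the exhaustive variant, assuming scikit-learn; the data, estimator, and CV settings are placeholders, not anything from the question:

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

best_score, best_subset = -np.inf, None
# Try every non-empty subset of features (only feasible when there are few).
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best CV score {best_score:.3f} with features {best_subset}")
```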

But you have the right intuition: at the end of this process, once you have picked the best subset of features, you must evaluate it on an independent test set made of unseen data, as sketched below. Selecting the best subset of features is itself a form of training, so the performance you obtain with CV is equivalent to performance on the training set.
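
A sketch of what that looks like in practice, under the same assumptions as above (synthetic placeholder data, scikit-learn): hold out the test set before the search, select the subset with CV on the training part only, and touch the test set exactly once at the end.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# Split off unseen data BEFORE any feature selection happens.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def best_subset_by_cv(X, y):
    """Exhaustive subset search scored by 5-fold CV, on training data only."""
    best_score, best_subset = -np.inf, None
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            score = cross_val_score(
                LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=5
            ).mean()
            if score > best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset

cv_score, subset = best_subset_by_cv(X_tr, y_tr)
final = LogisticRegression(max_iter=1000).fit(X_tr[:, list(subset)], y_tr)
print(f"CV score of best subset (optimistic): {cv_score:.3f}")
print(f"Score on unseen test set: {final.score(X_te[:, list(subset)], y_te):.3f}")
```

If the data set is genuinely too small to spare a hold-out set, nested cross-validation (an outer CV loop wrapped around the entire selection procedure) gives the same kind of unbiased estimate without sacrificing data.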

It's impossible to know how bad the bias is without evaluating on a fresh test set. But in general, if you try a very large number of subsets, some of them will score well purely by chance, so the performance of the best subset is very likely to be overestimated (overfitting).
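
You can see the effect directly with a small experiment (a sketch on synthetic data with no real signal, so every value here is made up): search many random subsets of pure-noise features, and the best CV score will typically land well above the true chance-level accuracy of 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))       # pure noise features
y = rng.integers(0, 2, size=100)     # labels unrelated to X

best = 0.0
for _ in range(200):                 # many candidate subsets
    subset = rng.choice(50, size=5, replace=False)
    score = cross_val_score(
        LogisticRegression(max_iter=1000), X[:, subset], y, cv=5
    ).mean()
    best = max(best, score)

# The true accuracy of any model on this data is 0.5; the winning CV
# score exceeds it purely by chance.
print(f"Best CV score found on noise: {best:.3f}")
```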
