Feature selection for regression

Suppose I have a response variable y and a set of feature variables (x1, x2 ... xn). I wish to find which of x1...xn are the best features for y in a regression problem (the relationship might not be linear).

Is there any way I can do this kind of feature selection without using any correlation measure or regression function in the process (i.e. I cannot use any filter or wrapper methods)?

Topic regression feature-selection

Category Data Science


If you do not want to use filter or wrapper feature selection methods, you can use tree-based algorithms to compute the feature importance of all the features. Random Forest, LightGBM, XGBoost, and CatBoost can all be used for this purpose. CatBoost is an interesting option because it can handle categorical features natively. A sketch of this approach is shown below.
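
As a minimal sketch of the tree-based route, here is a random forest whose built-in impurity importances rank the features; the synthetic data and the feature names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
# y depends non-linearly on x1 and x2 only; x3 and x4 are pure noise.
y = np.sin(X["x1"]) + X["x2"] ** 2 + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances: one value per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Because the trees split on whatever reduces error, non-linear relationships such as the squared term above still show up in the importances.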

You can also use L1 or L2 regularization to select the best features; read up on the Lasso and Ridge algorithms. Note that L1 (Lasso) drives some coefficients to exactly zero, so it is the one that actually performs selection, whereas L2 (Ridge) only shrinks coefficients toward zero.
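
A minimal sketch of Lasso-based selection, assuming standardized features and an illustrative (untuned) alpha:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only the first two columns actually drive y; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Standardise first so the L1 penalty treats all features on the same scale.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

coefs = pipe.named_steps["lasso"].coef_
# Features whose coefficients were driven to exactly zero are dropped.
selected = [f"x{i + 1}" for i, c in enumerate(coefs) if c != 0]
print("coefficients:", coefs)
print("selected features:", selected)
```

In practice you would tune alpha (e.g. with LassoCV) rather than fixing it as done here.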

A word of caution though: take feature selection methods with a pinch of salt and never rely on them alone. In my opinion, the best feature selection method is filtering out features based on domain knowledge.


Look up scikit-learn's feature selection module. In particular, f_regression and mutual_info_regression can be used to identify the best features for the problem at hand.
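
A minimal sketch of how these two scorers might be used via SelectKBest; the synthetic data and the choice of k=2 are assumptions for illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

# f_regression captures linear association; mutual information also picks
# up non-linear dependence such as the squared term on the first feature.
for scorer in (f_regression, mutual_info_regression):
    selector = SelectKBest(score_func=scorer, k=2).fit(X, y)
    print(scorer.__name__,
          "scores:", np.round(selector.scores_, 3),
          "selected columns:", selector.get_support(indices=True))
```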


You can train a LightGBM regressor. LightGBM has feature importance measures embedded in it, and you can plot them directly to see which features are important. See this link: LightGBM Plot Importance
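
A minimal sketch of that idea; the synthetic data, feature names, and parameters below are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
y = np.exp(X["x1"]) + X["x2"] * X["x3"] + rng.normal(scale=0.1, size=500)

model = lgb.LGBMRegressor(n_estimators=200).fit(X, y)

# importance_type can be "split" (how often a feature is used in splits)
# or "gain" (total gain contributed by splits on that feature).
lgb.plot_importance(model, importance_type="gain")
plt.show()
```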
