How do I fine-tune model performance after the initial run? (Scikit-Learn)

I've just started learning regression with scikit-learn and stumbled upon a problem. For a given dataset, let's say I've imputed the missing data and one-hot encoded all categorical features. This is the point where it starts getting confusing for me.

After one-hot encoding the categorical features, I usually end up with a lot of columns. How do I know whether all of these columns benefit the model's performance? If they don't, how can I determine which columns/features to keep? Is there a method for determining the importance of these columns (their 'influence' on the model, perhaps?), or is it more of a trial-and-error situation?

I understand that modeling is an iterative process: even after the initial data analysis and modeling, the results from that first model should be used to improve it by 'fine-tuning' the hyperparameters or the data accordingly. However, I have no intuition about what to do after the first model fit. Ideally, how should one approach fine-tuning model parameters/data configurations based on the model's initial run?

I would greatly appreciate some help.

Tags: linear-regression, beginner, scikit-learn, feature-selection

Category: Data Science


Since you added that you use linear regression, here are a few ideas (though this is still a very broad question):

How do I know that all of these columns benefit the model's performance? [...] how can I determine which columns/features to keep?

Have a look at An Introduction to Statistical Learning (ISL), Chapter 6.1. You can use stepwise (feature) selection for a start; see the sketch below. The book comes with "R labs", so you can see directly how it is done (Python versions are also available).
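As a minimal sketch of forward stepwise selection with scikit-learn's SequentialFeatureSelector (the synthetic data here is just a stand-in for your own X and y, and 5 features is an arbitrary target):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Placeholder data; substitute your imputed, one-hot-encoded matrix.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Greedily add one feature at a time, keeping whichever subset
# cross-validates best, until 5 features are selected.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected columns
```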

Is there a method of determining the importance of these columns

Yes: apply "shrinkage" (i.e. regularization, e.g. the Lasso) to standardized features and see which of them retain a "strong" coefficient. This is Chapter 6.2 in ISL; a sketch follows below.
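A minimal sketch of that idea with scikit-learn, using the Lasso as the shrinkage method (the data is again synthetic, and alpha=1.0 is just an example penalty strength):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Standardize first so coefficient magnitudes are comparable, then fit
# a Lasso; its L1 penalty pushes weak coefficients exactly to zero.
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
model.fit(X, y)

coefs = model.named_steps["lasso"].coef_
print(np.flatnonzero(coefs))  # indices of features that survived the shrinkage
```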

However, I have no intuition/idea on what to do after the first model fitting.

In plain linear models there is not much to be tuned. You can do feature selection / feature engineering / feature generation, but apart from that there are no hyperparameters to tune (regularized variants such as Ridge and Lasso do have a penalty strength). More importantly, if you are after a predictive model, make sure you have a proper test strategy. This is explained in Ch. 5 of ISL ("Resampling Methods") and sketched below.
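A minimal sketch of such a test strategy, assuming a simple hold-out split plus k-fold cross-validation (the split ratio and fold count are conventional defaults, not requirements):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Hold out a test set that stays untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training data only, to estimate
# out-of-sample performance while you iterate on the model.
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")
print(scores.mean(), scores.std())

# One final evaluation on the held-out test set.
final = LinearRegression().fit(X_train, y_train)
print(final.score(X_test, y_test))
```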

The remaining chapters (i.e. Ch. 7, 8) give a good overview of what you can do if you want to go beyond purely linear models. When you face strongly non-linear data, you may look at "generalized additive models" (GAMs, Ch. 7 in ISL). A Random Forest is often a good choice as well when the right parameterization of the data is unclear (Ch. 8 in ISL); see the sketch below.
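Unlike linear regression, a Random Forest does have hyperparameters worth tuning. A minimal sketch combining it with a small cross-validated grid search (the grid values here are arbitrary examples):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Try each hyperparameter combination with 5-fold CV and keep the best.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
# Impurity-based importances: another rough view of which features matter.
print(grid.best_estimator_.feature_importances_)
```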
