Are linear models better when dealing with too many features? If so, why?

I had to build a classification model to predict a user's rating from their review text. (I was working with this dataset: Trip Advisor Hotel Reviews.)

After some preprocessing, I compared the results of a Logistic Regression with a CatBoost Classifier, both with their default hyperparameters. The Logistic Regression gave me a better AUC and F1-score.

I've heard some colleagues say that this happened because linear models are better when dealing with many features (they mentioned 500). Is that correct? Is there any relation between the number of features and model performance? I think I missed something in the theory.

Topic catboost feature-engineering linear-regression decision-trees logistic-regression

Category Data Science


There is some important information missing from your question, e.g. what the default hyperparameters are and what kind of logistic regression you used.

When you use sklearn.linear_model.LogisticRegression, you will see in the docs that the first hyperparameter is the penalty, which defaults to l2. This means that by default, "shrinkage" of the parameters is applied: regularization shrinks the coefficients of features that are not very helpful for predicting the outcome toward zero. This is exactly what you want when you have "high-dimensional" data (many features and not that many observations).
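A minimal sketch of what that default looks like, assuming a toy high-dimensional dataset in place of your TF-IDF review features (penalty="l2" and C=1.0 are sklearn's actual defaults; the data below is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for "high-dimensional" data: many features, few observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# penalty="l2" and C=1.0 are the defaults; max_iter is raised only to ensure convergence.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X, y)

# Smaller C means stronger shrinkage: coefficients of uninformative features move toward 0.
print(np.abs(clf.coef_).mean())
```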

Tree-based boosted models are not per se "bad" in high-dimensional settings. However, to achieve good performance, it may be necessary to introduce column subsampling and possibly row subsampling ("stochastic gradient boosting"), as well as "enough" boosting rounds (ideally with a low learning rate). The reason is that you want to show the boosting algorithm as much information as possible while avoiding dominance by a few "powerful" features. Since each boosting round only grows "shallow" trees (usually 5-8 splits), a few "powerful" features will dominate if you do not randomly sample columns (features) in each boosting round. As a consequence, important details may not be learned when a few powerful features dominate.
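A hedged sketch of such a setup with CatBoost: the parameter values below are illustrative, not tuned, and the fit call uses hypothetical train/validation splits you would substitute with your own data.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,             # "enough" boosting rounds ...
    learning_rate=0.03,          # ... combined with a low learning rate
    depth=6,                     # shallow trees (the CatBoost default)
    rsm=0.3,                     # column subsampling: each tree sees ~30% of the features
    bootstrap_type="Bernoulli",
    subsample=0.8,               # row subsampling ("stochastic gradient boosting")
    verbose=False,
)

# Hypothetical splits; replace with your own:
# model.fit(X_train, y_train, eval_set=(X_valid, y_valid), use_best_model=True)
```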

You could inspect the feature importances of both models to see whether there are strong differences, for example:
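A rough, self-contained sketch of that comparison on toy data (with your models, you would reuse the fitted Logistic Regression and CatBoost Classifier and your actual feature names instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# Toy data where only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
cat = CatBoostClassifier(iterations=200, verbose=False).fit(X, y)

# Logistic regression: absolute coefficient size (features should be on a
# comparable scale, e.g. TF-IDF, for this to be meaningful).
logreg_importance = np.abs(logreg.coef_).mean(axis=0)

# CatBoost: built-in importances (PredictionValuesChange by default).
cat_importance = cat.get_feature_importance()

# Compare the top features according to the linear model.
for i in np.argsort(-logreg_importance)[:5]:
    print(f"feature {i:3d}  logreg={logreg_importance[i]:.3f}  catboost={cat_importance[i]:.3f}")
```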
