Searching for a machine learning algorithm for a regression problem with many features
I have a machine learning problem with about 160 features and 400 cases, and I want to find the best predictors for a continuous outcome. The dataset contains variables describing psychotherapists and clients, and I want to predict therapy outcome.
I used lasso regression with nested 20-fold cross-validation and identified about 20 top predictors (model fit: NRMSE ≈ 0.97). (I decided not to create a separate holdout set because I have too few cases.)
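Roughly, my lasso setup looks like this (a simplified sketch with placeholder random data standing in for my real 400 × 160 dataset; alpha is tuned in the inner loop, generalization error is estimated in the outer loop):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 160))                      # placeholder for the real data
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

# Inner CV tunes the regularization strength; outer CV estimates generalization error.
inner = KFold(n_splits=20, shuffle=True, random_state=0)
outer = KFold(n_splits=20, shuffle=True, random_state=1)
model = make_pipeline(StandardScaler(),
                      LassoCV(cv=inner, n_alphas=30, max_iter=10000))

scores = cross_val_score(model, X, y, cv=outer,
                         scoring="neg_root_mean_squared_error")
nrmse = -scores.mean() / y.std()                     # RMSE normalized by the outcome's SD
print(f"nested-CV NRMSE: {nrmse:.2f}")
```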
However, I thought I could improve model performance with xgboost, but even after a grid search (best parameters: colsample_bytree=0.3, learning_rate=0.01, max_depth=2, n_estimators=1000) I could not match that fit (NRMSE ≈ 1.01). Is xgboost overfitting?
Right now, I am not sure how to improve model performance. Should I
1. use a dimension-reduction technique (possibly PCA?) before applying ML,
2. select the lasso top predictors as my new prediction features, or
3. use a different algorithm (possibly kNN or SVM?) on either 1. or 2.?
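Option 1 combined with option 3 would look something like this, I think (a sketch with placeholder data; putting PCA inside the pipeline so the components are fit only on the training folds and there is no leakage into the test folds):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 160))                      # placeholder for the real data
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

# PCA is refit inside each CV fold, so no information leaks from the test folds.
pipe = make_pipeline(StandardScaler(), PCA(n_components=30), SVR())
scores = cross_val_score(pipe, X, y,
                         cv=KFold(n_splits=20, shuffle=True, random_state=0),
                         scoring="neg_root_mean_squared_error")
nrmse = -scores.mean() / y.std()
print(f"PCA + SVR NRMSE: {nrmse:.2f}")
```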
By the way, I use the SHAP framework in Python to assess feature importance, so I am not tied to any particular algorithm for identifying the best features.
Thanks for any help in advance!