Searching for a machine learning algorithm for a regression problem with many features

I have a machine learning problem with about 160 features and 400 cases, and I want to find the best predictors of a continuous outcome. The dataset contains variables describing psychotherapists and their clients, and I want to predict therapy outcome.

I used lasso regression in nested 20-fold cross-validation and identified about 20 top predictors (model fit of about 0.97 NRMSE). (I decided against a separate holdout set because I have too few cases.) I then expected to improve performance with xgboost, but even with GridSearch tuning (colsample_bytree=0.3, learning_rate=0.01, max_depth=2, n_estimators=1000) I only reached an NRMSE of about 1.01. Is xgboost overfitting? (A simplified sketch of my setup follows right after the list below.) Right now, I am not sure how I can improve model performance. Do I

  1. need to use a dimensionality reduction technique (possibly PCA?) before I employ ML?
  2. select the lasso top predictors as my new prediction features?
  3. use a different algorithm (possibly kNN or SVM?) on either 1. or 2.? (Options 1 and 2 are sketched in the second code block below.)
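
For context, here is the simplified sketch of my current setup mentioned above, comparing both models under the same outer folds. The synthetic data, the scaling step, the random seed, the exact grid, and the NRMSE normalisation by the standard deviation of the outcome are all placeholder choices of the sketch, not my real setup:

```python
# Simplified sketch: lasso baseline vs. tuned xgboost under the same
# outer 20-fold CV, so the NRMSE values are directly comparable.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# Stand-in data with the same shape as my problem (400 cases, 160 features).
X, y = make_regression(n_samples=400, n_features=160, noise=10, random_state=0)

outer_cv = KFold(n_splits=20, shuffle=True, random_state=0)

# Inner model selection (lasso alpha, xgboost grid) is nested inside the
# outer loop, so hyperparameter tuning never sees the outer test folds.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))

param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.01, 0.05],
    "colsample_bytree": [0.3, 0.7],
}
xgb = GridSearchCV(XGBRegressor(n_estimators=1000), param_grid,
                   cv=5, scoring="neg_root_mean_squared_error")

for name, model in [("lasso", lasso), ("xgboost", xgb)]:
    rmse = -cross_val_score(model, X, y, cv=outer_cv,
                            scoring="neg_root_mean_squared_error")
    # NRMSE has several definitions; RMSE / std(y) is used here.
    print(name, rmse.mean() / y.std())
```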

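To make options 1 and 2 concrete, here is a rough sketch of how both could be wired into leakage-free pipelines; the component count, the lasso alpha, the selection threshold, and the kNN regressor are placeholder choices, not tuned values:

```python
# Rough sketch of options 1 and 2; every step sits inside the pipeline,
# so PCA / feature selection are re-fit per CV fold (no leakage).
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Option 1: PCA down to a handful of components, then a simple regressor.
pca_knn = make_pipeline(StandardScaler(),
                        PCA(n_components=20),
                        KNeighborsRegressor(n_neighbors=10))

# Option 2: lasso as a feature selector (keep non-zero coefficients),
# then another regressor fitted on the surviving features only.
lasso_knn = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1, max_iter=10000), threshold=1e-5),
    KNeighborsRegressor(n_neighbors=10),
)
# Either pipeline can be dropped into the nested CV from the sketch above.
```
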
By the way, I use the shap framework in Python to assess feature importance, so I am not bound to any particular algorithm for identifying the best features.
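
For completeness, this is roughly how the shap importances can be computed for the xgboost model (assuming a fitted XGBRegressor named `model` and the feature matrix `X` from the sketch above):

```python
# Minimal shap sketch; TreeExplainer is the fast path for tree ensembles.
import shap

explainer = shap.TreeExplainer(model)   # model: a fitted XGBRegressor
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)       # global feature-importance overview
```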

Thanks in advance for any help!

Topic: feature-importances xgboost regression feature-selection dimensionality-reduction

Category: Data Science
