Searching for a machine learning algorithm for a regression problem with many features
I have a machine learning problem with about 160 features and 400 cases, and I want to find the best predictors for a continuous outcome. The dataset contains variables describing psychotherapists and clients, and I want to predict therapy outcome.
I used lasso regression with nested 20-fold cross-validation and identified about 20 top predictors (model fit: NRMSE ≈ 0.97). (I decided not to create a separate holdout set because I have too few cases.)
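Roughly, my lasso setup looks like this (a simplified sketch with placeholder random data standing in for my real 400 × 160 dataset; alpha is tuned in the inner loop, generalization error is estimated in the outer loop):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 160))                      # placeholder for the real data
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

# Inner CV tunes the regularization strength; outer CV estimates generalization error.
inner = KFold(n_splits=20, shuffle=True, random_state=0)
outer = KFold(n_splits=20, shuffle=True, random_state=1)
model = make_pipeline(StandardScaler(),
                      LassoCV(cv=inner, n_alphas=30, max_iter=10000))

scores = cross_val_score(model, X, y, cv=outer,
                         scoring="neg_root_mean_squared_error")
nrmse = -scores.mean() / y.std()                     # RMSE normalized by the outcome's SD
print(f"nested-CV NRMSE: {nrmse:.2f}")
```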
However, I thought I could improve model performance with xgboost, but even after a grid search (best parameters: colsample_bytree=0.3, learning_rate=0.01, max_depth=2, n_estimators=1000) I could not match that fit (NRMSE ≈ 1.01). Is xgboost overfitting?
Right now, I am not sure how to improve model performance. Should I
1. use a dimension-reduction technique (possibly PCA?) before applying ML,
2. select the lasso top predictors as my new prediction features, or
3. use a different algorithm (possibly kNN or SVM?) on either 1. or 2.?
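Option 1 combined with option 3 would look something like this, I think (a sketch with placeholder data; putting PCA inside the pipeline so the components are fit only on the training folds and there is no leakage into the test folds):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 160))                      # placeholder for the real data
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

# PCA is refit inside each CV fold, so no information leaks from the test folds.
pipe = make_pipeline(StandardScaler(), PCA(n_components=30), SVR())
scores = cross_val_score(pipe, X, y,
                         cv=KFold(n_splits=20, shuffle=True, random_state=0),
                         scoring="neg_root_mean_squared_error")
nrmse = -scores.mean() / y.std()
print(f"PCA + SVR NRMSE: {nrmse:.2f}")
```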
By the way, I use the SHAP framework in Python to assess feature importance, so I am not tied to any particular algorithm for identifying the best features.
Thanks for any help in advance!