GridSearchCV not performing well on ML models

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    svm2 = SVC()
    # search over regularization strength, kernel type and kernel coefficient
    grid = {
        'C': [0.1, 1, 10, 100, 1000],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'gamma': [1, 0.1, 0.01, 0.001, 0.0001]
    }
    svm_grid = GridSearchCV(estimator=svm2, param_grid=grid, cv=3, n_jobs=-1)
    svm_grid.fit(xtrain, ytrain)
    svm_grid.best_params_

OUTPUT

{'C': 1, 'gamma': 1, 'kernel': 'rbf'}

CODE

svm_grid.score(xtrain, ytrain)

0.9884434814012278

svm_grid.score(xtest, ytest)

0.8513708513708513

My question is: even after performing GridSearch, why is the model still overfitting, and how can I further increase the accuracy and combat the overfitting?

I am facing the same issue with RandomForestClassifier in GridSearch:

    from sklearn.ensemble import RandomForestClassifier

    grid = {
        'n_estimators': [10, 20, 40, 50, 100, 150, 200, 500],
        'max_features': ['auto', 'sqrt'],  # note: 'auto' was removed in scikit-learn 1.3
        'max_depth': [3, 5, 7, 9, 11, 15],
        'bootstrap': [True, False]
    }
    rf = RandomForestClassifier()
    rf_random = GridSearchCV(estimator=rf, param_grid=grid, cv=3, verbose=2, n_jobs=-1)
    rf_random.fit(xtrain, ytrain)
    rf_random.score(xtrain, ytrain)

1.0

rf_random.score(xtest, ytest)

0.8427128427128427

I am not able to understand why GridSearch is not helping.

Topic grid-search overfitting decision-trees random-forest svm

Category Data Science


In order to fix overfitting, you can try the following things:

1.) Cross validation (see the sketch after this list)

2.) Get more data (won't work every time)

3.) Remove redundant features

4.) Early stopping rounds if you are using GBM or DL

5.) Regularization (for example Ridge or Lasso in the case of Linear Regression)

6.) Perform extensive Feature engineering
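
For item 1, cross validation does not prevent overfitting by itself, but it gives you an honest estimate of generalization. A minimal sketch, reusing your xtrain/ytrain and the best parameters GridSearch found (C=1, gamma=1, rbf):

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # compare cross-validated accuracy against training accuracy;
    # a large gap between the two is the overfitting you are seeing
    svm = SVC(C=1, gamma=1, kernel='rbf')
    cv_scores = cross_val_score(svm, xtrain, ytrain, cv=5)
    svm.fit(xtrain, ytrain)
    print('train accuracy:', svm.score(xtrain, ytrain))
    print('cv accuracy:', cv_scores.mean(), '+/-', cv_scores.std())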


All GridSearchCV does is look for the best-performing model among the parameter combinations you supply; it won't fix overfitting for you.

Overfitting happens when the model is too well adjusted to the training data. In the case of SVM, the model with C=1000 would almost certainly overfit, which is why it was not selected; C=0.1 would probably underfit, which is also why it was not selected.
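
You can see this trade-off directly by sweeping C and comparing the train and test scores. A minimal sketch, reusing your existing xtrain/xtest splits (the train/test gap typically shrinks as C decreases, at the cost of training accuracy):

    from sklearn.svm import SVC

    # small C = strong regularization (tends to underfit),
    # large C = weak regularization (tends to overfit)
    for C in [0.1, 1, 1000]:
        svm = SVC(C=C, gamma=1, kernel='rbf').fit(xtrain, ytrain)
        print(C, svm.score(xtrain, ytrain), svm.score(xtest, ytest))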

Random Forests are fairly well protected from overfitting (Breiman L (2001). "Random Forests." Machine Learning, 45, 5–32).

I don't really see an overfitting issue here; it is to be expected that the training-set score will be better than the test-set score.

The parameter ranges for the SVM are far too extreme. You should narrow your search around the values in the output.
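
For example, a second, finer grid centred on the previous best parameters (the exact values below are just an illustration):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # refine the search around C=1, gamma=1 from the first run
    grid = {
        'C': [0.3, 0.5, 1, 2, 3, 5],
        'gamma': [0.3, 0.5, 1, 2, 3],
        'kernel': ['rbf']
    }
    svm_grid = GridSearchCV(SVC(), param_grid=grid, cv=3, n_jobs=-1)
    svm_grid.fit(xtrain, ytrain)
    print(svm_grid.best_params_, svm_grid.best_score_)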

Edit:

Hyperparameter tuning is not the only tool for improving accuracy. If you are not satisfied with the results, you can do some feature engineering or review your methodology so far.
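
One cheap thing worth checking before heavier feature engineering: RBF-kernel SVMs are sensitive to feature scale. A minimal sketch, assuming your features are not yet standardized, that tunes the SVM on top of a scaler:

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # scaling happens inside the pipeline, so each CV fold is scaled
    # on its own training part and no information leaks between folds
    pipe = make_pipeline(StandardScaler(), SVC())
    grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 1]}
    search = GridSearchCV(pipe, param_grid=grid, cv=3, n_jobs=-1)
    search.fit(xtrain, ytrain)
    print(search.best_params_, search.score(xtest, ytest))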
