GridSearchCV and time complexity

So, I was learning and trying to implement a GridSearch. I have a question regarding the following code, which I wrote:

from sklearn.metrics import make_scorer, fbeta_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

dtc = DecisionTreeClassifier(random_state=42, max_features='auto', class_weight='balanced')
clf = AdaBoostClassifier(base_estimator=dtc, random_state=42)

parameters = {'base_estimator__criterion': ['gini', 'entropy'],
              'base_estimator__splitter': ['best', 'random'],
              'base_estimator__max_depth': list(range(1, 4)),
              'base_estimator__min_samples_leaf': list(range(1, 4)),
              'n_estimators': list(range(50, 500, 50)),
              'learning_rate': list(range(0.5, 10)),
              }

scorer = make_scorer(fbeta_score, beta=0.5)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs= -1)

grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_fit.best_estimator_

predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

Is this overkill? I mean, I have a decent PC, or at least I think it is, but this piece of code ran for more than 5 hours and... it never finished! I had to stop it and write a simpler version with fewer parameters for the grid search so that it would run and deliver some results. The simpler version was the following:

parameters = {'base_estimator__splitter': ['best'],
              'base_estimator__max_depth': [2, 3, 5],
              'n_estimators': [100, 250, 500],
              'learning_rate': [1.0, 1.5, 2.0],
              }

and it took 20 minutes.

I do not know much about GridSearch, and I think the code itself is fine (it runs with no errors), so it didn't deliver any result simply because it never finished running. Is there any good reference about the time complexity of running a grid search, in particular when searching over parameters for AdaBoost? I have no intuition for this, and after 3 or 4 hours I wasn't sure whether the problem was that my PC was not powerful enough, or that the code would run forever because I wrote something wrong, or something else. But since I changed the parameters everything went fine, so the problem must be the running time. Somebody told me to run a RandomizedSearch, which I had never heard of until yesterday; it could be a solution. But is this always an issue with GridSearch? Is there any piece of code that can tell me how many parameter combinations were searched by the algorithm, or anything like that? It is frustrating to look at the screen and the clock, realize that 5 hours have passed, and have no idea how much longer it will run, or even whether it will ever finish...

Topic time-complexity gridsearchcv

Category Data Science


There's no difficult time-complexity issue here; you just need to understand what GridSearchCV does, and that will clarify everything. It's actually very simple:

The sklearn documentation says that GridSearchCV performs an "exhaustive search over specified parameter values for an estimator". An exhaustive search means that the algorithm simply iterates over all the possibilities in the search space. The search space consists of all the combinations of values for the parameters, that is:

  • base_estimator__criterion: 2,
  • base_estimator__splitter: 2,
  • base_estimator__max_depth: 3,
  • base_estimator__min_samples_leaf: 3,
  • n_estimators: 9,
  • learning_rate: I don't know, because range(0.5, 10) raises a TypeError (range only accepts integers); let's say 5 values, for instance.

Now that gives us $2 \times 2 \times 3 \times 3 \times 9 \times 5 = 1620$ combinations of parameters. By default GridSearchCV uses 5-fold cross-validation, so the function will train the model and evaluate it $1620 \times 5 = 8100$ times. Of course the time taken depends on the size and complexity of the data, but even if a single training/evaluation run takes only 10 seconds, that's still around 81,000 s = 1,350 min = 22.5 hours.
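As a sanity check, the size of the grid can be computed directly from the parameter lists. A minimal sketch, in which the learning_rate list is a stand-in with 5 values (since the original range(0.5, 10) is invalid):

```python
from math import prod

# Parameter grid from the question; learning_rate is replaced by an
# explicit 5-value list, because range(0.5, 10) raises a TypeError.
parameters = {
    'base_estimator__criterion': ['gini', 'entropy'],       # 2 values
    'base_estimator__splitter': ['best', 'random'],         # 2 values
    'base_estimator__max_depth': list(range(1, 4)),         # 3 values
    'base_estimator__min_samples_leaf': list(range(1, 4)),  # 3 values
    'n_estimators': list(range(50, 500, 50)),               # 9 values
    'learning_rate': [0.5, 1.0, 2.0, 5.0, 10.0],            # 5 values (assumed)
}

# The grid size is the product of the lengths of all the value lists.
n_combinations = prod(len(values) for values in parameters.values())
n_fits = n_combinations * 5  # default 5-fold cross-validation

print(n_combinations, n_fits)  # → 1620 8100
```

(sklearn's `ParameterGrid` computes the same count: `len(ParameterGrid(parameters))`.)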

By comparison, your second set of parameters contains only $1 \times 3 \times 3 \times 3 = 27$ combinations. Since it took 20 minutes, we can estimate that a single combination (a full CV run) requires about 0.7 min. This means that in the first case above you would need roughly $1620 \times 0.7 \approx 1200$ min = 20 hours for the full process.

The usual solutions are:

  • decrease the number of combinations, as you did in the second case
  • parallelize the process with the n_jobs argument, which roughly divides the time by the number of parallel jobs. Of course this works only if your machine has multiple cores and you don't plan to use them for anything else in the meantime ;) (your code already sets n_jobs=-1, which uses all available cores)
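If you want to try the RandomizedSearchCV that was suggested to you, here is a minimal sketch. For brevity it searches the decision-tree parameters directly rather than through AdaBoost (with AdaBoost you would pass the base_estimator__-prefixed grid exactly as in your code); the n_iter budget and verbose level are arbitrary choices:

```python
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42, class_weight='balanced')

# The decision-tree part of the search space from the question
# (2 * 2 * 3 * 3 = 36 combinations in total).
parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': list(range(1, 4)),
              'min_samples_leaf': list(range(1, 4)),
              }

scorer = make_scorer(fbeta_score, beta=0.5)

# n_iter=20 samples only 20 of the 36 combinations instead of trying
# them all, and verbose=2 prints a summary of how many fits will run
# plus one line per fit, so you can watch the progress on screen.
rand_obj = RandomizedSearchCV(dtc, parameters, n_iter=20, scoring=scorer,
                              n_jobs=-1, verbose=2, random_state=42)

# rand_fit = rand_obj.fit(X_train, y_train)
# best_clf = rand_fit.best_estimator_
```

With verbose set, the search reports up front how many candidates and total fits it will run, which answers the question of how to see how much work the algorithm is doing.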
