GridSearchCV and time complexity
So, I was learning and trying to implement a GridSearch. I have a question regarding the following code, which I wrote:
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
dtc = DecisionTreeClassifier(random_state=42, max_features='auto', class_weight='balanced')
clf = AdaBoostClassifier(base_estimator=dtc, random_state=42)
parameters = {'base_estimator__criterion': ['gini', 'entropy'],
'base_estimator__splitter': ['best', 'random'],
'base_estimator__max_depth': list(range(1,4)),
'base_estimator__min_samples_leaf': list(range(1,4)),
'n_estimators': list(range(50, 500, 50)),
'learning_rate': [0.5 * i for i in range(1, 20)],  # range() only takes integers, so the float steps are listed explicitly
}
scorer = make_scorer(fbeta_score, beta=0.5)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
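To get a sense of how big this search actually is, I later counted the combinations with ParameterGrid (assuming I am using it correctly, and that GridSearchCV uses its default 5-fold cross-validation):
from sklearn.model_selection import ParameterGrid
n_candidates = len(ParameterGrid(parameters))  # distinct hyperparameter combinations in the grid above
n_folds = 5                                    # GridSearchCV default in recent scikit-learn versions
print(n_candidates, "candidates x", n_folds, "folds =", n_candidates * n_folds, "fits")
If I am counting right, that comes out to several thousand candidate combinations, and each one is an AdaBoost ensemble with up to 450 trees.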
Is this overkill? I have a decent PC, or at least I think it is, but this code ran for more than 5 hours and still did not finish! I had to stop it and write a simpler version with fewer parameters so that the grid search would actually complete and deliver some results. The simpler version was the following:
parameters = {'base_estimator__splitter': ['best'],
'base_estimator__max_depth': [2, 3, 5],
'n_estimators': [100, 250, 500],
'learning_rate': [1.0, 1.5, 2.0],
}
and it took 20 minutes.
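If I am counting right, this reduced grid is only 1 × 3 × 3 × 3 = 27 combinations, i.e. 135 fits with 5-fold cross-validation, compared to the several thousand combinations in the full grid above, which I guess explains the difference in running time.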
I do not know much about grid search, and I think the code itself is fine (it runs without errors), so it did not deliver any result simply because it never finished. Is there a good reference on the time complexity of running a grid search, in particular over parameters for AdaBoost? I have no intuition for this, and after 3 or 4 hours I could not tell whether my PC was simply not powerful enough, whether the code would run forever because I had written something wrong, or something else entirely. Since everything went fine once I shrank the grid, the problem must be the running time. Somebody told me to run a RandomizedSearch, which I had never heard of until yesterday; it could be a solution. But is this always an issue with GridSearch? Is there any piece of code that can tell me how many parameter combinations have been searched so far, or anything like that? It is frustrating to look at the screen and the clock, realize that 5 hours have passed, and have no idea how much longer it will run or even whether it will ever finish.
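From a quick look at the docs, I think the RandomizedSearchCV suggestion translates to something like the sketch below (same clf, parameters, and scorer as above; the n_iter budget and verbose level are just my guesses). The verbose setting at least prints a line per fit, so I can see progress instead of staring at a blank screen:
from sklearn.model_selection import RandomizedSearchCV
rand_obj = RandomizedSearchCV(clf, param_distributions=parameters,
                              n_iter=20,       # sample only 20 combinations instead of the full grid
                              scoring=scorer,
                              n_jobs=-1,
                              verbose=2,       # print progress for each candidate/fold
                              random_state=42)
rand_fit = rand_obj.fit(X_train, y_train)
best_clf = rand_fit.best_estimator_
best_predictions = best_clf.predict(X_test)
Is that roughly how it is supposed to be used?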
Topic time-complexity gridsearchcv
Category Data Science