How to choose the random seed?
I understand this question may seem strange, but how do I pick the final random_seed for my classifier?
Below is an example. It uses the SGDClassifier from scikit-learn on the iris dataset, and GridSearchCV to find the best random_state:
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the iris data and hold out a test set
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Treat random_state as a hyper-parameter and grid-search over it
parameters = {'random_state': [1, 42, 999, 123456]}
sgd = SGDClassifier(max_iter=20, shuffle=True)
clf = GridSearchCV(sgd, parameters, cv=5)
clf.fit(X_train, y_train)

print("Best parameter found:")
print(clf.best_params_)

print("\nScore per grid set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
The results are the following:
Best parameter found:
{'random_state': 999}
Score per grid set:
0.732 (+/-0.165) for {'random_state': 1}
0.777 (+/-0.212) for {'random_state': 42}
0.786 (+/-0.277) for {'random_state': 999}
0.759 (+/-0.210) for {'random_state': 123456}
In this case, the difference in score between the best and the second-best seed is 0.009. Of course, the train/test split also makes a difference.
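To give a feel for that, here is a rough sketch (not part of my actual experiment; the split seeds below are arbitrary) that repeats the same grid search under a few different split seeds and watches how much the winning random_state and its score move:

from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Repeat the same grid search for a few arbitrary split seeds to see how
# much the train/test split alone moves the scores.
for split_seed in [0, 7, 42, 1234]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=split_seed)
    search = GridSearchCV(SGDClassifier(max_iter=20, shuffle=True),
                          {'random_state': [1, 42, 999, 123456]}, cv=5)
    search.fit(X_tr, y_tr)
    print(split_seed, search.best_params_, round(search.best_score_, 3))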
This is just an example, and here one could argue that it doesn't matter which seed I pick: the random_state should not affect how the algorithm works. However, nothing rules out a scenario where the difference between the best and the second-best seed is 0.1, 0.2, or even 0.99, i.e. a scenario where the random_seed has a big impact.
- In the case where the random_seed makes a big impact, is it fair to hyper-parameter optimize it?
- When is the impact too small to care? (See the sketch below for what I mean by measuring the impact.)
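For the second question, here is a rough sketch of how one might quantify whether the impact is too small to care about (the 50 seeds and the comparison against fold-to-fold noise are arbitrary choices of mine, not an established rule):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Cross-validate the same model under many seeds and record the mean score per seed.
seed_means = []
for seed in range(50):
    scores = cross_val_score(
        SGDClassifier(max_iter=20, shuffle=True, random_state=seed),
        X, y, cv=5)
    seed_means.append(scores.mean())

# If the spread across seeds is small compared with the fold-to-fold
# standard deviation, the seed choice is arguably noise rather than signal.
print("spread across seeds (max - min): %.3f" % np.ptp(seed_means))
print("fold-to-fold std (last seed):    %.3f" % scores.std())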
Topic hyperparameter-tuning randomized-algorithms hyperparameter
Category Data Science