How to choose the random seed?
I understand this question may sound strange, but how do I pick the final random_seed for my classifier?
Below is some example code. It uses SGDClassifier from scikit-learn on the iris dataset, and GridSearchCV to find the best random_state:
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the iris dataset and hold out 25% as a test set
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Treat random_state as if it were a hyper-parameter to tune
parameters = {'random_state': [1, 42, 999, 123456]}
sgd = SGDClassifier(max_iter=20, shuffle=True)
clf = GridSearchCV(sgd, parameters, cv=5)
clf.fit(X_train, y_train)

print("Best parameter found:")
print(clf.best_params_)

print("\nScore per grid set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
The results are the following:
Best parameter found:
{'random_state': 999}
Score per grid set:
0.732 (+/-0.165) for {'random_state': 1}
0.777 (+/-0.212) for {'random_state': 42}
0.786 (+/-0.277) for {'random_state': 999}
0.759 (+/-0.210) for {'random_state': 123456}
In this case, the difference between the best and second-best score is 0.009. Of course, the train/test split also makes a difference.
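To get a feel for how much the split alone matters, a quick, purely illustrative check (the split seeds below are arbitrary, and I fix random_state=999 just as an example) would be to refit the same classifier on a few different train_test_split seeds and compare the held-out accuracy:
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Arbitrary split seeds, just to see how much the split itself moves the score
for split_seed in [0, 7, 42, 1234]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=split_seed)
    clf = SGDClassifier(max_iter=20, shuffle=True, random_state=999)
    clf.fit(X_tr, y_tr)
    print("split seed %6d -> test accuracy %.3f" % (split_seed, clf.score(X_te, y_te)))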
This is just an example, where one could argue that it doesn't matter which one I pick; the random_state should not affect the workings of the algorithm. However, nothing prevents a scenario where the gap between the best and the second-best score is 0.1, 0.2, or 0.99, i.e. a scenario where the random_seed has a big impact.
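The kind of check I have in mind (just a sketch, with an arbitrary set of seeds) is to look at the spread of cross-validated scores over many random_state values and compare it against the cross-validation standard deviation:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Arbitrary set of seeds, purely to estimate how sensitive the score is to random_state
seeds = range(50)
scores = [cross_val_score(SGDClassifier(max_iter=20, shuffle=True, random_state=s),
                          X, y, cv=5).mean() for s in seeds]
print("mean %.3f, std %.3f, min %.3f, max %.3f"
      % (np.mean(scores), np.std(scores), np.min(scores), np.max(scores)))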
- In the case where the random_seed makes a big impact, is it fair to hyper-parameter optimize it?
- When is the impact too small to care?
Topic hyperparameter-tuning randomized-algorithms hyperparameter
Category Data Science