How to choose the random seed?

I understand this question may sound strange, but: how do I pick the final random_state for my classifier?

Below is some example code. It uses SGDClassifier from scikit-learn on the iris dataset, and GridSearchCV to find the best random_state:

from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the iris dataset and hold out 25% as a test set
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Treat random_state itself as the hyper-parameter to search over
parameters = {'random_state': [1, 42, 999, 123456]}

sgd = SGDClassifier(max_iter=20, shuffle=True)
clf = GridSearchCV(sgd, parameters, cv=5)

clf.fit(X_train, y_train)

print("Best parameter found:")
print(clf.best_params_)
print("\nScore per grid set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    # Mean CV accuracy +/- two standard deviations for each candidate seed
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

The results are the following:

Best parameter found:
{'random_state': 999}

Score per grid set:
0.732 (+/-0.165) for {'random_state': 1}
0.777 (+/-0.212) for {'random_state': 42}
0.786 (+/-0.277) for {'random_state': 999}
0.759 (+/-0.210) for {'random_state': 123456}

In this case, the gap between the best and second-best score is only 0.009. Of course, the train/test split also makes a difference.

This is just an example, and one could argue that here it doesn't matter which value I pick: the random_state should not affect how the algorithm works. However, nothing rules out a scenario where the gap between the best and second-best score is 0.1, 0.2, or even 0.99, i.e. a scenario where the random seed makes a big impact.
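One way to put a number on how much the seed matters is to score many arbitrary seeds and look at the spread, rather than hand-picking four values. Here is a minimal sketch of that idea, reusing the iris setup above (the choice of 50 seeds is arbitrary):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# Mean 5-fold CV score for each of 50 arbitrary seeds
scores = []
for seed in range(50):
    clf = SGDClassifier(max_iter=20, shuffle=True, random_state=seed)
    scores.append(cross_val_score(clf, X, y, cv=5).mean())

scores = np.array(scores)
print("mean=%.3f  std=%.3f  max-min=%.3f"
      % (scores.mean(), scores.std(), scores.max() - scores.min()))

If the max-min spread is small next to the cross-validation noise itself, the seed is unlikely to matter.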

  • In the case where the random_seed makes a big impact, is it fair to hyper-parameter optimize it?
  • When is the impact too small to care?


TL;DR: I would suggest not optimising over the random seed. A better investment of your time would be to improve other parts of your model, such as the pipeline, the underlying algorithms, the loss function... heck, even optimise the runtime performance! :-)


This is an interesting question, even though (in my opinion) the seed should not be a parameter to optimise.

I can imagine that researchers, in their struggle to beat the current state of the art on benchmarks such as ImageNet, may well run the same experiments many times with different random seeds and simply pick or average the best. However, the difference should not be considerable.

If your algorithm has enough data and goes through enough iterations, the impact of the random seed should tend towards zero. Of course, as you say, in extreme cases it can have a huge impact. Imagine I am classifying a batch of images as cat or dog. If I have a batch size of 1 and only 2 randomly sampled images, of which one is classified correctly and one is not, then the random seed governing which image is selected determines whether I get 100% or 0% accuracy on that batch.
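A minimal sketch of that scaling effect, on synthetic data (the sample sizes and the 20 seeds are arbitrary choices of mine): the spread of the score across seeds shrinks as the dataset grows.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

for n in (20, 200, 2000):
    X, y = make_classification(n_samples=n, random_state=0)
    # Mean CV score for 20 different seeds at this dataset size
    per_seed = [cross_val_score(SGDClassifier(random_state=s), X, y, cv=5).mean()
                for s in range(20)]
    print("n=%5d  std across seeds = %.3f" % (n, np.std(per_seed)))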


Some more basic information:

The use of a random seed is simply to allow results to be as (close to) reproducible as possible. All random number generators are really pseudo-random generators: the values appear random, but are not. In essence, this follows from the fact that (non-quantum) computers are deterministic machines, and so, given the same input, will always produce the same output. Have a look here for some more information and relevant links to the literature.
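A tiny demonstration of that determinism: seeding NumPy's generator twice with the same value yields the exact same "random" sequence.

import numpy as np

a = np.random.default_rng(seed=42).random(3)
b = np.random.default_rng(seed=42).random(3)
print(a)
print(b)
print(np.array_equal(a, b))  # True: same seed, same sequence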


You don't. It's random, you shouldn't control it. The parameter is only there so we can replicate experiments.

For algorithms that produce hugely different results under different randomness (such as the original K-Means [not the ++ version] and randomly initialised neural networks), it is common to run the algorithm multiple times and keep the run that performs best according to some metric. You can do that simply by running the algorithm again, without re-seeding.
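As a sketch of that restart pattern: scikit-learn's KMeans even builds it in via n_init, and doing it by hand looks like this (the cluster count, number of restarts, and synthetic data are arbitrary choices here):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ten restarts, each drawing fresh random initial centroids (no seed fixed),
# keeping the run with the lowest inertia (tightest clusters)
best = min(
    (KMeans(n_clusters=3, n_init=1).fit(X) for _ in range(10)),
    key=lambda km: km.inertia_,
)
print(best.inertia_)

# The built-in equivalent: KMeans(n_clusters=3, n_init=10).fit(X)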

But do not treat the random seed as something you can control. If you want your model to be replicable later, simply generate a seed (systems derive these from sources like the processor clock or OS entropy, I believe), store it, and reuse it. Choosing a random seed because it performs best is pure overfitting to happenstance.
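One common way to do that in Python (a sketch; drawing fresh OS entropy via NumPy's SeedSequence is my choice here, not the only option):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Draw a fresh seed from OS entropy, and log it so the run can be replayed
seed = np.random.SeedSequence().entropy % (2**32)  # scikit-learn wants an int < 2**32
print("seed used for this run:", seed)

clf = SGDClassifier(random_state=seed)  # pass the stored seed to replicate the run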

Note that all of this assumes a decent implementation of a random number generator with a decent random seed. Some RNG/seed pairs can produce predictable or otherwise poor random sequences.
