Fashion MNIST: Is there an easy way to extract only 1% of the data to do a minimal gridsearch?

I am trying to implement several models on Fashion-MNIST. I have imported the data according to the tf.keras tutorial:

import tensorflow as tf
from tensorflow import keras
import sklearn
import numpy as np

f_mnist = keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = f_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']   
print(train_images.shape)
print(train_labels.shape)
(60000, 28, 28)
(60000,)

print(test_images.shape)
print(test_labels.shape)
(10000, 28, 28)
(10000,)
 
# Need to concatenate, as GridSearchCV takes the entire set as input
all_images = np.concatenate((train_images, test_images))
all_labels = np.concatenate((train_labels, test_labels))

print(all_images.shape)
print(all_labels.shape)
(70000, 28, 28)
(70000,)

The 10 labels are equally distributed in both the training and the testing set.

Since this is only for practice, I would like to implement a minimal grid search, but instead of using the entire set of 70,000 samples I'd like to extract only, say, 1% and run the grid search on that.

That way I can learn how it works without spending too much time on the computation.

However, the tutorials I see only use GridSearchCV from sklearn.model_selection, which takes the entire set as input:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Splitting the entire set into train and test
# (SVC needs 2D input, so the 28x28 images are flattened to 784-long vectors)
X_train, X_test, y_train, y_test = train_test_split(
    all_images.reshape(len(all_images), -1), all_labels,
    test_size=0.3, random_state=101)

parameters_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
                   'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                   'kernel': ['rbf']}
grid = GridSearchCV(SVC(), parameters_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)

So far the only workaround I could think of is to use only the set of test_images, as it is smaller. But I guess it would still run for a while, given that it contains 10,000 images...

I also thought about changing the call to use just a smaller portion for training, like so:

# Splitting the test set so that only 1% of it is used for training
X_train, X_test, y_train, y_test = train_test_split(test_images, test_labels,
                                                     test_size=0.99, random_state=101)

That way I'd use only the test_images, which hold just 10,000 samples. I think this would lead to the models being trained on only 1% of those 10,000, with the rest used for testing.
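As a quick sanity check of that idea, here is a sketch (assuming train_test_split from sklearn.model_selection, as above) showing that only 100 samples would end up in the training portion:

from sklearn.model_selection import train_test_split

# With test_size=0.99, only 1% of the 10,000 test images land in X_train
X_train, X_test, y_train, y_test = train_test_split(test_images, test_labels,
                                                     test_size=0.99, random_state=101)

print(X_train.shape)  # (100, 28, 28) -> 100 samples left for the grid search
print(X_test.shape)   # (9900, 28, 28)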

Is there a better, more Pythonic way to extract only 1% of all_images or test_images together with the corresponding all_labels or test_labels?

Obviously I would build the final model by feeding it all 60,000 training samples and subsequently test it on the 10,000.

I googled and talked to colleagues, but found no hits or answers.

Topic grid-search keras scikit-learn machine-learning

Category Data Science


There are $60000$ images, and you want a subset of size $N$.

When I have had to do this, I have randomly drawn unique integers and used them as indices.

Once you have your $N$ unique indices, apply those to the image tensor to select the images, and then apply them to the labels to select the corresponding labels.

Now you have your subset of images with their labels.
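In code, that could look something like the following sketch (using numpy on the train_images / train_labels arrays from the question; N = 600, i.e. 1% of the 60,000 training images, is just an example):

import numpy as np

N = 600  # e.g. 1% of the 60,000 training images
rng = np.random.default_rng(seed=42)

# Draw N unique indices, then use the same indices for images and labels
idx = rng.choice(len(train_images), size=N, replace=False)
subset_images = train_images[idx]
subset_labels = train_labels[idx]

print(subset_images.shape)  # (600, 28, 28)
print(subset_labels.shape)  # (600,)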

To ensure that you get enough of each label, or don't miss a label entirely, you could do this ten times, once per label, and then concatenate the ten draws at the end to form your whole subset. Unless you're picking a tiny $N$, though, even the simple random draw is likely to give you a pretty representative sample, with labels distributed at approximately the correct 10% per class.
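A sketch of that per-label variant (again assuming the arrays from the question; n_per_class = 60 gives a balanced 1% subset):

import numpy as np

n_per_class = 60  # 60 images per label -> 600 in total, i.e. 1%
rng = np.random.default_rng(seed=42)

parts_images, parts_labels = [], []
for label in range(10):
    # All positions where this label occurs
    label_idx = np.flatnonzero(train_labels == label)
    # Draw n_per_class unique positions for this label
    chosen = rng.choice(label_idx, size=n_per_class, replace=False)
    parts_images.append(train_images[chosen])
    parts_labels.append(train_labels[chosen])

# Concatenate the ten per-label draws into one balanced subset
subset_images = np.concatenate(parts_images)
subset_labels = np.concatenate(parts_labels)

print(subset_images.shape)  # (600, 28, 28)
print(subset_labels.shape)  # (600,)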
