How to specify steps_per_epoch and validation_steps on an infinite dataset?

I have a huge CSV dataset, about 200 GB in size. I'm using CsvDataset to build a dataset generator that streams the data from disk while training the model. I want all of the data to be passed through on each epoch, so what should I pass for the parameters steps_per_epoch and validation_steps?

Here is my Keras model using the dataset.


    import pathlib
    import tensorflow as tf
    from tensorflow.keras import layers

    # The glob pattern must be a quoted string.
    training_csvs = sorted(str(p) for p in pathlib.Path('.').glob('path-to-data/Train_DS/*/*.csv'))
    print(training_csvs)

    # `defaults` (per-column record defaults) and `selected_indices`
    # (indices of the columns to load) are defined elsewhere.
    training_dataset = tf.data.experimental.CsvDataset(
        training_csvs,
        record_defaults=defaults,
        compression_type=None,
        buffer_size=None,
        header=True,
        field_delim=',',
        # use_quote_delim=True,
        # na_value=,
        select_cols=selected_indices
        )
    
    print(type(training_dataset))
    for features in training_dataset.take(1):
        print('Training samples before mapping')
        print(features)
    
    validate_ds = training_dataset.map(preprocess).take(10).batch(100).repeat()
    train_ds = training_dataset.map(preprocess).skip(10).take(90).batch(100).repeat()


    model = tf.keras.Sequential([        
        layers.Dense(256,activation='elu'),  
        layers.Dense(128,activation='elu'),  
        layers.Dense(64,activation='elu'),  
        layers.Dense(1,activation='sigmoid') 
        ])
    # compile() returns None; capture `history` from fit() below instead.
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    
    history = model.fit(train_ds,
        validation_data=validate_ds,
        validation_steps=20,  # I think this is wrong; only 20 batches would be used on each epoch...
        steps_per_epoch=20,
        epochs=20,
        verbose=1
        )

Tags: epochs, keras, tensorflow, machine-learning

Category: Data Science


This can be answered only if you know the total number of samples in your entire dataset. The 200 GB file size might simply reflect a large number of features per sample. This is how steps_per_epoch and validation_steps work:

The training generator will yield steps_per_epoch batches.

When the epoch ends, the validation generator will yield validation_steps batches.
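A streamed CsvDataset does not expose its length, so to cover the full 200 GB each epoch you first need the record count. A minimal sketch, assuming the `training_dataset` variable from the question; note this is a one-time full pass over the data, so save the result rather than recomputing it:

    # Count the records by folding over the dataset once.
    # int64 accumulator to avoid overflow on very large datasets.
    num_samples = training_dataset.reduce(
        tf.constant(0, tf.int64),
        lambda count, _: count + 1
    ).numpy()
    print('Total samples:', num_samples)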

As a ball-park example: for a 1000-sample dataset with a 3:1 train-to-validation split and a batch size of 10, you would set steps_per_epoch to 75 and validation_steps to 25, so that each training and validation batch holds 10 samples and the whole split is consumed once per epoch.
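Applying the same arithmetic in code, a sketch under the assumptions above (the 1000-sample example, with `train_ds`, `validate_ds`, and `model` as in the question, batched at the chosen batch size and repeated):

    import math

    num_samples = 1000        # e.g. from the counting pass above
    train_fraction = 0.75     # 3:1 train-to-validation split
    batch_size = 10

    num_train = int(num_samples * train_fraction)         # 750
    num_val = num_samples - num_train                     # 250

    steps_per_epoch = math.ceil(num_train / batch_size)   # 75
    validation_steps = math.ceil(num_val / batch_size)    # 25

    history = model.fit(train_ds,
                        validation_data=validate_ds,
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        epochs=20,
                        verbose=1)

With these values, every epoch walks through each split exactly once even though the repeated datasets are effectively infinite.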
