How to specify steps_per_epoch and validation_steps on an infinite dataset?

I have a huge CSV dataset, about 200 GB in size. I'm using CsvDataset to build a dataset generator that streams the data from disk while training the model. I want all of the data to be passed through on each epoch, so what should I pass for the parameters steps_per_epoch and validation_steps?

Here is my Keras model using the dataset.


    import pathlib
    import tensorflow as tf
    from tensorflow.keras import layers

    # The glob pattern must be a quoted string.
    training_csvs = sorted(str(p) for p in pathlib.Path('.').glob('path-to-data/Train_DS/*/*.csv'))
    print(training_csvs)

    # `defaults` (per-column record defaults) and `selected_indices`
    # (indices of the columns to load) are defined elsewhere.
    training_dataset = tf.data.experimental.CsvDataset(
        training_csvs,
        record_defaults=defaults,
        compression_type=None,
        buffer_size=None,
        header=True,
        field_delim=',',
        # use_quote_delim=True,
        # na_value=,
        select_cols=selected_indices
        )
    
    print(type(training_dataset))
    for features in training_dataset.take(1):
        print('Training samples before mapping')
        print(features)
    
    validate_ds = training_dataset.map(preprocess).take(10).batch(100).repeat()
    train_ds = training_dataset.map(preprocess).skip(10).take(90).batch(100).repeat()


    model = tf.keras.Sequential([        
        layers.Dense(256,activation='elu'),  
        layers.Dense(128,activation='elu'),  
        layers.Dense(64,activation='elu'),  
        layers.Dense(1,activation='sigmoid') 
        ])
    # compile() returns None; capture `history` from fit() below instead.
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    
    history = model.fit(train_ds,
        validation_data=validate_ds,
        validation_steps=20,  # I think this is wrong; only 20 batches would be used on each epoch...
        steps_per_epoch=20,
        epochs=20,
        verbose=1
        )

Tags: epochs, keras, tensorflow, machine-learning

Category: Data Science


This can be answered only if you know the total number of samples in your entire dataset. The 200 GB file size might simply reflect a large number of features per sample. This is how steps_per_epoch and validation_steps work:

The training generator will yield steps_per_epoch batches.

When the epoch ends, the validation generator will yield validation_steps batches.
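A streamed CsvDataset does not expose its length, so to cover the full 200 GB each epoch you first need the record count. A minimal sketch, assuming the `training_dataset` variable from the question; note this is a one-time full pass over the data, so save the result rather than recomputing it:

    # Count the records by folding over the dataset once.
    # int64 accumulator to avoid overflow on very large datasets.
    num_samples = training_dataset.reduce(
        tf.constant(0, tf.int64),
        lambda count, _: count + 1
    ).numpy()
    print('Total samples:', num_samples)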

As a ball-park example: for a 1000-sample dataset with a 3:1 train-to-validation split and a batch size of 10, you would set steps_per_epoch to 75 and validation_steps to 25, so that each training and validation batch holds 10 samples and the whole split is consumed once per epoch.
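Applying the same arithmetic in code, a sketch under the assumptions above (the 1000-sample example, with `train_ds`, `validate_ds`, and `model` as in the question, batched at the chosen batch size and repeated):

    import math

    num_samples = 1000        # e.g. from the counting pass above
    train_fraction = 0.75     # 3:1 train-to-validation split
    batch_size = 10

    num_train = int(num_samples * train_fraction)         # 750
    num_val = num_samples - num_train                     # 250

    steps_per_epoch = math.ceil(num_train / batch_size)   # 75
    validation_steps = math.ceil(num_val / batch_size)    # 25

    history = model.fit(train_ds,
                        validation_data=validate_ds,
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        epochs=20,
                        verbose=1)

With these values, every epoch walks through each split exactly once even though the repeated datasets are effectively infinite.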
