Very Fast Training After First Epoch

I trained an InceptionV3 model on plant images using the Keras library. When training started, the first epoch took about 29s per step, and subsequent steps took approximately 530ms per step. That made me suspect a bug in my code. I have checked the code several times and its logic seems right to me. I trained the model on Google Colab. Is there some memoization mechanism at work, or does my code contain a bug? Here is my code:

# Yields one image-target pair when called
def image_target_generator(files, labels):
  assert len(files) == len(labels), 'Files and labels sizes don\'t match!'

  for step in range(len(files)):
    img = cv2.imread(dataset_path + files[step])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    item = (img, labels[step])
    yield item

# Generating batch
def batch_generator(gen):

    batch_images = []
    batch_targets = []

    for item in gen:

      if len(batch_images) == BATCH_SIZE:
        yield batch_images, batch_targets
        batch_images = []
        batch_targets = []

      preprocessed_img = preprocess_image(item[0])
      batch_images.append(preprocessed_img)
      batch_targets.append(item[1])      

    yield batch_images, batch_targets

# Training generator
def training_generator(files, labels):

  # So that Keras can loop it as long as required
  while True:

    for batch in batch_generator(image_target_generator(files, labels)):
      batch_images = np.stack(batch[0], axis=0)
      batch_targets = keras.utils.np_utils.to_categorical(batch[1], NUM_CLASSES)
      yield batch_images, batch_targets


# Create model
def create_model():
  model = keras.applications.InceptionV3(include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3), weights='imagenet')

  new_output = keras.layers.GlobalAveragePooling2D()(model.output)
  new_output = keras.layers.Dense(NUM_CLASSES, activation='softmax')(new_output)
  model = keras.engine.training.Model(model.inputs, new_output)

  for layer in model.layers:
    layer.trainable = True

    if isinstance(layer, keras.layers.BatchNormalization):
      layer.momentum = 0.9

  for layer in model.layers[:-50]:
    if not isinstance(layer, keras.layers.BatchNormalization):
      layer.trainable = False

  return model

# Compiling model
model = create_model()

model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adamax(lr=1e-2), metrics=['accuracy'])

# Fitting model
model.fit_generator(
  training_generator(train_x, train_y),
  steps_per_epoch=len(train_x) // BATCH_SIZE,
  epochs = 30,
  validation_data=training_generator(test_x, test_y),
  validation_steps=len(test_x) // BATCH_SIZE
  ) 


The slowdown could be due to the time TensorFlow spends running cuDNN benchmarks for each input shape and storing the results in a cache. This happens during the first epoch and is triggered by differences in the sizes of the images.
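
If you want to test that idea on a Colab GPU runtime, you can disable cuDNN autotuning and see whether the first epoch still takes much longer than the rest. The sketch below uses TensorFlow's TF_CUDNN_USE_AUTOTUNE environment variable, which must be set before TensorFlow is imported; treat it as a quick diagnostic, not a recommended permanent setting.

# Rough check: turn off cuDNN's per-shape algorithm benchmarking.
# The flag must be set before importing tensorflow/keras, otherwise it has no effect.
import os
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'

import tensorflow as tf
import keras

If the gap between the first epoch and the later epochs shrinks noticeably with autotuning disabled, the benchmarking explanation fits; if nothing changes, the extra time is coming from somewhere else (data loading or graph construction).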


Although there are valid points in the accepted answer, I believe it is incorrect in this case.

The timing difference mentioned is between the first epoch of training and the remaining epochs. The model, and therefore the computational graph, is compiled only once, when you call model.compile(), which is not part of the training itself.

The difference in timings is due to the data being generated and loaded into memory before and during the first epoch. Keras (or its backend) caches this data as much as it can, so all subsequent epochs train faster.
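
A simple way to check this explanation is to time the generator on its own, without running the model at all. The sketch below reuses training_generator, train_x, train_y and BATCH_SIZE from the question and measures how long one epoch's worth of batches takes to produce:

# Measure pure data loading/preprocessing time for one epoch's worth of batches
import time

gen = training_generator(train_x, train_y)
n_batches = len(train_x) // BATCH_SIZE

start = time.time()
for _ in range(n_batches):
  next(gen)  # reads, decodes and preprocesses one batch, no training involved
elapsed = time.time() - start
print('Data only: %.1fs total, %.0f ms per batch' % (elapsed, 1000.0 * elapsed / n_batches))

If this loop alone takes close to the first-epoch time, data loading dominates; if it is much faster, the extra first-epoch time is coming from somewhere else.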


Keras supports lazy execution. The create_model and model.compile code is not executed until it is absolutely required, which is right before the first training epoch. The increased time for the first epoch includes building the TensorFlow computational graph based on the plan in your create_model function. All remaining epochs reuse the same computational graph, which is why they are significantly faster.
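
To get a feel for how large that one-off setup cost is, you can push a single batch through the model before calling fit_generator and compare it with a second batch on the already-built graph. A rough sketch, reusing the names from the question:

# Time the very first batch (includes graph construction / GPU kernel setup)
# against a second batch on the graph that already exists.
import time

warmup_x, warmup_y = next(training_generator(train_x, train_y))

start = time.time()
model.train_on_batch(warmup_x, warmup_y)
print('First batch (includes setup): %.1f s' % (time.time() - start))

start = time.time()
model.train_on_batch(warmup_x, warmup_y)
print('Second batch: %.3f s' % (time.time() - start))

train_on_batch is a standard Keras Model method, so no extra setup is needed; just be aware that these two warm-up updates slightly change the weights before the real training starts.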
