Training an ensemble of small neural networks efficiently in TensorFlow 2
I have a bunch of small neural networks (say, 5 to 50 feed-forward networks with only two hidden layers of 10-100 neurons each), which differ only in their weight initialization. I want to train them all on the same, smallish dataset (say, 10K rows) with a batch size of 1, and then combine them into an ensemble by averaging their results.
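For reference, the most direct way to do this would be to train each network separately and average the predictions afterwards, roughly like the following sketch (the placeholder data, the Adam optimizer, and the binary cross-entropy loss are my assumptions, the loss chosen to match the sigmoid output):

import numpy as np
import tensorflow as tf

def make_small_net(inputs: int, width: int) -> tf.keras.Model:
    # one small feed-forward network; only the random initialization differs between calls
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation='tanh', input_shape=(inputs,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

# placeholder data standing in for the real 10K-row dataset
X = np.random.rand(10000, 30).astype('float32')
y = np.random.randint(0, 2, size=(10000, 1)).astype('float32')

models = [make_small_net(inputs=30, width=10) for _ in range(5)]
for m in models:
    m.compile(optimizer='adam', loss='binary_crossentropy')
    m.fit(X, y, batch_size=1, epochs=1)  # one sample per gradient step

# the ensemble prediction is the average of the individual predictions
ensemble_prediction = np.mean([m.predict(X) for m in models], axis=0)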
Now, of course I can build the whole ensemble as one neural network in TensorFlow/Keras, like this:
import tensorflow as tf


def bagging_ensemble(inputs: int, width: int, weak_learners: int):
    r'''Return a generic dense network model

    inputs: number of columns (features) in the input data set
    width: number of neurons in the hidden layer of each weak learner
    weak_learners: number of weak learners in the ensemble
    '''
    assert width >= 1, 'width is required to be at least 1'
    assert weak_learners >= 1, 'weak_learners is required to be at least 1'

    activation = tf.keras.activations.tanh
    kernel_initializer = tf.initializers.GlorotUniform()

    input_layer = tf.keras.Input(shape=(inputs,))

    # build the hidden part as a list of weak learners, all fed from the same input
    hidden = []
    for _ in range(weak_learners):
        weak_learner = tf.keras.layers.Dense(units=width, activation=activation,
                                             kernel_initializer=kernel_initializer)(input_layer)
        weak_learner = tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid)(weak_learner)
        hidden.append(weak_learner)

    output_layer = tf.keras.layers.Average()(hidden)  # average the weak learners' outputs
    return tf.keras.Model(input_layer, output_layer)
example_model = bagging_ensemble(inputs=30, width=10, weak_learners=5)
tf.keras.utils.plot_model(example_model)
The resulting model's plot looks like this:
However, training the model is slow, and because of the batch size of 1, a GPU doesn't really speed up the process. How can I make better use of the GPU when training such a model in TensorFlow 2, without using a larger batch size?
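For context, the training call I'm referring to is nothing more than a standard compile/fit with batch_size=1; the placeholder data, Adam optimizer, and binary cross-entropy loss below are my assumptions (the loss chosen because the output layer is a sigmoid):

import numpy as np

# placeholder data standing in for the real 10K-row dataset
X_train = np.random.rand(10000, 30).astype('float32')
y_train = np.random.randint(0, 2, size=(10000, 1)).astype('float32')

model = bagging_ensemble(inputs=30, width=10, weak_learners=5)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.BinaryCrossentropy())

# batch_size=1 means 10,000 tiny gradient steps per epoch, so the GPU sits mostly idle
model.fit(X_train, y_train, batch_size=1, epochs=10)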
[The motivation for using this kind of ensemble is the following: Especially with small datasets, each of the small networks will yield different results because of different random initializations. By bagging as demonstrated here, the quality of the resulting model is greatly enhanced. If you're interested in the thorough neural network modelling approach this technique comes from, look for the works of H. G. Zimmermann.]
Topic training keras tensorflow ensemble-modeling neural-network
Category Data Science