Self-attention mechanism did not improve the LSTM classification model

I am doing an 8-class classification using time series data.

It appears that the self-attention mechanism has no effect on the model, so I think there is a problem with my implementation. However, I don't know how to use the keras_self_attention module or how its parameters should be set.

The question is how to use the keras_self_attention module for such a classifier.

The first confusion matrix is from 2 layers of LSTM:

   
    lstm_unit = 256

    model = tf.keras.models.Sequential()
    model.add(Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True)))
    model.add(Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True)))
    model.add(keras.layers.Flatten())
    model.add(Dense(num_classes, activation='softmax'))

The second confusion matrix is from 2 LSTM + 2 self-attention layers:

    lstm_unit = 256

    model = tf.keras.models.Sequential()
    model.add(Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True)))
    model.add(Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True)))

    model.add(SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                               attention_activation='sigmoid'))

    model.add(keras.layers.Flatten())
    model.add(Dense(num_classes, activation='softmax'))

I have further tried different functions from the module, such as

1. MultiHead

    model.add(MultiHead(Bidirectional(LSTM(units=32)), layer_num=10, name='Multi-LSTMs'))

2. Residual connection

    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))
    x = Masking(mask_value=0.0)(inputs)

    x2 = SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                          attention_activation='sigmoid')(x)

    x = x + x2

    x = Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True))(x)
    x = Flatten()(x)

    output = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=output)

But the results are more or less the same; there is not much effect on MAR, MAP and ACC.

I have 2 Titan Xp GPUs, so computational power is not much of a problem for me. Is there a way to make the model more accurate?

Topic deep-learning machine-learning

Category Data Science


> It appears that the self-attention mechanism has no effect on the model, so I think there is a problem with my implementation.

The first thing I noticed is your base model is doing badly, with low F1 scores, especially for the classes 1..7. Could it be there is just not that much signal in the training data, and the 2-layer LSTM has already sucked it all out?

My second thought is that self-attention is good at finding connections between items however far apart they are in the sequence. But an LSTM is also fairly good at that. So maybe more of the same is not what you need. Self-attention is an essential part of a transformer, because it is the only component that works across the sequence; the only other component is the FFN, which operates on each item in isolation.

Having got those two things off my chest, one thing I've noticed in experiments (and, again, these are in the context of transformers) is the importance of the residual connection. If I take out the residual connection that sits between self-attention and the FFN I get a significantly worse model. (Even though I still have the residual connection across the whole of each layer.)

So you could try adding a residual connection across the self-attention.
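In your second model that could look something like the following sketch (functional API, keeping the attention where it is, after the LSTMs; `h` is just a placeholder name for the output of the second Bidirectional LSTM, and `num_classes` is the same variable as in your code):

    # h = output of the second Bidirectional LSTM (return_sequences=True)
    a = SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                         attention_activation='sigmoid')(h)
    h = h + a          # residual connection across the self-attention
    h = Flatten()(h)
    outputs = Dense(num_classes, activation='softmax')(h)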

The other idea is to move self-attention earlier. Put it before the LSTMs, rather than after them. (Again with a residual connection across it.)

Here is the idea, in (untested) code. I've moved the self-attention module to before the first LSTM layer, and then taken the sum of its input and its output; that sum is what makes the residual connection across it.

    ...

    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))
    x = Masking(mask_value=0.0)(inputs)

    # self-attention applied before the first LSTM layer
    x2 = SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                          attention_activation='sigmoid')(x)

    # residual connection: sum of the attention block's input and output
    x = x + x2

    x = Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True))(x)

    ...

Note that you can't use the Keras Sequential class (because the data flow is no longer sequential).
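If it helps, the fragment above could be finished off with the functional API roughly like this (untested sketch; `lstm_unit`, `dropout` and `num_classes` are the same variables as in your code, and the loss is just an assumption about your label format):

    # second LSTM layer, then the same classification head as before
    x = Bidirectional(LSTM(lstm_unit, dropout=dropout, return_sequences=True))(x)
    x = Flatten()(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    # build the functional model from the Input tensor to the softmax output
    model = Model(inputs=inputs, outputs=outputs)

    # sparse_categorical_crossentropy assumes integer class labels;
    # use categorical_crossentropy instead if your labels are one-hot encoded
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])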

Ref: https://keras.io/examples/nlp/text_classification_with_transformer/ The residual part is hard to spot at first, as it is done as the input to the two layer norm calls.
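For orientation, the transformer block in that example follows roughly this pattern (paraphrased, not copied verbatim); the residuals are the `inputs + attn_output` and `out1 + ffn_output` sums fed into the layer norms:

    # inside the example's transformer block (paraphrased)
    attn_output = self.att(inputs, inputs)           # multi-head self-attention
    out1 = self.layernorm1(inputs + attn_output)     # residual #1, hidden inside the layer norm input
    ffn_output = self.ffn(out1)                      # position-wise feed-forward network
    return self.layernorm2(out1 + ffn_output)        # residual #2, again inside the layer norm input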
