How to precompute one sequence in a sequence-pair task when using BERT?

BERT uses a separator token ([SEP]) to feed two sequences into the model for a sequence-pair task. If I understand the BERT architecture correctly, attention is applied across all inputs, thus coupling the two sequences right from the start.

Now, consider a sequence-pair task in which one of the sequences is constant and known from the start, e.g. answering multiple unknown questions about a known context. It seems to me that there could be a computational advantage if one could precompute (part of) the forward pass with the context only. However, if my assumption is correct that the two sequences are coupled from the start, such precomputation is infeasible.

Therefore my question is: How to precompute one sequence in a sequence-pair task while still using (pre-trained) BERT? Can we combine BERT with some other type of architecture to achieve this? And does it even make sense to do it in terms of speed and accuracy?

Topic: bert, tokenization, deep-learning, nlp

Category: Data Science


Each token position at each of BERT's attention layers is computed taking into account all tokens of both sequences. As a result, not a single intermediate value depends on just the first sequence, and therefore nothing can be precomputed and reused across different second sequences.


The very nature of BERT's architecture, with full self-attention over both sequences, prevents you from factoring out the computations involving the first sequence.
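You can verify this concretely with a minimal sketch (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the context and questions are made-up examples): the hidden states of the context tokens change as soon as the question changes, so they cannot be cached.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

context = "The Eiffel Tower is located in Paris."

def hidden_states(question: str) -> torch.Tensor:
    # Encodes the pair as "[CLS] question [SEP] context [SEP]".
    enc = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state  # (1, seq_len, 768)

h1 = hidden_states("Where is the Eiffel Tower?")
h2 = hidden_states("How tall is the Eiffel Tower?")

# Both sequences end with the same context tokens; compare their states.
print(torch.allclose(h1[0, -5:], h2[0, -5:], atol=1e-4))  # -> False
```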

In other, similar architectures like ALBERT, there are some parts that could be reused, such as the embedding computation (ALBERT's embeddings are factorized, which makes the embedding matrix smaller but adds a projection at runtime), but I am not sure that reusing this computation would save much.
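For illustration, here is a rough sketch of what ALBERT-style embedding factorization looks like (the sizes are the commonly quoted ALBERT-base values and purely illustrative). Only this per-token lookup-plus-projection step could be cached; every attention layer after it still mixes both sequences.

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, hidden_size = 30000, 128, 768

token_embedding = nn.Embedding(vocab_size, embedding_size)  # 30000 x 128
projection = nn.Linear(embedding_size, hidden_size)         # 128 -> 768

token_ids = torch.tensor([[101, 2054, 2003, 102]])          # toy input ids
hidden = projection(token_embedding(token_ids))             # (1, 4, 768)
print(hidden.shape)
```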

I don't know of any architecture made for sequence pairs that would let you do what you described, as most sequence pair approaches derive from BERT, which itself relies on computing attention between every token pair.

One option would be to use a network that gives you a fixed-size representation (i.e. a vector) of a sentence: you would apply it to each of the sentences in the pair and then feed both vectors to a second network (e.g. a multilayer perceptron receiving the concatenation of both vectors) to compute the final output. Depending on the task, this may give you good results and would allow the precomputation you describe, since the vector for the constant sequence only needs to be computed once. To obtain the sentence representation vector, you may use BERT itself (the output at the [CLS] position) or some other architecture like LASER.
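As a rough sketch of that bi-encoder idea (again assuming Hugging Face `transformers` with `bert-base-uncased` as the sentence encoder; the MLP head, its hidden size, and the example sentences are illustrative and would need to be trained on your task):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(text: str) -> torch.Tensor:
    """Fixed-size sentence representation: BERT's output at [CLS]."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[:, 0]  # (1, 768)

# Precompute the constant sequence once.
context_vec = embed("The Eiffel Tower is located in Paris.")

# Pair head (untrained here); in practice, train it on labelled pairs.
pair_head = nn.Sequential(
    nn.Linear(2 * 768, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

for question in ["Where is the Eiffel Tower?", "How tall is the Eiffel Tower?"]:
    question_vec = embed(question)
    score = pair_head(torch.cat([context_vec, question_vec], dim=-1))
    print(question, score.item())
```

Keep in mind that such bi-encoder setups are usually somewhat less accurate than feeding both sequences jointly through BERT as a cross-encoder, so whether the speed-up is worth it depends on your task.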
