How to precompute one sequence in a sequence-pair task when using BERT?
BERT uses separator tokens ([SEP]) to feed two sequences into the model for a sequence-pair task. If I understand the BERT architecture correctly, self-attention is applied across all input tokens, so the two sequences are coupled right from the first layer.
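For concreteness, this is roughly what I mean, using the Hugging Face transformers tokenizer (the texts are just placeholders I made up):

```python
# Illustration of how BERT receives a sequence pair: both sequences end up in
# a single input joined by [SEP], so every self-attention layer sees tokens
# from both sequences at once.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "Who wrote the report?"  # placeholder texts, just for illustration
context = "The report was written by the data science team in 2019."

encoding = tokenizer(question, context)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# -> ['[CLS]', 'who', 'wrote', ..., '[SEP]', 'the', 'report', ..., '[SEP]']
```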
Now, consider a sequence-pair task in which one of the sequences is constant and known in advance, for example answering many (as yet unknown) questions about a single known context. It seems to me that there could be a computational advantage if one could precompute (part of) the model's forward pass on the context alone. However, if my assumption is correct that the two sequences are coupled from the start, such precomputation seems infeasible.
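To make the idea concrete, here is a rough sketch of what I imagine, again using transformers. The caching part is straightforward; how (or whether) the cached context states can be combined with each new question so that the result matches BERT's joint pair encoding is exactly the part I don't know how to do:

```python
# Sketch of the hoped-for precomputation: encode the known context once,
# cache its hidden states, and only encode the (short) question per request.
# The combination step at the end is hypothetical and is what the question
# is about.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

context = "The report was written by the data science team in 2019."

# Step 1 (done once): encode the known context on its own and cache the result.
with torch.no_grad():
    context_inputs = tokenizer(context, return_tensors="pt")
    cached_context_states = model(**context_inputs).last_hidden_state

# Step 2 (per question): encode only the new question.
def encode_question(question: str) -> torch.Tensor:
    with torch.no_grad():
        question_inputs = tokenizer(question, return_tensors="pt")
        return model(**question_inputs).last_hidden_state

# Open question: how to combine `cached_context_states` with the question
# encoding so the result is comparable to feeding the pair jointly to BERT.
```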
Therefore my question is: how can one precompute one sequence in a sequence-pair task while still using (pre-trained) BERT? Can BERT be combined with some other type of architecture to achieve this? And does doing so even make sense in terms of speed and accuracy?
Topic bert tokenization deep-learning nlp
Category Data Science