Question about computing the language modeling loss with multiple GPUs
When training BERT, GPT, or another language model, we use the mean cross entropy over all target tokens as the loss function (ignoring label smoothing here). Let $|B|$ denote the batch size and $len_i$ the target length of the $i$-th sequence: $$L = \frac{\sum_{i=1}^{|B|}\sum_{j=1}^{len_i}ce(y_{ij},\hat{y}_{ij})}{\sum_{i=1}^{|B|}{len_i}} \tag{1}$$
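For concreteness, here is a minimal PyTorch sketch of (1), assuming non-target/padded positions are marked with the conventional ignore index `-100` (an assumption; the exact masking scheme depends on the data pipeline):

```python
import torch
import torch.nn.functional as F

def token_mean_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Equation (1): summed per-token cross entropy divided by the number of
    target tokens. logits: (B, T, V); targets: (B, T), with non-target
    positions set to -100 (assumed ignore convention)."""
    ce_sum = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="sum",
        ignore_index=-100,
    )
    num_tokens = (targets != -100).sum()
    return ce_sum / num_tokens
```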
When using multiple GPUs, the common forward pass is (a minimal sketch of this pattern follows the list):
- Split the data across the GPUs;
- Compute the loss on each GPU;
- Reduce the losses (most of the time we simply take the mean of the per-GPU losses).
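For illustration, a minimal sketch of that pattern with `torch.distributed`, reusing the hypothetical `token_mean_ce` helper from above. (In practice one usually doesn't all-reduce the loss value itself; DDP's gradient averaging has the same effect, but the explicit reduction makes the formula visible.)

```python
import torch.distributed as dist

def mean_of_per_gpu_means(local_logits, local_targets):
    # Each GPU computes the token-mean loss on its own shard of the batch...
    local_loss = token_mean_ce(local_logits, local_targets)
    # ...then the per-GPU means are summed across ranks and divided by the
    # number of GPUs, i.e. a plain average of the per-GPU losses.
    dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)
    return local_loss / dist.get_world_size()
```

This is exactly equation (2) below.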
Now, if we combine the above, the loss can be written as: $$L = \frac{1}{N}\sum_{n=1}^N{\frac{\sum_{i=1}^{|b|}\sum_{j=1}^{len_{n,i}}ce(y_{n,i,j},\hat{y}_{n,i,j})}{\sum_{i=1}^{|b|}{len_{n,i}}}} \tag{2}$$ Here $N$ denotes the number of GPUs, $|b| = |B| / N$ is the per-GPU batch size, and $len_{n,i}$ denotes the length of the $i$-th sentence on the $n$-th GPU.
But if the same combined batch were processed on a single GPU, then following (1) the loss would be: $$L = \frac{\sum_{n=1}^N{\sum_{i=1}^{|b|}\sum_{j=1}^{len_{n,i}}ce(y_{n,i,j},\hat{y}_{n,i,j})}}{{\sum_{n=1}^N}\sum_{i=1}^{|b|}{len_{n,i}}} \tag{3}$$
(3) is always equal to (1). If all sequences have the same length, then (1), (2), and (3) are all equal. But when the target length varies per sequence, (2) is not always equal to (1) (a small numeric example follows this list), and variable-length targets are exactly what we see in BERT and GPT:
- BERT uses random masking, so the number of target tokens per sentence varies.
- GPT trains on texts of different lengths, so the number of target tokens per sentence also varies.
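For a concrete toy example (numbers made up): take $N = 2$ GPUs with one sequence each. Suppose the sequence on GPU 1 has $1$ target token with total cross entropy $1.0$, and the sequence on GPU 2 has $3$ target tokens with total cross entropy $6.0$. Then (2) gives $\frac{1}{2}\left(\frac{1.0}{1} + \frac{6.0}{3}\right) = 1.5$, whereas (3), like (1), gives $\frac{1.0 + 6.0}{1 + 3} = 1.75$: (2) effectively over-weights tokens from short sequences.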
If we want to compute the loss as in (3), then each GPU needs to know the number of target tokens on the other GPUs (I think it's a bit like synchronized batch normalization); a possible sketch follows this paragraph. But I found that most implementations use (2). This may be because they sort the sentences by length after shuffling, so that every batch contains sequences of (roughly) the same length.
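A minimal sketch of what that synchronization could look like with `torch.distributed`, assuming a process group is already initialized and using the `-100` ignore convention from earlier (the function name is mine, not from any library):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def loss_for_equation_3(logits, targets, ignore_index=-100):
    """Per-rank contribution to equation (3): local summed cross entropy
    divided by the *global* number of target tokens. Summing these
    contributions over all ranks gives exactly (3)."""
    ce_sum = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="sum",
        ignore_index=ignore_index,
    )
    # Share the denominator: every GPU learns the total target-token count.
    # This is the "each GPU should know the other GPUs' lengths" step.
    num_tokens = (targets != ignore_index).sum()
    dist.all_reduce(num_tokens, op=dist.ReduceOp.SUM)

    loss = ce_sum / num_tokens
    # Note: if the framework *averages* gradients across ranks (as DDP does),
    # multiply this loss by dist.get_world_size() so the averaged gradient
    # matches the gradient of (3).
    return loss
```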
My question is: can someone tell me whether the following conclusions are right or wrong? (I couldn't find an answer anywhere.)
- If the sequence lengths are not all the same, we can't use (2) to compute the loss.
- Most implementations use (2) because they sort the training corpus by length.
- (Question) Is it harmful (or beneficial) to sort the training corpus by length after shuffling, setting efficiency aside? Without sorting, the batches seem more randomized; with sorting, the gradient computed via (2) seems more accurate.
Topic bert deep-learning language-model nlp
Category Data Science