How to get sentence embedding using BERT?
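import torch
from transformers import BertModel, BertTokenizer
# 'bert-base-uncased' is assumed here; its hidden size of 768 matches the output shapes further down.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'
1. Tokenize the sequence:
tokens = tokenizer.tokenize(sentence)
2. Add [CLS] and [SEP] tokens:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print('Tokens are \n {}'.format(tokens))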
3. Pad the input and build the attention mask:
T = 15  # maximum sequence length; matches the [1, 15, 768] output shape further down
padded_tokens = tokens + ['[PAD]' for _ in range(T - len(tokens))]
print('Padded tokens are \n {}'.format(padded_tokens))
attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print('Attention mask is \n {}'.format(attn_mask))
4. Maintain a list of segment tokens:
seg_ids = [0 for _ in range(len(padded_tokens))]
print('Segment tokens are \n {}'.format(seg_ids))
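Steps 2 to 4 can also be collected into one small helper, which makes the relationship between the padded tokens, the attention mask, and the segment ids easier to check. This is only a sketch: `prepare_inputs` is a hypothetical name, and the word pieces are written out by hand as an assumption about what the tokenizer returns for this sentence.

```python
def prepare_inputs(tokens, max_len, pad_token='[PAD]'):
    """Add special tokens, pad to max_len, and build the mask and segment ids."""
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    if len(tokens) > max_len:
        raise ValueError('sequence is longer than max_len')
    # Pad on the right so that every sequence has exactly max_len entries.
    padded = tokens + [pad_token] * (max_len - len(tokens))
    # 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
    attn_mask = [int(tok != pad_token) for tok in padded]
    # A single sentence, so every position belongs to segment 0.
    seg_ids = [0] * max_len
    return padded, attn_mask, seg_ids


# Hand-written word pieces (assumed); in the post they come from the tokenizer.
tokens = ['i', 'really', 'enjoyed', 'this', 'movie', 'a', 'lot', '.']
padded_tokens, attn_mask, seg_ids = prepare_inputs(tokens, max_len=15)
print(padded_tokens)
print(attn_mask)
print(seg_ids)
```

With 8 word pieces plus [CLS] and [SEP], 10 positions are real and the remaining 5 are padding.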
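5. Obtain the indices of the tokens in BERT’s vocabulary:
sent_ids = tokenizer.convert_tokens_to_ids(padded_tokens)  # tokenizer is the BertTokenizer instance
print('Sentence indexes \n {}'.format(sent_ids))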
token_ids = torch.tensor(sent_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
seg_ids = torch.tensor(seg_ids).unsqueeze(0)
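Feed them to BERT:
# Recent transformers versions return an output object, so the fields are read by name instead of tuple unpacking.
outputs = bert_model(token_ids, attention_mask=attn_mask, token_type_ids=seg_ids)
hidden_reps = outputs.last_hidden_state  # hidden state of each token in the input sequence
cls_head = outputs.pooler_output  # [CLS] hidden state after BERT's pooling layer (dense + tanh)
print(hidden_reps.shape)
print(cls_head.shape)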
hidden_reps size
torch.Size([1, 15, 768])
cls_head size
torch.Size([1, 768])
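To make the two shapes concrete, here is a minimal sketch of two ways a [1, 15, 768] tensor can be reduced to a single [1, 768] vector: taking the hidden state at the [CLS] position, or averaging over the non-padding positions. A random tensor stands in for the real BERT output (an assumption for illustration), and `cls_vector` and `mean_vector` are hypothetical names. Note that `cls_vector` is the raw hidden state at position 0, which is not the same as `cls_head`, since the latter has additionally passed through BERT's pooling layer.

```python
import torch

torch.manual_seed(0)
# Stand-in for hidden_reps: (batch=1, seq_len=15, hidden=768), random values.
hidden_reps = torch.randn(1, 15, 768)
# 10 real tokens followed by 5 [PAD] positions, as in the post.
attn_mask = torch.tensor([[1] * 10 + [0] * 5])

# Option 1: the raw hidden state at position 0, i.e. the [CLS] token.
cls_vector = hidden_reps[:, 0, :]                     # (1, 768)

# Option 2: mean over real tokens only; the mask zeroes out padding positions.
mask = attn_mask.unsqueeze(-1).to(hidden_reps.dtype)  # (1, 15, 1)
summed = (hidden_reps * mask).sum(dim=1)              # (1, 768)
counts = mask.sum(dim=1).clamp(min=1e-9)              # (1, 1), guards against division by zero
mean_vector = summed / counts                         # (1, 768)

print(cls_vector.shape, mean_vector.shape)
```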
Which vector represents the sentence embedding here: hidden_reps or cls_head?
Is there any other way to get a sentence embedding from BERT in order to perform a similarity check with other sentences?
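For context, the similarity check I have in mind would look roughly like the sketch below, whichever vector turns out to be the right one. Random tensors are used as placeholders for two sentence embeddings (an assumption for illustration; `emb_a` and `emb_b` are hypothetical names), so the printed score says nothing about real sentences.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Random stand-ins for two sentence embeddings of shape (1, 768).
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

# Cosine similarity along the hidden dimension; the result has shape (1,).
sim_ab = F.cosine_similarity(emb_a, emb_b, dim=1)
# Sanity check: a vector compared with itself should score about 1.0.
sim_aa = F.cosine_similarity(emb_a, emb_a, dim=1)

print(sim_ab.item(), sim_aa.item())
```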