How to get sentence embedding using BERT?

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'

# 1. Tokenize the sequence:
tokens = tokenizer.tokenize(sentence)
print(tokens)
print(type(tokens))

2. Add [CLS] and [SEP] tokens:

tokens = ['[CLS]'] + tokens + ['[SEP]']
print('Tokens are \n{}'.format(tokens))

3. Padding the input:

T=15
padded_tokens=tokens +['[PAD]' for _ in range(T-len(tokens))]
print('Padded tokens are \n{}'.format(padded_tokens))
attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print('Attention mask is \n{}'.format(attn_mask))

4. Maintain a list of segment tokens:

seg_ids = [0 for _ in range(len(padded_tokens))]
print('Segment tokens are \n{}'.format(seg_ids))

5. Obtain indices of the tokens in BERT’s vocabulary:

sent_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
print('Sentence indices \n{}'.format(sent_ids))
token_ids = torch.tensor(sent_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
seg_ids   = torch.tensor(seg_ids).unsqueeze(0)

6. Feed them to BERT:

# Load the model itself (not shown in the original post); return_dict=False keeps
# the tuple output that is unpacked below.
bert_model = BertModel.from_pretrained('bert-base-uncased', return_dict=False)

hidden_reps, cls_head = bert_model(token_ids, attention_mask=attn_mask, token_type_ids=seg_ids)
print(type(hidden_reps))
print(hidden_reps.shape)  # hidden states of each token in input sequence
print(cls_head.shape)     # hidden states of each [cls]

output:
hidden_reps size 
torch.Size([1, 15, 768])

cls_head size
torch.Size([1, 768])

Which vector represents the sentence embedding here? Is it hidden_reps or cls_head ?

Is there any other way to get sentence embedding from BERT in order to perform similarity check with other sentences?



This is an excellent guide on using sentence/text embeddings for similarity measures. Important: BERT does not define a sentence-level representation, so basically anything between [CLS] and [SEP] counts as a piece of text for which you can use the output embedding.

https://github.com/VincentK1991/BERT_summarization_1/blob/master/notebook/Primer_to_BERT_extractive_summarization_March_25_2020.ipynb

This approach uses the 768-dimensional value of the [CLS] token, i.e. basically the cls_head in your question.

Since S-BERT is mentioned in another answer: that paper contends that taking the [CLS] token's embedding does not work very well for text matching, natural language inference, etc. They fine-tune BERT on a loss objective such that sentences which entail one another get a higher similarity score, and they use mean pooling of the token embeddings rather than taking the [CLS]. In the end you get the same kind of result: a vector of shape [1, 768].
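
As a rough sketch of that idea with plain bert-base-uncased (not a fine-tuned S-BERT checkpoint, so the similarity scores are only illustrative; the second sentence is made up for the example):

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def mean_pooled(sentence):
    # Run BERT, then average the token embeddings while ignoring [PAD] positions.
    enc = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # [1, seq_len, 768]
    mask = enc['attention_mask'].unsqueeze(-1).float()   # [1, seq_len, 1]
    return (hidden * mask).sum(1) / mask.sum(1)          # [1, 768]

u = mean_pooled('I really enjoyed this movie a lot.')
v = mean_pooled('The film was thoroughly enjoyable.')
print(F.cosine_similarity(u, v))  # higher = more similar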

The following link has an excellent tutorial for this: https://www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/


For anyone coming to this question from Google, I'll share my experience with building sentence embeddings. With a standard BERT model you have three options:

  • CLS: You take the first vector of the hidden_state, which is the token embedding of the classification [CLS] token
  • Mean pooling: Take the average value across each dimension over the (up to 512) hidden_state token embeddings, making sure to exclude [PAD] embeddings
  • Max pooling: Take the max value across each dimension over the hidden_state token embeddings, again excluding [PAD]

If you're using the standard BERT, mean pooling or CLS are your best bets; both have worked for me in the past - a rough sketch of all three options is below.
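
As an illustration (a sketch only, reusing the hidden_reps and attn_mask tensors from the question's code; none of this is prescribed by the answer):

import torch

mask = attn_mask.unsqueeze(-1).float()                # [1, seq_len, 1], 0 at [PAD] positions

cls_vec  = hidden_reps[:, 0]                          # option 1: the [CLS] token's hidden state
mean_vec = (hidden_reps * mask).sum(1) / mask.sum(1)  # option 2: mean over non-[PAD] tokens
max_vec  = hidden_reps.masked_fill(mask == 0, -1e9).max(dim=1).values  # option 3: max over non-[PAD] tokens
# each of cls_vec, mean_vec, max_vec has shape [1, 768]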

However, there are BERT models that have been fine-tuned specifically for creating sentence embeddings. They're called sentence transformers and one of the easiest ways to use one of these is via the sentence-transformers library.

Generally these models use the mean pooling approach, but they have been fine-tuned to produce good sentence embeddings, and they far outperform anything a standard BERT model could do.
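
For instance, a minimal sketch with the sentence-transformers library (the checkpoint name 'all-MiniLM-L6-v2' is just one commonly used model, chosen here for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([
    'I really enjoyed this movie a lot.',
    'The film was thoroughly enjoyable.',
])
print(embeddings.shape)                            # (2, 384) for this checkpoint
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the two sentences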

If you want to fine-tune your own BERT or another transformer, most of the current state-of-the-art models are fine-tuned using Multiple Negatives Ranking (MNR) loss (PS: I wrote that article). With this loss the model learns to pull similar sentence pairs together relative to in-batch negatives, and after a pretty short training session (just over an hour for me on an RTX 3090) you can produce a good-quality sentence transformer model.
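
A minimal MNR training sketch with sentence-transformers (the example pairs and hyperparameters are placeholders, not taken from the linked article):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('bert-base-uncased')  # plain BERT; mean pooling is added on top automatically

# MNR loss needs (anchor, positive) pairs; the other positives in each batch act as negatives.
train_examples = [
    InputExample(texts=['How do I get a sentence embedding?',
                        'Ways to embed a sentence with BERT']),
    InputExample(texts=['I really enjoyed this movie a lot.',
                        'The film was thoroughly enjoyable.']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)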

That being said, there are already many great pretrained models out there; there's a list of some of the better models here, although it isn't fully up to date - for example, flax-sentence-embeddings/all_datasets_v3_mpnet-base performs better on benchmarks than any of those listed.


bert-as-service provides a very easy way to generate embeddings for sentences.

It is explained very well in the bert-as-service repository:

Installations:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download one of the pre-trained models available here.

Start the service:

bert-serving-start -model_dir /your_model_directory/ -num_worker=4 

Generate the vectors for the list of sentences:

from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)

There is a very cool tool called bert-as-service which does the job for you. It maps a sentence to a fixed-length embedding based on the pre-trained model you use. It also allows a lot of parameter tweaking, which is covered extensively in the documentation.


There is actually an academic paper for doing so. It is called S-BERT or Sentence-BERT.
They also have a GitHub repo which is easy to work with.


In your example, the hidden state corresponding to the first token ([CLS]) in hidden_reps can be used as a sentence embedding.

By contrast, the pooled output (mistakenly referred to as hidden states of each [cls] in your code) proved a bad proxy for a sentence embedding in my experiments.
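
In code, that first-token slice is simply (a sketch reusing hidden_reps from the question):

# hidden_reps has shape [1, seq_len, 768]; position 0 is the [CLS] token
sentence_embedding = hidden_reps[:, 0]  # shape [1, 768]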


Which vector represents the sentence embedding here? Is it hidden_reps or cls_head?

If we look in the forward() method of the BERT model (in the version of the transformers library used here; newer versions return a ModelOutput object by default), we see the following lines explaining the return types:

outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]  # add hidden_states and attentions if they are here
return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)

So the first element of the tuple is the "sequence output" - each token in the input is embedded in this tensor. In your example, you have 1 input sequence, which was 15 tokens long, and each token was embedded into a 768-dimensional space.

The second element of the tuple is the "pooled output". You'll notice that the "sequence" dimension has been squashed, so this represents a pooled embedding of the input sequence.

So they both represent the sentence embedding. You can think of hidden_reps as a "verbose" representation, where each token has been embedded. You can think of cls_head as a condensed representation, where the entire sequence has been pooled.

Is there any other way to get sentence embedding from BERT in order to perform similarity check with other sentences?

Using the transformers library is the easiest way I know of to get sentence embeddings from BERT.

There are, however, many ways to measure similarity between embedded sentences. The simplest approach would be to measure the Euclidean distance between the pooled embeddings (cls_head) for each sentence.
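
For example, a minimal sketch of comparing the pooled embeddings of two sentences (cls_head_2 is hypothetical here, obtained for a second sentence the same way as in the question):

import torch
import torch.nn.functional as F

# cls_head and cls_head_2: pooled outputs of shape [1, 768] for two different sentences
euclidean_dist = torch.cdist(cls_head, cls_head_2)       # smaller = more similar
cosine_sim = F.cosine_similarity(cls_head, cls_head_2)   # larger = more similar
print(euclidean_dist, cosine_sim)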
