How can I get the vector of a word using BERT?

I need to get word vectors using BERT and found this function, which I think should be the one I need:

import torch
import transformers

def get_bert_embed_matrix(sentences):
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model_config = transformers.AutoConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model = transformers.AutoModel.from_pretrained('bert-base-uncased', config=model_config)
    tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased')
    for i in sentences:
        tokenized_text = tokenizer.tokenize(i)
        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
        tokens_tensor = torch.tensor([indexed_tokens])
        model.eval()
        outputs = model(tokens_tensor)
        hidden_states = outputs[2]  # all hidden layers, available because output_hidden_states=True
        # concatenate the last four layers along the feature dimension
        word_embed_6 = torch.cat([hidden_states[i] for i in [-1, -2, -3, -4]], dim=-1)
    return word_embed_6

Does the method return vectors for sub-words or for words?

Topic: bert, representation, word-embeddings, nlp

Category: Data Science


About the first piece of code you posted:

Judging from its apparent behavior, I would say your code computes the average of all the subword vectors in a sentence, rather than a separate vector for each word.
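The distinction matters because BERT's WordPiece tokenizer splits rarer words into several subword pieces, so the hidden states contain one row per subword token, not per word. A quick illustrative check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('embeddings are great'))
# ['em', '##bed', '##ding', '##s', 'are', 'great'] -> 6 subword tokens for 3 words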

To compute word-level representations, you should average only the subwords belonging to a specific word, not all subwords in the sentence.
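Here is a minimal sketch of that idea, assuming a fast tokenizer (so that word_ids() is available); the function name word_vectors and the rest of the wiring are only illustrative, not your code:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
model.eval()

def word_vectors(sentence):
    encoded = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        last_hidden = model(**encoded).last_hidden_state[0]  # (num_subwords, 768)
    word_ids = encoded.word_ids(batch_index=0)  # maps each subword to its word index (None for [CLS]/[SEP])
    per_word = {}
    for idx, wid in enumerate(word_ids):
        if wid is None:
            continue
        per_word.setdefault(wid, []).append(last_hidden[idx])
    # average the subword vectors belonging to each word
    return torch.stack([torch.stack(vecs).mean(dim=0) for _, vecs in sorted(per_word.items())])

This returns one 768-dimensional vector per word, however many subword pieces each word was split into.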

As a side note, I would suggest not reusing variable names, as it makes the code confusing. In your code, you reuse i both for the sentence in the outer loop and for the layer index inside the list comprehension.


About the second piece of code you posted:

It seems to add up the subword embeddings of each word (using only the last BERT layer) and then concatenate the resulting per-word vectors into a single tensor for the whole sentence (whose length would then be the number of words).
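In other words, something along the lines of this sketch (illustrative only, not your exact code; same setup as above, but summing instead of averaging and using only the last layer):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
model.eval()

def word_vectors_summed(sentence):
    encoded = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        last_hidden = model(**encoded).last_hidden_state[0]  # last BERT layer only
    word_ids = encoded.word_ids(batch_index=0)
    sums = {}
    for idx, wid in enumerate(word_ids):
        if wid is None:  # skip special tokens
            continue
        sums[wid] = sums[wid] + last_hidden[idx] if wid in sums else last_hidden[idx]
    # one summed vector per word, stacked in word order
    return torch.stack([sums[w] for w in sorted(sums)])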
