How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings for a sentence using Bio_ClinicalBERT, a sentence of 8 words produced 11 token IDs (plus [CLS] and [SEP]), because "embeddings" is an out-of-vocabulary word that gets split into em, ##bed, ##ding, ##s.

I would like to know whether there are any aggregation strategies that make sense, apart from taking the mean of these vectors.

import torch
from transformers import AutoTokenizer, AutoModel

# download and load the model
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
model = AutoModel.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')

sentences = ['This framework generates embeddings for each input sentence']


#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')


#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

print(encoded_input['input_ids'].shape)

Output: torch.Size([1, 13])

for token in encoded_input['input_ids'][0]:
  print(tokenizer.decode([token]))

Output:

[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
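
For reference, this is roughly what I do at the moment: mean-pool the sub-word vectors back into one vector per original word. It is only a sketch and assumes the tokenizer is a "fast" tokenizer, so that word_ids() is available on the encoding.

# Sketch: mean-pool sub-word vectors into one vector per original word.
# Assumes a fast tokenizer (so encoded_input.word_ids() is available).
token_embeddings = model_output.last_hidden_state[0]   # (num_tokens, hidden_size)
word_ids = encoded_input.word_ids(0)                    # None for [CLS]/[SEP], else the word index

word_vectors = []
for word_idx in sorted({w for w in word_ids if w is not None}):
    positions = [i for i, w in enumerate(word_ids) if w == word_idx]
    word_vectors.append(token_embeddings[positions].mean(dim=0))

word_vectors = torch.stack(word_vectors)                # (num_words, hidden_size)
print(word_vectors.shape)                               # torch.Size([8, 768])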

Topic: huggingface, transformer, tokenization, stanford-nlp, nlp

Category Data Science


I'm not sure there is a need for aggregation; in other words, you may have a pipeline mismatch. BERT's WordPiece (sub-word) tokenization is specifically designed to feed a set of downstream pipelines, and the whole point of the sub-word scheme is to be able to handle OOV words. By aggregating the sub-word tokens, you may be throwing away exactly the mechanism that lets your later pipeline cope with OOV words.

If you are looking for whole-word vectors and want to handle OOV words, I would recommend looking at FastText instead. FastText represents each word as a bag of character n-grams, and for an OOV word it builds a vector by aggregating the vectors of that word's n-grams learned during training. The benefit is that this aggregation step need not be part of your pipeline, and you can use the resulting vectors in any downstream task (except, of course, pipelines that expect BERT sub-word tokens).
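
As a minimal sketch of that idea using gensim's FastText implementation (the toy corpus and hyperparameters below are placeholders; in practice you would train on your own clinical text or load pretrained FastText vectors):

from gensim.models import FastText

# Toy corpus -- replace with your own tokenized (clinical) sentences.
sentences = [
    ['this', 'framework', 'generates', 'embeddings', 'for', 'each', 'input', 'sentence'],
    ['word', 'vectors', 'are', 'useful', 'for', 'downstream', 'tasks'],
]

# Character n-grams (min_n..max_n) are what make OOV lookups possible later.
ft = FastText(sentences, vector_size=100, window=3, min_count=1,
              min_n=3, max_n=6, epochs=10)

# Lookup works even for a word never seen during training: the vector is
# built by summing the vectors of the word's character n-grams.
oov_vector = ft.wv['embeddinggs']   # deliberately misspelled, hence OOV
print(oov_vector.shape)             # (100,)

Because the n-gram vectors are stored inside the model, no extra aggregation step is needed at lookup time.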
