How do I get word embeddings for out-of-vocabulary words using a transformer model?
When I tried to get word embeddings for a sentence using Bio_ClinicalBERT, an 8-word sentence produced 11 token ids (plus [CLS] and [SEP]), because "embeddings" is an out-of-vocabulary word/token that gets split into em, ##bed, ##ding, ##s.
I would like to know whether there are any sensible aggregation strategies for these subword vectors other than taking their mean (a minimal sketch of that mean pooling is included after the output below).
import torch
from transformers import AutoTokenizer, AutoModel

# download and load the pretrained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
model = AutoModel.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
sentences = ['This framework generates embeddings for each input sentence']
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
print(encoded_input['input_ids'].shape)
Output: torch.Size([1, 13])
for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))
Output:
[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
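For reference, here is a minimal sketch of the mean aggregation I referred to above. It assumes the default fast tokenizer is loaded (so word_ids() is available) and pools the subword vectors of each original word into a single vector:

# Mean-pool subword vectors back into one vector per original word.
# Assumes a fast tokenizer, so encoded_input.word_ids() is available.
token_embeddings = model_output.last_hidden_state[0]   # (num_tokens, hidden_size)
word_ids = encoded_input.word_ids(batch_index=0)        # word index per token, None for [CLS]/[SEP]

word_vectors = {}
for word_idx in set(wid for wid in word_ids if wid is not None):
    positions = [i for i, wid in enumerate(word_ids) if wid == word_idx]
    word_vectors[word_idx] = token_embeddings[positions].mean(dim=0)

print(len(word_vectors))        # 8 vectors for the 8-word sentence
print(word_vectors[3].shape)    # "embeddings" (word index 3) -> torch.Size([768])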
Topic huggingface transformer tokenization stanford-nlp nlp
Category Data Science