Does ValueError: 'rat' is not in list mean the word doesn't exist in the tokenizer?

Does this error mean that the word doesn't exist in the tokenizer?

return sent.split( ).index(word)
ValueError: 'rat' is not in list

The code sequence looks like this:

def get_word_idx(sent: str, word: str) -> int:
    # Raises ValueError if `word` is not one of the
    # whitespace-separated words of `sent`.
    return sent.split().index(word)

def process(sentences):
    for sent in sentences:
        tokens = tokenizer.tokenize(sent)
        for token in tokens:
            idx = get_word_idx(sent, token)

The sentence split returns ['long', 'restaurant', 'table', 'with', 'rattan', 'rounded', 'back', 'chairs'], and I think rattan is the problem here.

Topic: bert, tokenization, word-embeddings, nlp

Category: Data Science


First, this kind of tokenizer doesn't work from a dictionary of predefined whole words, so "adding a new token" to the tokenizer wouldn't solve the problem anyway.

Instead it uses cues in the text to separate the tokens. The most common cue is of course a whitespace character " ", but there are many cases where it's more complex than that. This is why there are many cases where sent.split(" ").index(word) would not see the same tokens that the tokenizer produces (around punctuation marks, for example).
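To make the mismatch concrete, here is a small sketch (plain Python, with a regex standing in for a real tokenizer) comparing punctuation-aware tokenization with sent.split(" "):

```python
import re

sent = "long restaurant table, with rattan chairs."

# A punctuation-aware tokenizer makes punctuation its own token...
tokens = re.findall(r"\w+|[^\w\s]", sent)
# ...while a plain whitespace split leaves it glued to the word.
words = sent.split(" ")

print(tokens)  # ['long', 'restaurant', 'table', ',', 'with', 'rattan', 'chairs', '.']
print(words)   # ['long', 'restaurant', 'table,', 'with', 'rattan', 'chairs.']

# 'table' is among the tokens, but words.index('table') raises
# ValueError: 'table' is not in list, because the split only
# contains 'table,' with the comma attached.
```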

Also, keep in mind that a subword tokenizer such as BERT's WordPiece can split a single word into several pieces, so rattan may well come out as something like rat followed by ##tan. That would explain why the loop ends up looking for rat even though the sentence only contains rattan. Btw rattan is a real word, in case that was the concern.
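For illustration, here is a toy greedy longest-match subword tokenizer in the WordPiece style. The vocabulary is invented for this example (it is not BERT's real vocabulary), but it shows how a word like rattan can yield a piece that never appears in the whitespace split:

```python
# Toy greedy longest-match ("WordPiece-style") subword tokenizer.
# The vocabulary below is made up for this sketch, NOT BERT's real vocab.
VOCAB = {"long", "restaurant", "table", "with", "rat", "##tan",
         "rounded", "back", "chairs"}

def subword_tokenize(word):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:            # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

sent = "long restaurant table with rattan rounded back chairs"
tokens = [p for w in sent.split() for p in subword_tokenize(w)]

print(subword_tokenize("rattan"))  # ['rat', '##tan']
# 'rat' is now among the tokens, but sent.split() only contains
# 'rattan', so sent.split().index('rat') raises ValueError.
```

So when the loop looks up every token from tokenizer.tokenize(sent) in sent.split(), any word that got split into subword pieces fails with exactly the traceback shown in the question.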
