Does ValueError: 'rat' is not in list mean the word doesn't exist in the tokenizer?

Does this error mean that the word doesn't exist in the tokenizer?

return sent.split( ).index(word)
ValueError: 'rat' is not in list

The code sequence looks like this:

def get_word_idx(sent: str, word: str) -> int:
    # Raises ValueError if `word` is not one of the
    # whitespace-separated words of `sent`.
    return sent.split().index(word)

def process(sentences):
    for sent in sentences:
        tokens = tokenizer.tokenize(sent)
        for token in tokens:
            idx = get_word_idx(sent, token)

The sentence split returns ['long', 'restaurant', 'table', 'with', 'rattan', 'rounded', 'back', 'chairs'], and I think rattan is the problem here.

Topic: bert, tokenization, word-embeddings, nlp

Category: Data Science


First, this kind of tokenizer doesn't work from a dictionary of predefined whole words, so "adding a new token" to the tokenizer wouldn't solve the problem anyway.

Instead it uses cues in the text to separate the tokens. The most common cue is of course a whitespace character " ", but there are many cases where it's more complex than that. This is why there are many cases where sent.split(" ").index(word) would not see the same tokens that the tokenizer produces (around punctuation marks, for example).
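To make the mismatch concrete, here is a small sketch (plain Python, with a regex standing in for a real tokenizer) comparing punctuation-aware tokenization with sent.split(" "):

```python
import re

sent = "long restaurant table, with rattan chairs."

# A punctuation-aware tokenizer makes punctuation its own token...
tokens = re.findall(r"\w+|[^\w\s]", sent)
# ...while a plain whitespace split leaves it glued to the word.
words = sent.split(" ")

print(tokens)  # ['long', 'restaurant', 'table', ',', 'with', 'rattan', 'chairs', '.']
print(words)   # ['long', 'restaurant', 'table,', 'with', 'rattan', 'chairs.']

# 'table' is among the tokens, but words.index('table') raises
# ValueError: 'table' is not in list, because the split only
# contains 'table,' with the comma attached.
```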

Also, keep in mind that a subword tokenizer such as BERT's WordPiece can split a single word into several pieces, so rattan may well come out as something like rat followed by ##tan. That would explain why the loop ends up looking for rat even though the sentence only contains rattan. Btw rattan is a real word, in case that was the concern.
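For illustration, here is a toy greedy longest-match subword tokenizer in the WordPiece style. The vocabulary is invented for this example (it is not BERT's real vocabulary), but it shows how a word like rattan can yield a piece that never appears in the whitespace split:

```python
# Toy greedy longest-match ("WordPiece-style") subword tokenizer.
# The vocabulary below is made up for this sketch, NOT BERT's real vocab.
VOCAB = {"long", "restaurant", "table", "with", "rat", "##tan",
         "rounded", "back", "chairs"}

def subword_tokenize(word):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:            # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

sent = "long restaurant table with rattan rounded back chairs"
tokens = [p for w in sent.split() for p in subword_tokenize(w)]

print(subword_tokenize("rattan"))  # ['rat', '##tan']
# 'rat' is now among the tokens, but sent.split() only contains
# 'rattan', so sent.split().index('rat') raises ValueError.
```

So when the loop looks up every token from tokenizer.tokenize(sent) in sent.split(), any word that got split into subword pieces fails with exactly the traceback shown in the question.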
