Dealing with Hugging Face models' tokens

I have a few questions about tokenizing words/characters/emojis for different Hugging Face models.

From my understanding, a model will only perform at its best during inference if the tokens of the input sentence are among the tokens that the model's tokenizer was trained on.

My questions are:

  1. Is there an easy way to find out whether a particular word/emoji is compatible with the model, i.e. was included when the model's tokenizer was trained? (in the Hugging Face context)

  2. If the word/emoji was not included during model training, what are the best ways to deal with it so that model inference gives the best possible output when it appears in the input? (For 2., it would be nice if the answer could use my Hugging Face setup below, if possible.)

My current setup is as follows:

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
pre_trained_model = 'facebook/bart-large-mnli'
task = 'zero-shot-classification'
candidate_labels = ['happy', 'sad', 'angry', 'confused']
tokenizer = AutoTokenizer.from_pretrained(pre_trained_model)
model = AutoModelForSequenceClassification.from_pretrained(pre_trained_model)
zero_shot_classifier = pipeline(model=model, tokenizer=tokenizer, task=task)

zero_shot_classifier('today is a good day ', candidate_labels=candidate_labels)

Any help is appreciated

For your first question, you can check if the tokenizer covers a certain string with the following:

text = 'today is a good day '
# round-trip: encode the text to ids, then map the ids back to a string
ids2string = lambda ids: tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(ids))
ids2string(tokenizer(text)['input_ids'])
> <s>today is a good day </s>

If an emoji was not included when the tokenizer was created, the tokenizer will replace it with the unknown special token. You can access that token with tokenizer.special_tokens_map['unk_token']. You can drop such tokens or keep them; it shouldn't make much of a difference.
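If you want a quick programmatic check, a minimal sketch (reusing the tokenizer from the setup above; the emoji is just an illustrative input) is to look for the unknown-token id in the encoded input:

text = 'today is a good day 🙂'
ids = tokenizer(text)['input_ids']
# anything the tokenizer cannot represent gets mapped to unk_token_id
has_unknown = tokenizer.unk_token_id is not None and tokenizer.unk_token_id in ids
print(has_unknown)

Note that byte-level BPE tokenizers, like the one bart-large-mnli uses, split most unseen characters into byte pieces rather than mapping them to the unknown token, so this check can come back False even for symbols that were rare in training.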

Alternatively, if you're going to fine-tune, you can add your own tokens to the existing tokenizer with tokenizer.add_special_tokens. However, in this case the embeddings for those tokens will be randomly initialized, and you need to train them.
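As a rough sketch of what that could look like (using add_tokens for plain tokens; add_special_tokens works similarly with a dict, and the emojis below are just examples):

# add tokens the tokenizer doesn't already know about (illustrative examples)
new_tokens = ['🤯', '🙃']
num_added = tokenizer.add_tokens(new_tokens)

# the model's embedding matrix has to grow to match the new vocabulary size;
# the new rows start out randomly initialized and only become useful after fine-tuning
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))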
