As already mentioned in the comments, when you see the UNK token in tokenizing and NLP, it most likely indicates an unknown word.

For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token to mark where the missing word is. So if "house" is the missing word, the sentence after tokenizing will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']


The `<unk>` tag can simply be used to tell the model that there is content which is not semantically important to the output. This is a choice made via the selection of a dictionary: if a word is not in the dictionary we have chosen, we are saying that we have no valid representation for it (or that we are simply not interested in it).
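As a minimal sketch of this idea, dictionary-based UNK replacement can be as simple as a lookup per token (the vocabulary below is a made-up example):

```python
# Hypothetical vocabulary; any word outside it has no valid representation.
vocab = {"my", "is", "big", "the", "dog"}

def tokenize(sentence, vocab, unk="<unk>"):
    # Replace any word not found in the vocabulary with the UNK token.
    return [w if w in vocab else unk for w in sentence.lower().split()]

print(tokenize("my house is big", vocab))
# ['my', '<unk>', 'is', 'big']
```

Real tokenizers do the same thing at a larger scale, usually after learning the vocabulary from a training corpus.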

Other tags are commonly used to group such things together; UNK is not the only one.

For example, `<EMOJI>` might replace any token that is found in our list of defined emojis. We keep some information, namely that there is a symbol representing emotion of some kind, but we discard exactly which emotion. You can think of many more examples where this is helpful, or where you simply don't have the right (labelled) data to make full semantic use of the contents.
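The same lookup pattern works for grouping tags like `<EMOJI>` (the emoji list here is a hypothetical stand-in for whatever set you define):

```python
# Hypothetical list of "known" emojis to collapse into a single tag.
EMOJIS = {"🙂", "😢", "😡"}

def replace_emojis(tokens, tag="<EMOJI>"):
    # Keep the fact that an emoji occurred, drop which one it was.
    return [tag if t in EMOJIS else t for t in tokens]

print(replace_emojis(["great", "movie", "🙂"]))
# ['great', 'movie', '<EMOJI>']
```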
