As already mentioned in the comments, when you see the UNK token in tokenizing and NLP, it most likely indicates an unknown word.

For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token to mark where the missing word is. So if "house" is the missing word, the sentence after tokenizing will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']


The `<unk>` tag can simply be used to tell the model that there is content which is not semantically important to the output. This is a choice made via the selection of a dictionary: if a word is not in the dictionary we have chosen, we are saying that we have no valid representation for it (or that we are simply not interested in it).
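As a minimal sketch of this idea, dictionary-based UNK replacement can be as simple as a lookup per token (the vocabulary below is a made-up example):

```python
# Hypothetical vocabulary; any word outside it has no valid representation.
vocab = {"my", "is", "big", "the", "dog"}

def tokenize(sentence, vocab, unk="<unk>"):
    # Replace any word not found in the vocabulary with the UNK token.
    return [w if w in vocab else unk for w in sentence.lower().split()]

print(tokenize("my house is big", vocab))
# ['my', '<unk>', 'is', 'big']
```

Real tokenizers do the same thing at a larger scale, usually after learning the vocabulary from a training corpus.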

Other tags are commonly used to group such things together; UNK is not the only one.

For example, `<EMOJI>` might replace any token that is found in our list of defined emojis. We keep some information, namely that there is a symbol representing emotion of some kind, but we discard exactly which emotion. You can think of many more examples where this is helpful, or where you simply don't have the right (labelled) data to make full semantic use of the contents.
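The same lookup pattern works for grouping tags like `<EMOJI>` (the emoji list here is a hypothetical stand-in for whatever set you define):

```python
# Hypothetical list of "known" emojis to collapse into a single tag.
EMOJIS = {"🙂", "😢", "😡"}

def replace_emojis(tokens, tag="<EMOJI>"):
    # Keep the fact that an emoji occurred, drop which one it was.
    return [tag if t in EMOJIS else t for t in tokens]

print(replace_emojis(["great", "movie", "🙂"]))
# ['great', 'movie', '<EMOJI>']
```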
