Fine-tuned transformer for name similarity predicts pairs as similar far too often

I fine-tuned a transformer for classification to compute similarity between names. Here is a toy example of the training data:

name0 name1 label
Test  Test  y
Test  Hi    n

I fine-tuned the transformer on these labels, feeding it pairs of names, since its tokenizer accepts two pieces of text.
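
A minimal sketch of the setup, assuming a BERT-style checkpoint ("bert-base-uncased" is just a placeholder for whichever model I fine-tune) and the toy pairs above:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # The tokenizer takes two text arguments and joins each pair into one
    # sequence: [CLS] name0 [SEP] name1 [SEP]
    enc = tokenizer(["Test", "Test"], ["Test", "Hi"],
                    padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor([1, 0])  # y -> 1, n -> 0

    out = model(**enc, labels=labels)
    out.loss.backward()  # an optimizer step etc. would follow in the real loop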

I noticed some really weird behavior. At prediction time, some pairs have a very high chance of being predicted as similar simply because they contain repeated words. For example,

name0        name1       label
Hi Hi Hi     dsfds       ?

has a high chance of being predicted as y!
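
This is how I check the predictions (a sketch; "my-finetuned-model" is a placeholder for the directory where I saved the fine-tuned model):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("my-finetuned-model")
    model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-model")
    model.eval()

    enc = tokenizer("Hi Hi Hi", "dsfds", return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    print(probs)  # the probability of the "y" class comes out surprisingly high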

In general, there are some names that get predicted as y no matter what you pair them with.

Has anyone else noticed this behavior? Is it because I am fine-tuning on only about 1,000 examples?

At the moment, I am trying to augment my data with the following (a small generation sketch follows the example below):

  • Empty names
  • Random chars (always the same)

E.g.

name0 name1 label
Test        n
      Test  n
Test  dsfsd n
dsfsd Test  n

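This is roughly how I generate those negative pairs; the helper name and the junk string "dsfsd" are just illustrative:

    import pandas as pd

    def augment_negatives(names, junk="dsfsd"):
        # Pair each real name with an empty string and with a fixed junk
        # string, in both positions, always labelled as not similar.
        rows = []
        for name in names:
            rows.append({"name0": name, "name1": "",   "label": "n"})
            rows.append({"name0": "",   "name1": name, "label": "n"})
            rows.append({"name0": name, "name1": junk, "label": "n"})
            rows.append({"name0": junk, "name1": name, "label": "n"})
        return pd.DataFrame(rows)

    print(augment_negatives(["Test"]))
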
Unfortunately, I still see the same behavior.

Topic: huggingface, transformer, finetuning, classification, similarity

Category: Data Science


For NLP-related tasks, the transformer tries its best to match your output distribution, but as with all ML tasks it will fail on some parts of your data. Your task is quite similar to BERT's next-sentence prediction. Make sure you use the [SEP] token to separate the two names.
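
A quick check, assuming a BERT tokenizer: if you pass the two names as separate arguments, the [SEP] token is inserted for you (and token_type_ids mark the two segments).

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tokenizer("Hi Hi Hi", "dsfds")
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
    # e.g. ['[CLS]', 'hi', 'hi', 'hi', '[SEP]', ..., '[SEP]']
    # (the exact word pieces for "dsfds" depend on the vocabulary)

If instead you concatenate the two names into a single string yourself without [SEP], the model cannot tell where one name ends and the other begins.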
