Train consistent embeddings using text from different domains
I would like to train text embeddings using texts from two different domains (podcast summaries and movie summaries). The embeddings should capture similarities on topics the texts talk about, but ignore as much as possible the style the texts were written in.
The embeddings I currently produce with the multilingual Universal Sentence Encoder separate clearly by domain, which puts considerable distance between two documents that are strongly similar in topic but written in different styles.
I tried to find the embedding dimensions that separate the documents by domain most clearly and to remove them, in order to bring elements from different domains closer together. This did not help much; the domain information seems to be spread across too many dimensions.
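For reference, this is roughly what I tried, sketched with synthetic data standing in for the real encoder outputs (the dimension count and the score function are illustrative, not exactly what I used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for encoder outputs: 512-dim embeddings for each domain.
podcast_emb = rng.normal(0.0, 1.0, (100, 512))
movie_emb = rng.normal(0.0, 1.0, (100, 512))
# Simulate a domain shift concentrated in the first 5 dimensions.
podcast_emb[:, :5] += 2.0

# Score each dimension by how well its means separate the domains
# (difference of means scaled by the pooled standard deviation).
diff = podcast_emb.mean(axis=0) - movie_emb.mean(axis=0)
pooled_std = np.sqrt((podcast_emb.var(axis=0) + movie_emb.var(axis=0)) / 2)
score = np.abs(diff) / (pooled_std + 1e-8)

# Drop the k most domain-discriminative dimensions, keep the rest.
k = 5
drop = np.argsort(score)[-k:]
keep = np.setdiff1d(np.arange(512), drop)
podcast_clean = podcast_emb[:, keep]
movie_clean = movie_emb[:, keep]
```

On the synthetic data this recovers the shifted dimensions, but on my real embeddings no small set of dimensions carries most of the domain signal.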
How could I train an embedding that confines the influence of the domain to a few dimensions, so that the remaining ones can be used to find documents with similar topics? I would prefer ideas that modify the trained embeddings over those that try to remove the stylistic differences before training (such as removing words that are more prominent in one domain); that is, I am mainly interested in post-processing the model embeddings.
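To illustrate the kind of post-processing I have in mind: instead of dropping whole dimensions, one could project out a single "domain direction" (here taken as the difference of the domain means, which is an assumption on my part, not something I have validated). A minimal sketch with hypothetical data:

```python
import numpy as np

def remove_domain_direction(emb_a, emb_b):
    # Unit vector along which the two domains differ most, on average.
    d = emb_a.mean(axis=0) - emb_b.mean(axis=0)
    d = d / np.linalg.norm(d)
    # Project every embedding onto the hyperplane orthogonal to d,
    # stripping the dominant domain component from both sets.
    strip = lambda X: X - np.outer(X @ d, d)
    return strip(emb_a), strip(emb_b), d

rng = np.random.default_rng(1)
podcasts = rng.normal(size=(50, 512)) + 1.5  # simulated domain offset
movies = rng.normal(size=(50, 512))
podcasts_c, movies_c, d = remove_domain_direction(podcasts, movies)
```

After the projection, the embeddings have no component left along the mean-difference direction; whether that is enough to make cross-domain topic similarity usable is exactly what I am unsure about.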
Topic: embeddings, domain-adaptation, nlp
Category: Data Science