Embeddings from a Transformer-based model for paragraphs or documents (like Doc2Vec)

I have a dataset that contains sequences of different lengths; on average, the sequence length is 600. The dataset looks like this:

S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat']
.........................................
.........................................
S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk']

The number of unique actions in the dataset is fixed, but a given sequence may not contain all of the actions.

Using Doc2Vec (the Gensim library, specifically), I was able to extract an embedding for each sequence and use it for downstream tasks (e.g., clustering or similarity measures).
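
For reference, a minimal sketch of this Gensim Doc2Vec setup (the vector size and training parameters are illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sequences = [
        ['Walk', 'Eat', 'Going school', 'Eat', 'Watching movie', 'Walk', 'Sleep'],
        ['Eat', 'Eat', 'Going school', 'Walk', 'Walk', 'Watching movie', 'Eat'],
    ]

    # Each sequence becomes a tagged "document"; each action is a "word".
    documents = [TaggedDocument(words=seq, tags=[str(i)])
                 for i, seq in enumerate(sequences)]

    model = Doc2Vec(documents, vector_size=64, window=5, min_count=1, epochs=40)

    # One fixed-size vector per sequence, usable for clustering or similarity.
    embedding_s1 = model.dv['0']   # model.docvecs['0'] in Gensim < 4.0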

Since Transformers are the state of the art for NLP tasks, I am wondering whether a Transformer-based model can be used for a similar task. While searching for such a technique I came across sentence-transformers, but it uses a pretrained BERT model to encode the sentences, and BERT is trained on natural language while my data is not language. Is there any way I can get embeddings from my dataset using a Transformer-based model?

Topic: doc2vec, bert, transformer, embeddings, nlp

Category: Data Science


Yes, you can get embeddings for the words. In sentence-transformers you need to look at where and how the word embeddings are extracted. The sentence embedding in BERT is usually simply a max or mean over all word (token) embeddings, so BERT does provide word embeddings.
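
For illustration, a minimal sketch of mean pooling over BERT token embeddings with the Hugging Face transformers library (the model name and the choice of mean pooling are just one common setup):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')

    batch = tokenizer(['Walk Eat Going school Sleep'],
                      return_tensors='pt', padding=True)
    with torch.no_grad():
        token_embeddings = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Mean over tokens, ignoring padding via the attention mask.
    mask = batch['attention_mask'].unsqueeze(-1)
    sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)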

The question is whether that helps you at all since, as you mentioned, at the end of the day you have a set of actions, not natural language. The point is that if you have enough data you can fine-tune BERT (see the sentence-transformers documentation) and probably capture the differences and similarities between your samples.
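
If fine-tuning a pretrained BERT feels mismatched to your non-language data, an alternative is to train a small Transformer encoder from scratch on your fixed action vocabulary. Below is a minimal PyTorch sketch of that idea; the class name, sizes, and mean-pooling readout are all illustrative assumptions:

    import torch
    import torch.nn as nn

    class ActionEncoder(nn.Module):
        def __init__(self, n_actions, d_model=64, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(n_actions, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)

        def forward(self, x):                 # x: (batch, seq_len) action ids
            h = self.encoder(self.embed(x))   # (batch, seq_len, d_model)
            return h.mean(dim=1)              # mean-pool into one vector per sequence

    # Map each action to an integer id, e.g. {'Walk': 0, 'Eat': 1, ...}.
    model = ActionEncoder(n_actions=10)
    ids = torch.randint(0, 10, (2, 600))      # two sequences of length 600
    embeddings = model(ids)                   # shape (2, 64)
    # Note: positional encoding is omitted for brevity, and the model is
    # untrained here; train it (e.g., with a masked-action objective)
    # before using the embeddings.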

Another way is to get your embeddings through an LSTM, which is well suited to such sequences; you can train one using Keras, for example.
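
A minimal Keras sketch of that idea, where the final LSTM state serves as the sequence embedding and the model is trained on next-action prediction (just one possible objective; all sizes are illustrative):

    import tensorflow as tf

    n_actions, d = 10, 64                     # illustrative; reserve id 0 for padding
    inputs = tf.keras.Input(shape=(None,), dtype='int32')
    x = tf.keras.layers.Embedding(n_actions + 1, d, mask_zero=True)(inputs)
    state = tf.keras.layers.LSTM(d)(x)        # final hidden state = sequence embedding
    next_action = tf.keras.layers.Dense(n_actions + 1, activation='softmax')(state)

    model = tf.keras.Model(inputs, next_action)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # After training, reuse the encoder part to embed whole sequences:
    encoder = tf.keras.Model(inputs, state)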

I also suggest trying SGT (Sequence Graph Transform) and comparing its embeddings with those of the other methods so you can choose the best one.
