How to just represent words as embeddings with a pretrained BERT?

I don't have enough data (i.e. I don't have enough texts) --- I have only around 4k words in my dictionary. I need to compare given words, so I need to represent them as embeddings.

After representing the words I want to cluster them and find similar vectors (i.e. words). Maybe then I even want to classify them into given classes (the classification there is unsupervised --- since I don't have labeled data to train on).

I know that almost any task can be solved with BERT, i.e. by fine-tuning the final layer.

Given all of the above, I have two questions; any answers/hints/anything are really appreciated since I'm stuck on this:

  1. How do I just extract embeddings from BERT for some dictionary of words and use these word representations for further work?
  2. Can the following problem be solved inside BERT with fine-tuning: a) load a dictionary of words into BERT, b) load the given classes (words representing each class, e.g. fashion, nature), c) perform an unsupervised classification task?

Topic bert representation unsupervised-learning word-embeddings nlp

Category Data Science


With regard to a dictionary of words, there can be no single dictionary of embeddings for BERT, because BERT embeddings incorporate contextual information (i.e. the surrounding words in the sentence change the embedding of your target word). In theory, you could construct a dictionary for your words by passing single-word sentences (though a single word may be broken down into multiple tokens).

If you're looking for an easy, practical way to get pretrained BERT embeddings, HuggingFace makes it straightforward.

Here is a simple code snippet using Python and, specifically, PyTorch:

from transformers import BertTokenizer, BertModel
import torch

my_sentence = "Whatever your sentence is"

# Load the pretrained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the sentence and return PyTorch tensors
encoded_input = tokenizer(my_sentence, return_tensors="pt")
output = model(**encoded_input)

final_layer = output.last_hidden_state

The final_layer tensor will now hold the embeddings (768-dimensional for bert-base) for each token in your input sentence. Note that the zeroth token is the start token ([CLS]) and the last token is the end token ([SEP]).
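If each input is a single word from your dictionary, a minimal follow-up sketch (assuming final_layer comes from the snippet above) is to drop the [CLS] and [SEP] positions and average whatever subword embeddings remain:

# final_layer has shape (1, seq_len, 768) for a single input.
# Position 0 is [CLS] and the last position is [SEP]; the positions in
# between are the (possibly several) subword tokens of your word.
word_vector = final_layer[0, 1:-1, :].mean(dim=0)   # shape: (768,)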

If you have a list of sentences (perhaps of single words, in your case, if you are building a dictionary), you can use the above code in a batched manner. However, you will need to extract an attention mask from the tokenizer and pass it to the model (to account for sentences of different lengths).

Example:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# list_of_sentences is your list of strings (e.g. single words)
encoded_inputs = tokenizer(list_of_sentences, padding=True,
                           truncation=True, return_tensors="pt")
ids = encoded_inputs['input_ids']
mask = encoded_inputs['attention_mask']

output = model(input_ids=ids, attention_mask=mask)
final_layer = output.last_hidden_state
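To collapse final_layer (shape: batch_size x max_len x 768) into one vector per input, a common approach is masked mean-pooling, i.e. averaging only over the real tokens and ignoring padding. A minimal sketch, assuming mask and final_layer from the snippet above:

# mask has shape (batch_size, max_len): 1 for real tokens, 0 for padding.
# Expand it so it broadcasts over the 768 embedding dimensions.
expanded_mask = mask.unsqueeze(-1).type_as(final_layer)

# Zero out padded positions, sum over the sequence and divide by the number
# of real tokens -> one 768-dim vector per input ([CLS]/[SEP] are still
# included; zero those positions in the mask first if you want to exclude them).
summed = (final_layer * expanded_mask).sum(dim=1)
counts = expanded_mask.sum(dim=1).clamp(min=1e-9)
pooled = summed / counts                             # shape: (batch_size, 768)

These pooled vectors can then go into your similarity or clustering step, e.g. cosine similarity via torch.nn.functional.cosine_similarity or k-means from scikit-learn.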

  1. BERT does not give word representations, but subword representations (see this). Nevertheless, it is common to average the representations of the subwords in a word to obtain a "word-level" representation.

  2. You may try to handle this as a normal tagging problem, where the tag of each word is the class associated with the word, much like part-of-speech (POS) tagging (e.g. this) or named entity recognition (NER) (e.g. this). Normally, you associate the tag with either the first or the last subword token of the word. If you prepare a dataset that way, you can fine-tune BERT to perform word tagging with the classes you need. If you only have the words, you could find some text corpus (ideally of the intended domains) and apply the described data preparation process. A rough setup sketch follows below.
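As a rough illustration of that tagging setup (not part of the original answer; the class names, example sentence, and the first-subword label alignment are assumptions for the sketch), HuggingFace's BertForTokenClassification can be fine-tuned with one label per word:

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

# Hypothetical classes; replace with your own.
labels = ["fashion", "nature", "other"]
label2id = {name: i for i, name in enumerate(labels)}
id2label = {i: name for name, i in label2id.items()}

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# One made-up training sentence with one class label per word.
words = ["the", "new", "dress", "collection"]
word_labels = ["other", "other", "fashion", "fashion"]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to subword tokens: label only the first subword
# of each word and ignore the rest (-100 is ignored by the loss).
token_labels = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None or word_id == previous_word:
        token_labels.append(-100)
    else:
        token_labels.append(label2id[word_labels[word_id]])
    previous_word = word_id

outputs = model(**encoding, labels=torch.tensor([token_labels]))
loss = outputs.loss   # backpropagate this inside a normal training loop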
