I have stumbled upon different sources that state that each sentence starts with a CLS token when passed to BERT. I'm passing text documents with multiple sentences to BERT. This would mean that for each sentence, I would have one CLS token. The pooled output, however, only returns a single vector of size hidden_size. Does this mean that all CLS tokens are somehow compressed into one (averaging?)? Or does my text document only contain one single CLS token for the whole …
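For illustration, here is a minimal sketch (assuming the HuggingFace `transformers` API) of what I am observing; it counts the [CLS] ids in an encoded two-sentence document and prints the pooled output shape:

```python
# Minimal sketch (HuggingFace transformers assumed): count [CLS] tokens in a
# multi-sentence input and inspect the shape of the pooled output.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "This is the first sentence. This is the second sentence."
encoded = tokenizer(text, return_tensors="pt")

num_cls = (encoded["input_ids"] == tokenizer.cls_token_id).sum().item()
print(num_cls)                           # how many [CLS] ids the document actually got

outputs = model(**encoded)
print(outputs.pooler_output.shape)       # (1, hidden_size) -- a single pooled vector
print(outputs.last_hidden_state.shape)   # (1, seq_len, hidden_size)
```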
I was trying to follow this notebook to fine-tune BERT for the text summarization task. Everything was fine until I came to this instruction in the Evaluation section, to evaluate my model: model = EncoderDecoderModel.from_pretrained("checkpoint-500") An error appears: OSError: checkpoint-500 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and …
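For context, this is roughly the call I am making (a sketch, assuming checkpoint-500 is the local folder written by the Trainer, containing config.json and the model weights):

```python
# Sketch: point from_pretrained at the checkpoint folder as a local path so it is
# not interpreted as a model identifier on the Hugging Face Hub.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("./checkpoint-500")  # local directory assumed
```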
I'm working my way around using BERT for coreference resolution. I'm following this highly cited paper, BERT for Coreference Resolution: Baselines and Analysis (https://arxiv.org/pdf/1908.09091.pdf). I have the following questions; the details can't be found easily in the paper, so I hope you can help me out. What's the input? Is it antecedents + paragraph? What's the output? Clusters of <mention, antecedent> pairs? More importantly, what's the loss function? For comparison, in another highly cited paper by [Clark et al.] using Reinforcement Learning, it's very clear about …
I am fairly new to BERT, and I am willing to test two approaches to get "the most similar words" to a given word to use in Snorkel labeling functions for weak supervision. The first approach was to use word2vec with the pre-trained word embeddings "word2vec-google-news-300" to find the most similar words @labeling_function() def lf_find_good_synonyms(x): good_synonyms = word_vectors.most_similar("good", topn=25) ##Similar words are extracted here good_list = syn_list(good_synonyms) ##syn_list just returns the stemmed similar word return POSITIVE if any(word in x.stemmed for …
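Roughly, the first approach looks like this (a sketch assuming gensim's downloader API and Snorkel's `labeling_function`; the stemming helper `syn_list` is replaced here by a plain list comprehension, and `x.stemmed` is a field of my own data points):

```python
# Sketch of the word2vec approach; POSITIVE=1 and ABSTAIN=-1 are assumed Snorkel label values.
import gensim.downloader as api
from snorkel.labeling import labeling_function

POSITIVE, ABSTAIN = 1, -1
word_vectors = api.load("word2vec-google-news-300")  # pre-trained KeyedVectors

@labeling_function()
def lf_find_good_synonyms(x):
    # most_similar returns (word, similarity) pairs for the 25 nearest neighbours of "good"
    good_synonyms = [word for word, _ in word_vectors.most_similar("good", topn=25)]
    return POSITIVE if any(word in x.stemmed for word in good_synonyms) else ABSTAIN
```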
I want to fine-tune BERT by training it on a domain dataset of my own. The domain is specific and includes many terms that probably weren't included in the original dataset BERT was trained on. I know I have to use BERT's tokenizer, as the model was originally trained on its embeddings. To my understanding, words unknown to the tokenizer will be mapped to the [UNK] token. What if some of these words are common in my dataset? Does it make sense …
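Concretely, is something along these lines the right direction? (a sketch assuming the HuggingFace `transformers` API; the domain terms are made-up examples):

```python
# Sketch: add frequent domain terms to the tokenizer and resize the embedding matrix.
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

domain_terms = ["myelofibrosis", "ruxolitinib"]      # hypothetical domain vocabulary
num_added = tokenizer.add_tokens(domain_terms)

# New embedding rows are randomly initialised and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(num_added, len(tokenizer))
```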
I have a school project where I need to use the embeddings generated by BERT (for example, mBERT) together with a classifier like an SVM, a CNN... Any help, please. Thank you!
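What I have in mind is something like the following (a minimal sketch, assuming HuggingFace `transformers` with `bert-base-multilingual-cased` and scikit-learn; the training data is a toy placeholder):

```python
# Sketch: use the [CLS] vector of mBERT as a fixed sentence embedding and train an SVM on top.
import torch
from sklearn.svm import SVC
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] embedding per sentence

X_train = embed(["good movie", "terrible plot"])   # toy data
y_train = [1, 0]
clf = SVC().fit(X_train, y_train)
print(clf.predict(embed(["really enjoyable film"])))
```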
I have a set of data that contains sequences of different lengths. On average the sequence length is 600. The dataset looks like this: S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep'] S2 = ['Eat','Eat','Going school','Walk','Walk','Watching movie'.......,'Eat'] ......................................... ......................................... S50 = ['Walk','Going school','Eat','Eat','Watching movie','Sleep',.......,'Walk'] The number of unique actions in the dataset is fixed, which means some sequences may not contain all of the actions. By using Doc2Vec (the Gensim library in particular), I was able to extract an embedding for each of the sequences …
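This is roughly what I did with Doc2Vec (a sketch assuming gensim 4.x; the two sequences stand in for my real S1...S50):

```python
# Sketch: one TaggedDocument per action sequence, then one fixed-size vector per sequence.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sequences = [
    ["Walk", "Eat", "Going school", "Eat", "Watching movie", "Walk", "Sleep"],
    ["Eat", "Eat", "Going school", "Walk", "Walk", "Watching movie", "Eat"],
]
documents = [TaggedDocument(words=seq, tags=[i]) for i, seq in enumerate(sequences)]

model = Doc2Vec(documents, vector_size=32, window=3, min_count=1, epochs=50)
print(model.dv[0].shape)  # embedding of the first sequence (gensim 4.x `dv` accessor)
```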
I am new to NLP and looking for some direction, since after all my reading I haven't found a definite approach and the subject matter is vast. The project is to focus on specific fields of support comments using NLP and Python. The goal is to verify, from the comments, that a comment is in fact a well-made comment for that field. One requirement is that the context of the entered text is relevant to the field it …
Consider the following scenario: Suppose two lists of words $L_{1}$ and $L_{2}$ are given. $L_{1}$ contains only badly written phrases (like '4ge' instead of 'age' or 'blwe' instead of 'blue', etc.). On the other hand, each element of $L_{2}$ is a well-written version of an element of $L_{1}$. Here is an example: $$L_{1}=[\ldots,\ \text{dqta 5ciencc},\ \ldots,\ \text{s7ack exch9nge},\ \ldots],$$ $$L_{2}=[\ldots,\ \text{stack exchange},\ \ldots,\ \text{data science},\ \ldots].$$ Problem: Is there any strategy to try to predict which element $w^{\prime}$ in $L_{2}$ is the syntactically correct counterpart …
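As a concrete baseline for what I mean by a strategy, here is a sketch using only the standard library: match each noisy phrase to its closest candidate by character-level similarity (this is just one possible notion of "syntactically closest"):

```python
# Sketch: nearest well-written phrase by difflib's SequenceMatcher ratio.
import difflib

L1 = ["dqta 5ciencc", "s7ack exch9nge"]
L2 = ["stack exchange", "data science"]

for noisy in L1:
    match = difflib.get_close_matches(noisy, L2, n=1, cutoff=0.0)
    print(noisy, "->", match[0])
```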
I'm working on an NMT model in which the input and the target sentences are from the same language (but the grammar differs). I'm planning to pre-train and use BERT, since I'm working with a small dataset and a low-resource/under-resourced language. So is it possible to feed BERT into the seq2seq encoder/decoder?
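What I am picturing is something like this (a sketch assuming HuggingFace's `EncoderDecoderModel`, which can warm-start both the encoder and the decoder from a BERT checkpoint; the multilingual checkpoint is only an assumption):

```python
# Sketch: BERT2BERT seq2seq, with both sides initialised from the same pre-trained BERT.
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```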
I have a sentiment analysis dataset that is labeled in three categories: positive, negative, and neutral. I also have a list of words (mostly nouns), for which I want to calculate the sentiment value, to understand "how" (positively or negatively) these entities were talked about in the dataset. I have read some online resources like blogs and thought about a couple of approaches for calculating the sentiment score for a particular word X. Calculate how many data instances (sentences) which …
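The first approach I had in mind, sketched on toy data in plain Python: score a word by the share of positive versus negative sentences that mention it.

```python
# Sketch: count-based sentiment score for a word X over labeled sentences.
dataset = [
    ("the battery life is great", "positive"),
    ("the battery died after a day", "negative"),
    ("the screen is fine", "neutral"),
]

def word_sentiment(word, data):
    pos = sum(1 for text, label in data if word in text.split() and label == "positive")
    neg = sum(1 for text, label in data if word in text.split() and label == "negative")
    total = pos + neg
    return (pos - neg) / total if total else 0.0

print(word_sentiment("battery", dataset))  # 0.0 -> mentioned equally in both classes
```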
I recently downloaded the CamemBERT model to fine-tune it for my purposes. Upon unzipping the file, the contents are: Upon loading the model.pt file using PyTorch: import torch model = torch.load(model_saved_at) I saw that the model was an OrderedDict containing the following keys: args model optimizer_history extra_state last_optimizer_state As the names suggest, most of them are OrderedDicts themselves, with the exception of args, which belongs to the class argparse.Namespace. Using vars() we can see args only contains some hyperparameters and values …
I am a HuggingFace newbie and I am fine-tuning a BERT model (distilbert-base-cased) using the Transformers library, but the training loss is not going down; instead I am getting loss: nan - accuracy: 0.0000e+00. My code largely follows the boilerplate in the [HuggingFace course][1]: model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3) opt = Adam(learning_rate=lr_scheduler) model.compile(optimizer=opt, loss=loss, metrics=['accuracy']) model.fit( encoded_train.data, np.array(y_train), validation_data=(encoded_val.data, np.array(y_val)), batch_size=8, epochs=3 ) where my loss function is: loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) The learning rate is calculated like so: lr_scheduler …
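For reference, here is a small sanity check I can run before `model.fit` (a sketch; `y_train` is the same label array I pass to `fit`): with `num_labels=3` and `SparseCategoricalCrossentropy`, the labels need to be integers in {0, 1, 2}.

```python
# Sketch: verify the label range matches num_labels before training.
import numpy as np

y = np.array(y_train)
print(y.dtype, y.min(), y.max())
assert y.min() >= 0 and y.max() <= 2, "labels outside [0, num_labels) can produce nan loss"
```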
I am using the BERT model in order to classify stereotypes in sentences. I wanted to know if there is a way to automate the optimization of hyperparameters such as 'epochs', 'batch size' or 'learning rate' with some function similar to 'GridSearchCV' (I don't know whether this function can be used with the BERT model; if it can, let me know) so I don't have to test combinations of values 'by hand'. I attach part of my …
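To be explicit about what I would like to automate, here is a sketch of the grid I am currently testing by hand; `train_and_evaluate` is a hypothetical helper that fine-tunes BERT once and returns a validation score.

```python
# Sketch: brute-force grid over epochs, batch size and learning rate.
from itertools import product

param_grid = {
    "epochs": [2, 3, 4],
    "batch_size": [8, 16],
    "learning_rate": [2e-5, 3e-5, 5e-5],
}

results = []
for epochs, batch_size, lr in product(*param_grid.values()):
    # train_and_evaluate is a hypothetical wrapper around my fine-tuning loop.
    score = train_and_evaluate(epochs=epochs, batch_size=batch_size, learning_rate=lr)
    results.append(((epochs, batch_size, lr), score))

best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, best_score)
```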
How to get sentence embedding using BERT? from transformers import BertTokenizer tokenizer=BertTokenizer.from_pretrained('bert-base-uncased') sentence='I really enjoyed this movie a lot.' #1. Tokenize the sequence: tokens=tokenizer.tokenize(sentence) print(tokens) print(type(tokens)) #2. Add [CLS] and [SEP] tokens: tokens = ['[CLS]'] + tokens + ['[SEP]'] print(" Tokens are \n {} ".format(tokens)) #3. Padding the input: T=15 padded_tokens=tokens +['[PAD]' for _ in range(T-len(tokens))] print("Padded tokens are \n {} ".format(padded_tokens)) attn_mask=[ 1 if token != '[PAD]' else 0 for token in padded_tokens ] print("Attention Mask are \n {} ".format(attn_mask)) …
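Continuing from the snippet above (a sketch, assuming the `transformers` `BertModel`): convert the padded tokens to ids, run the model, and take the [CLS] vector as the sentence embedding.

```python
# Sketch: finish the pipeline started above (reuses tokenizer, padded_tokens, attn_mask).
import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

token_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
input_ids = torch.tensor([token_ids])
attention_mask = torch.tensor([attn_mask])

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

sentence_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] vector, shape (1, 768)
print(sentence_embedding.shape)
```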
I'm working on an NLP task, using BERT, and I have a little doubt about GPU memory. I already made a model (using DistilBERT) since I had out-of-memory problems with TensorFlow on an RTX 3090 (24 GB of GPU RAM, but ~20.5 GB usable) with the BERT base model. To make it work, I limited my data to 1.1 million sentences in the training set (truncating sentences at 128 words), and about 300k in validation, but using a high batch size (256). Now I have …
I'm trying to train a BERT tokenizer on a custom dataset, but when running tokenizer.tokenize on sample data, it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary 88 tokens long. After saving this and reusing it in tensorflow_text.BertTokenizer, I get [88] for all tokens of the two test sentences provided. Fully reproducible example code: import tensorflow as tf import tensorflow_text from pathlib import …
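To narrow the problem down, here is a reduced sketch of what I expect to happen, assuming `tensorflow_text.BertTokenizer` accepts a path to a plain-text vocabulary with one token per line (the vocabulary here is a small hand-written stand-in, not the one learned from my dataset):

```python
# Sketch: each known word should map to its own id, not to one repeated index.
import pathlib
import tensorflow_text as tf_text

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]
vocab_path = pathlib.Path("toy_vocab.txt")
vocab_path.write_text("\n".join(vocab))

tokenizer = tf_text.BertTokenizer(str(vocab_path), lower_case=True)
print(tokenizer.tokenize(["hello worlds"]))  # ragged tensor of per-word wordpiece ids
```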
Asking this question in the Data Science forum, as it seems well suited for data-science-related questions: https://stackoverflow.com/questions/55158554/how-transformer-is-bidirectional-machine-learning/55158766?noredirect=1#comment97066160_55158766 I am coming from the Google BERT context (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say it is bidirectional by nature; to make the attention unidirectional, a mask has to be applied. Basically, a transformer takes keys, values and queries as input, uses an encoder-decoder architecture, and applies attention to these keys, queries and values. What I understood …
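To make sure we are talking about the same thing, this is what I understand by "applying a mask to make attention unidirectional", sketched in PyTorch: each position gets a -inf score for any later position before the softmax.

```python
# Sketch: causal (unidirectional) attention mask over raw attention scores.
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw query-key attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(masked_scores, dim=-1)  # each position attends only to itself and the past
print(attn)
```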
The torchtext sentencepiece_numericalizer() outputs a generator that yields, for each input sentence, the SentencePiece model's indices corresponding to its tokens. From the generator, I can get the ids. My question is: how do I get the text back after training? For example >>> sp_id_generator = sentencepiece_numericalizer(sp_model) >>> list_a = ["sentencepiece encode as pieces", "examples to try!"] >>> list(sp_id_generator(list_a)) [[9858, 9249, 1629, 1305, 1809, 53, 842], [2347, 13, 9, 150, 37]] How do I convert these ids back to list_a (i.e. "sentencepiece encode as pieces", "examples to …
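This is what I tried for going back from ids to text (a sketch, assuming I still have the trained .model file and can load it directly with the `sentencepiece` library; the path is hypothetical):

```python
# Sketch: decode SentencePiece ids back to strings with the underlying processor.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_user.model")  # hypothetical model path
ids = [[9858, 9249, 1629, 1305, 1809, 53, 842], [2347, 13, 9, 150, 37]]
print([sp.decode(seq) for seq in ids])  # should reproduce the original sentences
```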
My task is to predict relevant words based on a short description of an idea. For example, "SQL is a domain-specific language used in programming and designed for managing data held in a relational database" should produce words like "mysql", "Oracle", "Sybase", "Microsoft SQL Server", etc. My thinking is to treat the initial text as a set of words (after lemmatization and stop-word removal) and predict words that should be in that set. I can then take all of …
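The preprocessing step I have in mind looks like this (a sketch assuming spaCy with the en_core_web_sm model installed):

```python
# Sketch: reduce a description to a set of lemmatised content words.
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
description = ("SQL is a domain-specific language used in programming and designed "
               "for managing data held in a relational database")

doc = nlp(description)
word_set = {tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct}
print(word_set)
```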