Fine Tuning BERT for text summarization

I was trying to follow this notebook to fine-tune BERT for the text summarization task. Everything was fine until I came to this instruction in the Evaluation section, used to evaluate my model: model = EncoderDecoderModel.from_pretrained("checkpoint-500") An error appears: OSError: checkpoint-500 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and …
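The error usually means from_pretrained() cannot find the folder relative to the current working directory and therefore falls back to treating the string as a Hub model id. A minimal sketch, assuming the Trainer wrote its checkpoints next to the notebook (adjust the path to the actual output_dir):

```python
import os
from transformers import EncoderDecoderModel

# "./checkpoint-500" is an assumed relative path; point it at wherever the
# Trainer actually wrote the checkpoint, e.g. "<output_dir>/checkpoint-500".
ckpt_dir = "./checkpoint-500"
assert os.path.isdir(ckpt_dir), f"{ckpt_dir} not found from {os.getcwd()}"

model = EncoderDecoderModel.from_pretrained(ckpt_dir)
```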
Category: Data Science

Large jumps in loss in simple transformer model?

As an exercise, I created a very simple transformer model that just sees the same simple batch of dummy data repeatedly and (one would assume) should quickly learn to fit it perfectly. And indeed, training reaches a loss of zero quickly. However, I noticed that the loss does not stay at zero, or even close to it: there are occasional large jumps in the loss. The script below counts every time that the loss jumps by 10 or more between …
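For reference, a self-contained sketch of the kind of check described above: a tiny TransformerEncoder repeatedly fitting one fixed dummy batch while counting loss jumps of 10 or more. It is not the asker's script, just an illustration of the counting logic.

```python
import torch
import torch.nn as nn

# Fixed dummy batch: same data every step, so the model should overfit it.
torch.manual_seed(0)
x = torch.randn(8, 16, 32)                 # (batch, seq_len, d_model)
y = torch.randint(0, 10, (8, 16))          # fixed dummy targets

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(32, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

prev_loss, jumps = None, 0
for step in range(2000):
    opt.zero_grad()
    logits = head(encoder(x))              # (batch, seq_len, 10)
    loss = loss_fn(logits.reshape(-1, 10), y.reshape(-1))
    loss.backward()
    opt.step()
    # count every step where the loss jumps by 10 or more versus the previous step
    if prev_loss is not None and loss.item() - prev_loss >= 10:
        jumps += 1
    prev_loss = loss.item()

print(f"loss jumps of 10+ observed: {jumps}")
```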
Category: Data Science

Could Attention_mask in T5 be a float in [0,1]?

I was inspecting the T5 model from Hugging Face: https://huggingface.co/docs/transformers/model_doc/t5. attention_mask is documented as: attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. I was wondering whether something "softer" could be used, not only selecting the non-padding tokens but also selecting "how much" attention should be paid to every token. This question is …
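For context, in most Hugging Face models the 0/1 mask is converted into an additive bias on the attention scores before the softmax, roughly as sketched below (the exact constant differs across library versions and dtypes), so intermediate values such as 0.5 do not behave like "half attention":

```python
import torch

# Rough sketch of the usual mask-to-bias conversion; not T5's exact code.
mask = torch.tensor([[1.0, 1.0, 0.5, 0.0]])   # a hypothetical "soft" mask
dtype = torch.float32

bias = (1.0 - mask) * torch.finfo(dtype).min  # 1 -> 0, 0 -> very large negative
print(bias)
# A value of 0.5 still becomes an enormous negative bias, so after the softmax
# that token is effectively masked out rather than attended to "half as much".
```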
Category: Data Science

HuggingFace Transformers is giving loss: nan - accuracy: 0.0000e+00

I am a HuggingFace newbie and I am fine-tuning a BERT model (distilbert-base-cased) using the Transformers library, but the training loss is not going down; instead I am getting loss: nan - accuracy: 0.0000e+00. My code is largely per the boilerplate in the [HuggingFace course][1]: model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3) opt = Adam(learning_rate=lr_scheduler) model.compile(optimizer=opt, loss=loss, metrics=['accuracy']) model.fit( encoded_train.data, np.array(y_train), validation_data=(encoded_val.data, np.array(y_val)), batch_size=8, epochs=3 ) where my loss function is loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) and the learning rate is calculated like so: lr_scheduler …
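With SparseCategoricalCrossentropy and num_labels=3, one frequent cause of nan is labels outside {0, 1, 2} (for example classes coded 1–3 or -1), another is a schedule that drives the learning rate to zero or below. A small sanity check, assuming y_train from the snippet above:

```python
import numpy as np

# Hypothetical sanity check for the setup above (y_train comes from the asker's data).
y_train = np.array(y_train)
print("label values:", np.unique(y_train))
# SparseCategoricalCrossentropy with num_labels=3 expects integer labels in {0, 1, 2};
# values such as -1 or 3 typically produce nan loss and zero accuracy.
assert set(np.unique(y_train)) <= {0, 1, 2}, "labels out of range for num_labels=3"
```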
Category: Data Science

Unable to debug where torch Adam optimiser is going wrong

I was implementing a training loop in VS Code. I have created an Adam optimizer using the XLM-RoBERTa model as follows: xlm_r_model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels = NUM_LABELS, output_attentions=False, output_hidden_states=False ) xlm_r_model.to(device) optimizer = torch.optim.Adam(xlm_r_model.parameters(), lr=LR) Then at the following line: optimizer.step() VS Code simply terminates the execution, without any error stack trace. So I debugged to find out exactly where this happens. I reached this line, which makes the F.adam(...) call: Weirdly, on GitHub, torch.optim.adam does not have this line. It seems that …
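Since VS Code exits without a stack trace, it helps to rule the editor out first. A minimal reproduction sketch to run from a plain terminal (NUM_LABELS and LR are replaced by placeholder values):

```python
# Run as "python repro.py" from a plain terminal, outside the VS Code debugger,
# to see whether optimizer.step() itself crashes.
import torch
from transformers import XLMRobertaForSequenceClassification

model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

batch = {
    "input_ids": torch.randint(0, 250002, (2, 16)),        # dummy ids within vocab
    "attention_mask": torch.ones(2, 16, dtype=torch.long),
    "labels": torch.tensor([0, 1]),
}
loss = model(**batch).loss
loss.backward()
optimizer.step()   # if this also dies here, the problem is torch, not VS Code
print("step completed, loss =", loss.item())
```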
Category: Data Science

Conversational model returns empty string after a while

I've been experimenting with Hugging Face models and I've set up a chatbot with DialoGPT. It works pretty well, but after a while it stops answering and just returns empty strings. Before this it starts to give shorter and shorter answers. Any idea what can cause such behavior? I'm using the medium-sized model with a max_length of 2000 and added repetition_penalty=1.3, but other than that I didn't change any other parameters. I also add the previous message back …
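One likely culprit is the ever-growing chat history: once the concatenated history approaches max_length, generate() has no room left to produce new tokens, which tends to show up as progressively shorter and finally empty replies. A sketch (not the asker's exact setup) that caps the history:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

MAX_HISTORY_TOKENS = 512          # assumed budget, kept well under max_length
chat_history_ids = None

def reply(user_text):
    global chat_history_ids
    new_ids = tokenizer.encode(user_text + tokenizer.eos_token, return_tensors="pt")
    bot_input = new_ids if chat_history_ids is None else torch.cat([chat_history_ids, new_ids], dim=-1)
    bot_input = bot_input[:, -MAX_HISTORY_TOKENS:]          # drop the oldest turns
    chat_history_ids = model.generate(
        bot_input,
        max_length=bot_input.shape[-1] + 200,               # budget relative to the input
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.3,
    )
    # decode only the newly generated part
    return tokenizer.decode(chat_history_ids[:, bot_input.shape[-1]:][0], skip_special_tokens=True)
```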
Category: Data Science

How to train a Task Specific Knowledge Distillation model using Hugging face model

I was referring to this code: https://github.com/philschmid/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/knowledge-distillation.ipynb from @philschmid. I could follow most of the code, but had a few doubts. Please help me to clarify them. In the code below: class DistillationTrainer(Trainer): def __init__(self, *args, teacher_model=None, **kwargs): super().__init__(*args, **kwargs) self.teacher = teacher_model # place teacher on same device as student self._move_model_to_device(self.teacher,self.model.device) self.teacher.eval() When I pass in a fine-tuned teacher model, is it never fine-tuned further in the process of task-specific distillation training, as the line self.teacher.eval() in the code suggests? Only …
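On the doubt itself: self.teacher.eval() puts the teacher in inference mode, and since its parameters are never handed to the optimizer it stays frozen; only the student receives gradient updates. Below is a hedged sketch of the compute_loss that typically accompanies such a trainer (temperature and alpha values are assumptions, not necessarily the notebook's):

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()                       # teacher is frozen: eval mode, no optimizer
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss       # hard-label cross entropy for the student
        with torch.no_grad():                     # no gradients ever flow to the teacher
            outputs_teacher = self.teacher(**inputs)
        # soften both distributions and match them with KL divergence
        kd_loss = F.kl_div(
            F.log_softmax(outputs_student.logits / self.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.temperature, dim=-1),
            reduction="batchmean",
        ) * (self.temperature ** 2)
        loss = self.alpha * student_loss + (1.0 - self.alpha) * kd_loss
        return (loss, outputs_student) if return_outputs else loss
```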
Category: Data Science

How to save hugging face fine tuned model using pytorch and distributed training

I am fine-tuning a masked language model from XLM-RoBERTa large on Google machine specs. When I copy the model using gsutil and subprocess from the container to a GCP bucket, it gives me an error. Versions: torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 transformers==4.17.0 I am using a pre-trained Hugging Face model. I launch it as a train.py file, which I copy inside the Docker image and use Vertex AI (GCP) to launch it using ContainerSpec: machineSpec = MachineSpec(machine_type="a2-highgpu-4g",accelerator_count=4,accelerator_type="NVIDIA_TESLA_A100") python -m torch.distributed.launch --nproc_per_node 4 train.py --bf16 I am …
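With torch.distributed.launch and 4 GPUs, every process executes the save/copy code, and concurrent gsutil calls on the same files commonly fail. A hedged sketch that saves and uploads only from the main process; the bucket URI, output_dir, and the trainer/tokenizer names are placeholders for the script's own objects:

```python
import os
import subprocess

# LOCAL_RANK is set by the launcher (otherwise parse the --local_rank argument).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
output_dir = "/tmp/xlmr-mlm-finetuned"            # placeholder local path
BUCKET_URI = "gs://your-bucket/xlmr-mlm-finetuned"  # placeholder bucket path

if local_rank == 0:
    # only rank 0 writes the checkpoint and copies it to GCS
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    subprocess.run(["gsutil", "-m", "cp", "-r", output_dir, BUCKET_URI], check=True)
```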
Category: Data Science

Should weight distribution change more when fine-tuning transformers-based classifier?

I'm using a pre-trained DistilBERT model from Huggingface with a custom classification head, which is almost the same as in the reference implementation: class PretrainedTransformer(nn.Module): def __init__(self, target_classes): super().__init__() base_model_output_shape = 768 self.base_model = DistilBertModel.from_pretrained("distilbert-base-uncased") self.classifier = nn.Sequential( nn.Linear(base_model_output_shape, out_features=base_model_output_shape), nn.ReLU(), nn.Dropout(0.2), nn.Linear(base_model_output_shape, out_features=target_classes), ) for layer in self.classifier: if isinstance(layer, nn.Linear): layer.weight.data.normal_(mean=0.0, std=0.02) if layer.bias is not None: layer.bias.data.zero_() def forward(self, input_, y=None): X, length, attention_mask = input_ base_output = self.base_model(X, attention_mask=attention_mask)[0] base_model_last_layer = base_output[:, 0] cls = self.classifier(base_model_last_layer) return cls During …
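One way to make "how much should the weights move" concrete is to snapshot the encoder parameters before fine-tuning and measure how far each tensor drifts afterwards. A sketch, assuming model is the PretrainedTransformer defined above:

```python
import torch

# Snapshot the encoder weights before training (model is the PretrainedTransformer above).
before = {n: p.detach().clone() for n, p in model.base_model.named_parameters()}

# ... run the fine-tuning loop here ...

# Compare each parameter tensor against its pre-training snapshot.
for name, p in model.base_model.named_parameters():
    delta = (p.detach() - before[name]).abs().mean().item()
    scale = before[name].abs().mean().item() + 1e-12
    print(f"{name}: mean |dW| = {delta:.2e} ({delta / scale:.1%} of mean |W|)")
```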
Category: Data Science

Hugging face Model Output 'last_hidden_state'

I am using the Huggingface BERT model; the model gives a Seq2SeqModelOutput as output. The output contains the past hidden states and the last hidden state. These are my questions: What is the use of the hidden states? How do I pass my hidden states to my output layer? What I actually want is the output tokens; how do I get the prediction tokens from the model?
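Assuming the checkpoint is an encoder-decoder/seq2seq model (facebook/bart-base below is only a stand-in), the usual way to get output tokens is not to feed last_hidden_state to a layer by hand but to call generate(), which runs the decoder and LM head over the hidden states and returns token ids:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical sketch with a seq2seq checkpoint; swap in whichever model was trained.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

# generate() applies the LM head on top of the decoder hidden states and
# returns token ids, which can then be decoded back into text.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```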
Category: Data Science

How to use label smoothing for single label classification in hugging face models

I am training a binary classification model using the XLM-RoBERTa large model. I am using training data with hard labels, either 1 or 0. Is it advisable to perform label smoothing on this training procedure for hard labels? If so, what would be the right way to do it? Here is my code: tokenizer = tr.XLMRobertaTokenizer.from_pretrained("/home/scp/AIML/tokenizer_xlm2") train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors="pt") val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt") test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512, return_tensors="pt") class SEDataset(torch.utils.data.Dataset): def …
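If the training runs through the Trainer API, TrainingArguments already exposes label_smoothing_factor, which softens the hard 0/1 targets inside the loss. A sketch, where 0.1 is an assumed value and model/train_dataset/val_dataset come from the existing setup:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    label_smoothing_factor=0.1,   # with 2 classes, hard label 1 -> ~0.95 and 0 -> ~0.05
)

trainer = Trainer(
    model=model,                  # the XLM-RoBERTa classification model
    args=training_args,
    train_dataset=train_dataset,  # SEDataset instances from the asker's code
    eval_dataset=val_dataset,
)
```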
Category: Data Science

Overfitting in Huggingface's TFBertForSequenceClassification

I'm using Huggingface's TFBertForSequenceClassification for multilabel tweet classification. During training the model achieves good accuracy, but the validation accuracy is poor. I've tried to reduce the overfitting with some dropout but the performance is still poor. The model is as follows: # Get and configure the BERT model config = BertConfig.from_pretrained("bert-base-uncased", hidden_dropout_prob=0.5, num_labels=13) bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", config=config) optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=0.00015, clipnorm=0.01) loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True) metric = tf.keras.metrics.CategoricalAccuracy('accuracy') bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric]) bert_model.summary() The summary is as follows: When I fit …
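Beyond dropout, early stopping on validation loss is a cheap guard against overfitting with this kind of setup. A Keras sketch, with train_dataset and val_dataset as placeholders for the actual data:

```python
import tensorflow as tf

# Stop when validation loss stops improving and restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,
    restore_best_weights=True,
)

bert_model.fit(
    train_dataset,                 # placeholder for the tokenized training data
    validation_data=val_dataset,   # placeholder for the validation split
    epochs=10,
    callbacks=[early_stop],
)
```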
Category: Data Science

How to use is_split_into_words with Huggingface NER pipeline

I am using Huggingface transformers for NER, following this excellent guide: https://huggingface.co/blog/how-to-train. My incoming text has already been split into words. When tokenizing during training/fine-tuning I can use tokenizer(text, is_split_into_words=True) to tokenize the incoming text. However, I can't figure out how to do the same in a pipeline for predictions. For example, the following works (but requires the incoming text to be a string): s1 = "Here is a sentence" p1 = pipeline("ner",model=model,tokenizer=tokenizer) p1(s1) But the following raises this error: Exception: …
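The pipeline expects raw strings, so one workaround is to bypass it for pre-split input: tokenize with is_split_into_words=True, run the model directly, and map predictions back to words via word_ids(). A sketch, assuming a fast tokenizer and the fine-tuned PyTorch model/tokenizer from the guide:

```python
import torch

words = ["Here", "is", "a", "sentence"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**enc).logits                 # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

word_ids = enc.word_ids(batch_index=0)
labels, prev = [], None
for tok_idx, w_idx in enumerate(word_ids):
    if w_idx is not None and w_idx != prev:      # take the first sub-token of each word
        labels.append(model.config.id2label[pred_ids[tok_idx]])
    prev = w_idx

print(list(zip(words, labels)))
```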
Category: Data Science

How to do NER predictions with Huggingface BERT transformer

I am trying to do a prediction on a test data set without any labels for an NER problem. Here is some background. I am doing named entity recognition using TensorFlow and Keras, with Huggingface transformers. I have two datasets: a train dataset and a test dataset. The training set has labels; the test set does not. Below you will see what a tokenized sentence looks like, what its labels look like, and what it looks like after encoding …
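Without labels, prediction is just encode, take the logits, argmax, and map the ids back through id2label. A hedged TF sketch, assuming a fast tokenizer and the token-classification model from the training setup:

```python
import numpy as np

# Hypothetical example sentence; model/tokenizer come from the training code.
sentences = [["John", "lives", "in", "Berlin"]]
enc = tokenizer(sentences, is_split_into_words=True, padding=True,
                truncation=True, return_tensors="tf")

logits = model(dict(enc)).logits                 # (batch, seq_len, num_labels)
pred_ids = np.argmax(logits, axis=-1)

for b, sent in enumerate(sentences):
    word_ids = enc.word_ids(batch_index=b)
    seen = set()
    for tok_idx, w_idx in enumerate(word_ids):
        if w_idx is not None and w_idx not in seen:   # first sub-token per word
            seen.add(w_idx)
            print(sent[w_idx], model.config.id2label[int(pred_ids[b][tok_idx])])
```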
Category: Data Science

Transformer similarity fine-tuned way too often predicts pairs as similar

I fine-tuned a transformer for classification to compute similarity between names. This is a toy example of the training data (name0, name1, label): ("Test", "Test", y), ("Test", "Hi", n). I fine-tuned the transformer using the label, feeding it pairs of names, as its tokenizer allows feeding two pieces of text. I found a really weird behavior: at prediction time, there exist pairs that have a very high chance of being predicted as similar just because they have repeated …
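When debugging this kind of behavior, it can help to inspect exactly what the classifier receives for a pair and what probability it assigns. A sketch, assuming model and tokenizer are the fine-tuned similarity classifier and its tokenizer:

```python
import torch

# Encode two names as a text pair and look at the actual input the model sees.
enc = tokenizer("Test", "Test", return_tensors="pt", truncation=True)
print(tokenizer.decode(enc["input_ids"][0]))     # shows how the two names are joined

with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)
print(probs)                                     # probability of "similar" vs "not similar"
```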
Category: Data Science

How to prepare texts to BERT/RoBERTa models?

I have an artificial corpus I've built (not a real language) where each document is composed of multiple sentences which, again, aren't really natural-language sentences. I want to train a language model on this corpus (to use it later for downstream tasks like classification or clustering with sentence BERT). How should I tokenize the documents? Do I need to tokenize the input like this: <s>sentence1</s><s>sentence2</s> or <s>the whole document</s>? How should I train? Do I need to train an MLM …
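To see what the two layouts actually produce, one can encode both variants and decode them back. The sketch below uses the standard roberta-base tokenizer purely for illustration; an artificial corpus would normally get its own tokenizer trained from scratch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

# Option 1: sentences passed as a pair, each wrapped in its own special tokens
opt1 = tok("sentence1", "sentence2")
# Option 2: the whole document encoded as one sequence
opt2 = tok("sentence1 sentence2")

print(tok.decode(opt1["input_ids"]))   # roughly <s>sentence1</s></s>sentence2</s>
print(tok.decode(opt2["input_ids"]))   # roughly <s>sentence1 sentence2</s>
```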
Category: Data Science

Finetune XLM-RoBERTa on TF-keras for text classification

I am trying to fine-tune pre-trained XLM-RoBERTa on TensorFlow-Keras, using a dataset in English for text classification. I have used the xlm-roberta-base tokenizer to tokenize the sentences and the roberta-base model via TFRobertaForSequenceClassification. Please find the code below. optimizer=tf.keras.optimizers.SGD(learning_rate=5e-2) model.compile(optimizer = optimizer, loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]) model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=1, verbose=1) I am getting the error below while training the model. Can anyone help me solve it? InvalidArgumentError: indices[2,268] = 124030 is not in [0, 50265) [[node tf_roberta_for_sequence_classification_1/roberta/embeddings/Gather …
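The indices-out-of-range error points to a vocabulary mismatch: the xlm-roberta-base tokenizer produces ids up to roughly 250k, while roberta-base's embedding matrix only has 50,265 rows. Loading the tokenizer and model from the same checkpoint avoids it; a sketch (num_labels=2 is a placeholder for the actual number of classes):

```python
from transformers import AutoTokenizer, TFXLMRobertaForSequenceClassification

# Tokenizer and model from the same checkpoint, so token ids stay inside the
# embedding matrix.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFXLMRobertaForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```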
Category: Data Science

Adding a new token to a transformer model without breaking tokenization of subwords

I'm running an experiment investigating the internal structure of large pre-trained models (BERT and RoBERTa, to be specific). Part of this experiment involves fine-tuning the models on a made-up new word in a specific sentential context and observing its predictions for that novel word in other contexts post-tuning. Because I am just trying to teach it a new word, we freeze the embeddings for the other words during fine-tuning so that only the weights for the new word are updated. …
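For reference, the standard recipe for adding a word as an atomic token is tokenizer.add_tokens plus model.resize_token_embeddings, which appends one new embedding row and leaves the existing subword entries (and their ids) untouched. A sketch with bert-base-uncased and a placeholder novel word:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "blicket" is a placeholder for the made-up word used in the experiment.
num_added = tokenizer.add_tokens(["blicket"])
model.resize_token_embeddings(len(tokenizer))    # grows the matrix by one row

print(num_added, tokenizer.tokenize("a blicket appeared"))  # the new word stays whole
```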
Category: Data Science

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings for a sentence using Bio_ClinicalBERT, for a sentence of 8 words I got 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that gets split into em, bed, ding, s. I would like to know if there are any aggregation strategies available that make sense, apart from taking the mean of these vectors. from transformers import AutoTokenizer, AutoModel # download and load model tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") …
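Besides the mean, element-wise max pooling or simply taking the first sub-token vector are common aggregation choices. A sketch that groups sub-token vectors by word via word_ids() (the example sentence is made up, and a fast tokenizer is assumed):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentence = "word embeddings are useful"          # hypothetical example sentence
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size)

word_ids = enc.word_ids(batch_index=0)
vectors = {}
for w_idx in set(i for i in word_ids if i is not None):
    rows = [t for t, wi in enumerate(word_ids) if wi == w_idx]
    sub = hidden[rows]                           # all sub-token vectors for this word
    vectors[w_idx] = {
        "mean": sub.mean(dim=0),                 # average of the sub-token vectors
        "max": sub.max(dim=0).values,            # element-wise max pooling
        "first": sub[0],                         # first sub-token only
    }
```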
Category: Data Science
