How to prepare texts for BERT/RoBERTa models?

I have an artificial corpus I've built (not a real language) where each document is composed of multiple sentences which again aren't really natural language sentences.

I want to train a language model on this corpus (to use it later for downstream tasks such as classification or clustering with Sentence-BERT).

How to tokenize the documents?

Do I need to tokenize the input

like this: <s>sentence1</s><s>sentence2</s>

or as <s>the whole document</s>?

How to train?

Do I need to train with an MLM objective, an NSP objective, or both?



You can use existing libraries to tokenize.

From the BERT docs on GitHub:

For sentence-level tasks (or sentence-pair) tasks, tokenization is very simple. Just follow the example code in run_classifier.py and extract_features.py. The basic procedure for sentence-level tasks is:

1. Instantiate an instance of tokenizer = tokenization.FullTokenizer.

2. Tokenize the raw text with tokens = tokenizer.tokenize(raw_text).

3. Truncate to the maximum sequence length. (You can use up to 512, but you probably want to use shorter if possible for memory and speed reasons.)

4. Add the [CLS] and [SEP] tokens in the right place.
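
Concretely, those steps look roughly like the following. This is a minimal sketch assuming the tokenization module from the google-research/bert repository and a vocab.txt built for your artificial corpus; the file paths and the sample text are placeholders.

```python
# Minimal sketch of the quoted steps, assuming the `tokenization` module
# from google-research/bert and a vocab.txt built for your own corpus
# (both file paths below are placeholders).
import tokenization

MAX_SEQ_LENGTH = 128  # up to 512 is possible, but shorter is cheaper

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt",
                                       do_lower_case=True)

raw_text = "first pseudo-sentence of one document ..."
tokens = tokenizer.tokenize(raw_text)

# Truncate, leaving room for the two special tokens.
tokens = tokens[: MAX_SEQ_LENGTH - 2]

# Add [CLS] and [SEP] in the right place for a single "sentence".
tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```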

The original paper (Section 3) states:

To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
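
In practice, this means one input sequence can hold either a single span or two spans packed together. A rough illustration, reusing the tokenizer instance from the sketch above (the two pseudo-sentences are placeholders):

```python
# Two "sentences" packed into one BERT input sequence (sketch; `tokenizer`
# is the FullTokenizer instance from the previous snippet).
tokens_a = tokenizer.tokenize("first pseudo-sentence")
tokens_b = tokenizer.tokenize("second pseudo-sentence")

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
# Segment ids (token_type_ids) mark which segment each token belongs to.
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```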

Masked LM (MLM, Task 1) and Next Sentence Prediction (NSP, Task 2) are both part of pre-training in the original paper (Section 3.1). For classification alone you may get by with MLM only, depending on the problem, but the paper's ablations suggest that both objectives contribute to good results (a training sketch follows the quote below). The motivation for NSP is described in the paper in the following words:

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
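
If you decide to pre-train with MLM only, a rough sketch with the Hugging Face transformers library (not the original BERT codebase) could look as follows; the file names, config values and hyperparameters are assumptions you would adapt to your corpus.

```python
# MLM-only pre-training from scratch with Hugging Face transformers (sketch).
# Assumes vocab.txt was built for your corpus and corpus.txt contains one
# document (or pseudo-sentence) per line; all hyperparameters are examples.
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast(vocab_file="vocab.txt")

config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)  # randomly initialised, trained from scratch

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus.txt",
                                block_size=128)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-from-scratch",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

If you also want the NSP objective, BertForPreTraining combines both heads, but you then have to build sentence-pair examples with next-sentence labels yourself.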

For more technical aspects (i.e. using the transformers library), see this discussion on Stack Overflow.
