How to prepare texts for BERT/RoBERTa models?
I have an artificial corpus I've built (not a real language) in which each document is composed of multiple sentences that, again, are not natural-language sentences.
I want to train a language model on this corpus, to use it later for downstream tasks such as classification, or clustering with Sentence-BERT.
How should I tokenize the documents? Do I need to format the input sentence by sentence, like this:

`<s>sentence1</s><s>sentence2</s>`

or as one sequence per document, like this:

`<s>the whole document</s>`?
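For context, here is roughly what I have in mind for the tokenizer, assuming the Hugging Face `tokenizers` library and a RoBERTa-style vocabulary; `corpus.txt`, the vocabulary size, and the output directory are placeholders for my setup:

```python
# Sketch: train a byte-level BPE tokenizer from scratch on the artificial
# corpus, so the vocabulary reflects my "language" rather than English.
# corpus.txt (one document per line) and vocab_size are placeholders.
import os

from tokenizers import ByteLevelBPETokenizer

os.makedirs("my-tokenizer", exist_ok=True)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("my-tokenizer")

# The two input formats I am asking about, encoded with the trained tokenizer:
from transformers import RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("my-tokenizer")

# (a) sentence pair: RoBERTa renders this as <s>sentence1</s></s>sentence2</s>
pair_ids = tok("sentence1", "sentence2")["input_ids"]

# (b) whole document as one sequence: <s>the whole document</s>
doc_ids = tok("the whole document")["input_ids"]

print(tok.decode(pair_ids))
print(tok.decode(doc_ids))
```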
How should I train the model? Do I need to train with the MLM objective, the NSP objective, or both?
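For the pretraining itself, this is the MLM-only variant I am considering, assuming a RoBERTa-style setup (RoBERTa drops NSP entirely); the model size, paths, and training arguments are illustrative placeholders, not tuned values:

```python
# Sketch: pretrain a RoBERTa-style model from scratch with the MLM objective
# only. Paths ("my-tokenizer", "corpus.txt", "my-lm") and hyperparameters
# are placeholders from the tokenizer sketch above.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("my-tokenizer")

# corpus.txt: one document per line.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(encode, batched=True, remove_columns=["text"])

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,  # RoBERTa convention: max_length + 2
)
model = RobertaForMaskedLM(config)

# The collator randomly masks 15% of tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="my-lm",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    data_collator=collator,
    train_dataset=dataset["train"],
)
trainer.train()
```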
Tags: huggingface, bert, transformer, deep-learning, nlp
Category: Data Science