Computer science corpus for training a language model

I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) to train a language model on.

Is there anything out of the box that I could use? I tried to look for the SciBERT corpus, but I cannot find out how to access it.

Thanks!

Topic corpus text text-mining nlp data-mining

Category Data Science


It depends on the domain and language, but here is an example you can adapt.

The English version of the Wikipedia corpus contains more than 1.9 billion words from 4.4 million articles.

You can create virtual corpora from the full corpus that contain only topics of interest, such as biology, investments, Buddhism, psychology, cars, basketball, etc.
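
As a rough sketch of that idea, the snippet below filters a stream of articles down to a topic-specific subset by keyword matching. It assumes you have already extracted the dump into (title, text) pairs; the function names, the keyword list, and the min_hits threshold are all hypothetical choices for illustration, not part of any Wikipedia tooling.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Article = Tuple[str, str]  # (title, text); an assumed input format


def count_words(text: str) -> int:
    """Count word-like tokens using a simple regex split."""
    return len(re.findall(r"\w+", text))


def build_virtual_corpus(
    articles: Iterable[Article],
    keywords: List[str],
    min_hits: int = 2,  # hypothetical threshold; tune for your domain
) -> Iterator[Article]:
    """Yield only articles that mention the topic keywords often enough."""
    patterns = [
        re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE)
        for k in keywords
    ]
    for title, text in articles:
        hits = sum(len(p.findall(text)) for p in patterns)
        if hits >= min_hits:
            yield title, text


# Tiny made-up sample standing in for the real article stream.
sample = [
    ("Quicksort", "Quicksort is a sorting algorithm. The algorithm uses a pivot."),
    ("Basketball", "Basketball is a team sport played on a court."),
]

cs_corpus = list(build_virtual_corpus(sample, ["algorithm", "compiler"]))
total_words = sum(count_words(text) for _, text in cs_corpus)
print(len(cs_corpus), "articles,", total_words, "words")
```

Keyword matching is crude; filtering by article categories would likely give a cleaner computer science subset, but the word-count check at the end works the same way for verifying you reach the 20M to 50M word target.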
