Computer science corpus for training a language model

Question

user

March 16, 2022 15:02

I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) to train a language model on.

Is there anything out of the box that I could use? I tried to look for the SciBERT corpus, but I cannot find how to access it.

Thanks!

Topic corpus text text-mining nlp data-mining

Category Data Science

gust · Accepted Answer · February 20, 2020 15:43

It depends on the domain and language, but here is an example you can adapt.

The English version of the Wikipedia corpus contains more than 1.9 billion words from 4.4 million articles.

You can create virtual corpora from the full corpus that contain only topics of interest, such as biology, investments, Buddhism, psychology, cars, basketball, etc.
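To make the idea of a topic-specific subset concrete, here is a minimal Python sketch. It assumes the articles have already been extracted into dictionaries with `title`, `categories`, and `text` keys. That record layout, the `CS_KEYWORDS` list, and the helper names `filter_by_topic` and `count_words` are all hypothetical, chosen only for illustration, and are not part of any particular Wikipedia tool.

```python
from typing import Dict, Iterable, Iterator, Sequence

# Hypothetical seed terms; tune these for the domain you care about.
CS_KEYWORDS = ("computer science", "algorithm", "programming language",
               "software", "operating system")


def filter_by_topic(articles: Iterable[Dict],
                    keywords: Sequence[str] = CS_KEYWORDS) -> Iterator[Dict]:
    """Yield articles whose category labels mention any of the keywords."""
    lowered = tuple(k.lower() for k in keywords)
    for article in articles:
        # Join all category labels so a single substring check covers them.
        categories = " ".join(article.get("categories", [])).lower()
        if any(k in categories for k in lowered):
            yield article


def count_words(articles: Iterable[Dict]) -> int:
    """Rough size estimate: whitespace-separated tokens across all texts."""
    return sum(len(a["text"].split()) for a in articles)


# Tiny made-up sample standing in for a real article dump.
sample = [
    {"title": "Quicksort", "categories": ["Sorting algorithms"],
     "text": "Quicksort is a divide and conquer algorithm"},
    {"title": "Basketball", "categories": ["Team sports"],
     "text": "Basketball is a team sport"},
]

subset = list(filter_by_topic(sample))
print([a["title"] for a in subset], count_words(subset))
```

Substring matching on category labels is crude and will let through some off-topic articles, so it is worth spot-checking the result. The word count tells you whether the subset reaches the 20M to 50M word range asked about.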
