Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?

It seems like many noteworthy AI systems (Facebook Translate, GPT-3) are trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora. In general, an automatic and universal way of generating a dataset sounds preferable.

Is there any ubiquitous web crawler that does essentially what Common Crawl does but takes a parameter for the language sought? In other words, one that can generate a web-crawled dataset in language X?

(Background: I’d like to be able to create a text dataset in any given language and then train a lemmatizer on it, i.e. a function that maps words in that language to their lemmas.)
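
To make the idea concrete: the closest thing I can picture is post-filtering an existing crawl by detected language rather than crawling per language. Below is a rough sketch of what I mean, assuming a locally downloaded Common Crawl WET file and the warcio and langdetect packages (both are my own choices for illustration, and the file name is just a placeholder):

```python
import gzip

from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from langdetect import detect                       # pip install langdetect

TARGET_LANG = "fi"                          # ISO 639-1 code of the language sought
WET_PATH = "CC-MAIN-example.warc.wet.gz"    # placeholder: a local Common Crawl WET file


def extract_language(wet_path, target_lang, min_chars=200):
    """Yield plain-text records from a WET file whose detected language matches."""
    with gzip.open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":   # WET text records have type 'conversion'
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            if len(text) < min_chars:             # skip near-empty pages
                continue
            try:
                if detect(text[:1000]) == target_lang:
                    yield text
            except Exception:                     # detection can fail on very noisy input
                continue


if __name__ == "__main__":
    for i, page in enumerate(extract_language(WET_PATH, TARGET_LANG)):
        print(f"--- document {i} ({len(page)} chars) ---")
        if i >= 4:                                # preview only a few documents
            break
```

Is there an off-the-shelf tool that already does this kind of filtering at scale?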

Tags: openai-gpt, crawling, nlp

Category: Data Science


Crawling and processing the web at that scale requires huge computing power and storage. Unless you have access to resources like those of Google or Facebook, it isn't very realistic.

Usually you don't need such a huge amount of data to train a lemmatizer, because natural languages have a limited number of morphological patterns. I'd suggest using the Universal Dependencies treebanks, which provide annotated text (including lemmas) for more than 100 languages.
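
As an illustration, here is a minimal sketch of how (word form, lemma) training pairs could be pulled out of a UD treebank, assuming the conllu package and a locally downloaded `.conllu` file (the file name below is just an example treebank):

```python
from conllu import parse_incr   # pip install conllu

UD_FILE = "fi_tdt-ud-train.conllu"   # example: Universal Dependencies Finnish-TDT training file


def lemma_pairs(path):
    """Yield (word form, lemma) pairs from a CoNLL-U file."""
    with open(path, "r", encoding="utf-8") as f:
        for sentence in parse_incr(f):        # stream sentences one at a time
            for token in sentence:
                # skip multi-word tokens and empty nodes, whose ids are tuples
                if isinstance(token["id"], int):
                    yield token["form"], token["lemma"]


pairs = list(lemma_pairs(UD_FILE))
print(pairs[:10])   # e.g. [('koirat', 'koira'), ...] for a Finnish treebank
```

These pairs can then serve directly as supervised training data for whatever lemmatization model you choose.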
