Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?

It seems like many noteworthy AI systems (Facebook Translate, GPT-3) are trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora. In general, an automatic and universal way of generating a dataset sounds preferable.

Is there any ubiquitous web crawler that does essentially what Common Crawl does but takes a parameter for the language sought? In other words, one that can generate a web-crawled dataset in language X?

(Background: I’d like to be able to create a text dataset in any given language and then train a lemmatizer on it, i.e. a function that maps words in that language to their lemmas.)
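
To make the idea concrete: the closest thing I can picture is post-filtering an existing crawl by detected language rather than crawling per language. Below is a rough sketch of what I mean, assuming a locally downloaded Common Crawl WET file and the warcio and langdetect packages (both are my own choices for illustration, and the file name is just a placeholder):

```python
import gzip

from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from langdetect import detect                       # pip install langdetect

TARGET_LANG = "fi"                          # ISO 639-1 code of the language sought
WET_PATH = "CC-MAIN-example.warc.wet.gz"    # placeholder: a local Common Crawl WET file


def extract_language(wet_path, target_lang, min_chars=200):
    """Yield plain-text records from a WET file whose detected language matches."""
    with gzip.open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":   # WET text records have type 'conversion'
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            if len(text) < min_chars:             # skip near-empty pages
                continue
            try:
                if detect(text[:1000]) == target_lang:
                    yield text
            except Exception:                     # detection can fail on very noisy input
                continue


if __name__ == "__main__":
    for i, page in enumerate(extract_language(WET_PATH, TARGET_LANG)):
        print(f"--- document {i} ({len(page)} chars) ---")
        if i >= 4:                                # preview only a few documents
            break
```

Is there an off-the-shelf tool that already does this kind of filtering at scale?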

Tags: openai-gpt, crawling, nlp

Category: Data Science


Crawling and processing the web at that scale requires huge computing power and storage. Unless you have access to resources like those of Google or Facebook, it isn't very realistic.

Usually you don't need such a huge amount of data to train a lemmatizer, because natural languages have a limited number of morphological patterns. I'd suggest using the Universal Dependencies treebanks, which provide annotated text (including lemmas) for more than 100 languages.
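
As an illustration, here is a minimal sketch of how (word form, lemma) training pairs could be pulled out of a UD treebank, assuming the conllu package and a locally downloaded `.conllu` file (the file name below is just an example treebank):

```python
from conllu import parse_incr   # pip install conllu

UD_FILE = "fi_tdt-ud-train.conllu"   # example: Universal Dependencies Finnish-TDT training file


def lemma_pairs(path):
    """Yield (word form, lemma) pairs from a CoNLL-U file."""
    with open(path, "r", encoding="utf-8") as f:
        for sentence in parse_incr(f):        # stream sentences one at a time
            for token in sentence:
                # skip multi-word tokens and empty nodes, whose ids are tuples
                if isinstance(token["id"], int):
                    yield token["form"], token["lemma"]


pairs = list(lemma_pairs(UD_FILE))
print(pairs[:10])   # e.g. [('koirat', 'koira'), ...] for a Finnish treebank
```

These pairs can then serve directly as supervised training data for whatever lemmatization model you choose.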
