Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?
Many noteworthy AI systems (Facebook Translate, GPT-3) are trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora. In general, an automatic and universal way of generating a dataset seems preferable.
Is there any ubiquitous web crawler that does essentially the same thing as Common Crawl but takes a "language sought" parameter? In other words, one that generates a web-crawled dataset in language X?
(Background: I’d like to be able to create a dataset in any given language, then train a lemmatizer on it, i.e. a function that maps words in that language to their lemmas.)
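To make the ask concrete: the common approach is not a per-language crawler but a post-hoc filter, where crawled text is run through a language identifier and only documents in the target language are kept (production pipelines use trained classifiers such as fastText's language ID model). Here is a minimal, self-contained sketch of that filtering step using a toy stopword heuristic; the stopword lists and function names are illustrative, not from any real pipeline:

```python
# Toy language filter for crawled documents.
# Real systems use a trained language-ID classifier; this stopword
# heuristic only illustrates the "filter by language" step.

STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "fr": {"le", "la", "et", "de", "les", "est", "que"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often in text."""
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

def filter_by_language(docs, target):
    """Keep only documents that look like the target language."""
    return [doc for doc in docs if guess_language(doc) == target]

docs = [
    "the cat is in the garden and the dog is not",
    "le chat est dans le jardin et la maison est grande",
    "der hund ist nicht in der küche und das ist gut",
]
print(filter_by_language(docs, "fr"))
# → ['le chat est dans le jardin et la maison est grande']
```

So a generic crawl (e.g. Common Crawl dumps) plus a filtering pass like this can stand in for the language-parameterized crawler I'm describing, though I'd prefer something off the shelf.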
Tags: openai-gpt, crawling, nlp
Category: Data Science