Tools/tutorials for compiling corpora for NLP experiments?
I have a couple of NLP ideas I want to try out (mostly for my own learning). While I have the Python/TensorFlow background for running the actual training and prediction tasks, I don't have much experience in processing large amounts of text data or in the pipelines involved.
Are there any tutorials on how to gather and label data for a large(ish) NLP experiment?
For example: BERT was originally trained on all of English Wikipedia. How do you go about gathering the text of Wikipedia's 5.9 million+ articles into one repository in the right format? How do you go about tokenizing such a large corpus? Do things like NLTK and Beautiful Soup still work on data sets of that size?
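For concreteness, here is roughly the kind of pipeline I'm imagining for the Wikipedia case. This is just a minimal sketch assuming the Hugging Face `datasets` library (with one of its preprocessed Wikipedia dumps) and NLTK's `punkt` tokenizer; I haven't verified it holds up on the full dump:

```python
# Sketch: stream English Wikipedia and tokenize it article by article.
# Assumes `pip install datasets nltk` and the preprocessed "20220301.en"
# Wikipedia config on the Hugging Face Hub; exact config names may differ.
import nltk
from datasets import load_dataset
from nltk.tokenize import word_tokenize

nltk.download("punkt")

# Streaming avoids loading the whole dump (tens of GB) into memory.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

with open("wiki_tokens.txt", "w", encoding="utf-8") as out:
    for i, article in enumerate(wiki):
        tokens = word_tokenize(article["text"])
        out.write(" ".join(tokens) + "\n")
        if i >= 1000:  # small cap for a quick test run
            break
```

Is something along these lines reasonable, or do people use a more purpose-built pipeline (wikiextractor on the raw XML dump, spaCy, etc.) at this scale?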
If I have a website (or multiple websites) on a specific topic that I want to build NLP models for, are there any web-scraping APIs that can pull all of that content into one place? Any tutorials or tools would be very welcome. Thanks!
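To ground that part of the question, this is the sort of scraping I've been experimenting with so far: plain requests + Beautiful Soup on a single page (the URL and selectors below are placeholders). My real question is whether this approach scales to thousands of pages or whether there's a more standard tool for building a topic corpus:

```python
# Sketch: pull the paragraph text from one page into a local file.
# The URL is a placeholder; real sites need their own selectors,
# politeness delays, and a check of robots.txt / terms of service.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-topic-page"  # placeholder
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("scraped_text.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(paragraphs) + "\n")
```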
Tags: corpus, pipelines, nlp, tools
Category: Data Science