Tools/tutorials for compiling corpora for NLP experiments?
I have a couple of NLP ideas I want to try out (mostly for my own learning). While I have the Python/TensorFlow background for running the actual training and prediction tasks, I don't have much experience in processing large amounts of text data or in the pipelines involved.
Are there any tutorials on how to gather and label data for a large(ish) NLP experiment?
For example: BERT was originally trained on all of English Wikipedia. How do you go about gathering the text of Wikipedia's 5.9 million+ articles into one repository in the right format? How do you go about tokenizing such a large corpus? Do things like NLTK and Beautiful Soup still work on data sets of that size?
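For concreteness, here is roughly the kind of pipeline I'm imagining for the Wikipedia case. This is just a minimal sketch assuming the Hugging Face `datasets` library (with one of its preprocessed Wikipedia dumps) and NLTK's `punkt` tokenizer; I haven't verified it holds up on the full dump:

```python
# Sketch: stream English Wikipedia and tokenize it article by article.
# Assumes `pip install datasets nltk` and the preprocessed "20220301.en"
# Wikipedia config on the Hugging Face Hub; exact config names may differ.
import nltk
from datasets import load_dataset
from nltk.tokenize import word_tokenize

nltk.download("punkt")

# Streaming avoids loading the whole dump (tens of GB) into memory.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

with open("wiki_tokens.txt", "w", encoding="utf-8") as out:
    for i, article in enumerate(wiki):
        tokens = word_tokenize(article["text"])
        out.write(" ".join(tokens) + "\n")
        if i >= 1000:  # small cap for a quick test run
            break
```

Is something along these lines reasonable, or do people use a more purpose-built pipeline (wikiextractor on the raw XML dump, spaCy, etc.) at this scale?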
If I have a website (or multiple websites) on a specific topic that I want to build NLP models for, are there any web-scraping APIs that can pull all of that content into one place? Any tutorials or tools would be very welcome. Thanks!
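To ground that part of the question, this is the sort of scraping I've been experimenting with so far: plain requests + Beautiful Soup on a single page (the URL and selectors below are placeholders). My real question is whether this approach scales to thousands of pages or whether there's a more standard tool for building a topic corpus:

```python
# Sketch: pull the paragraph text from one page into a local file.
# The URL is a placeholder; real sites need their own selectors,
# politeness delays, and a check of robots.txt / terms of service.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-topic-page"  # placeholder
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("scraped_text.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(paragraphs) + "\n")
```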
Tags: corpus, pipelines, nlp, tools
Category: Data Science