Best way to store scraped web pages?
I'm going to scrape the HTML from a large number of URLs and store it on my computer, for machine learning purposes (basically, I'm going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML for all of these web pages?
I want to be able to see which URLs I have already scraped, so that I don't scrape them again. For each piece of HTML (1 piece of HTML = all HTML extracted from one URL), I may also want to see which URL it came from (though this may turn out to be an unnecessary requirement). I also want to store the timestamp of when each page was created, so I can read the pages in chronological order (I can extract the timestamp when I download each page), and possibly other metadata. I expect the total size of the HTML to reach many GB, if not TB, and speed (both for reading and scraping) is a high priority.
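To make the metadata requirement concrete, this is roughly the kind of index I imagine keeping alongside the HTML itself. The SQLite schema and column names below are just my own guess at what would be needed, not a settled design:

```python
import sqlite3

# A small index that answers "have I scraped this URL already?" and
# "in what order were the pages created?" without touching the HTML itself.
con = sqlite3.connect("scrape_index.sqlite")
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url        TEXT PRIMARY KEY,   -- the page's URL
        created_ts REAL NOT NULL,      -- timestamp extracted at download time
        html_ref   TEXT NOT NULL       -- where the HTML is stored (file name, offset, ...)
    )
""")
con.execute("CREATE INDEX IF NOT EXISTS idx_created ON pages (created_ts)")

def already_scraped(url: str) -> bool:
    # Skip URLs that are already in the index.
    return con.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone() is not None

def urls_in_chronological_order() -> list[str]:
    # Read pages back in the order they were created.
    return [row[0] for row in con.execute("SELECT url FROM pages ORDER BY created_ts")]
```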
In an ideal world, I would be able to just use the URLs as file names, have one file for each piece of HTML, and store all files in one folder. But I'm not sure this is such a good idea, or even possible, for several reasons:
- URLs contain characters that are not allowed, or at least not suitable, in file names (e.g. '/'). It may be possible to hash each URL and let the hash be the file name, but then I can't tell which URL the HTML came from just by looking at the file name (see the sketch after this list).
- I'm not sure the operating system likes having millions of files in one folder, because of the file I/O overhead. Maybe grouping a number of pieces of HTML into each file would be a better approach.
- I'm not sure whether storing each piece of HTML code in a separate file is optimal when it comes to file reading.
- It's not obvious how to store the metadata for each piece of HTML code.
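To illustrate the hashing idea from the first bullet: something like the sketch below is what I had in mind, where the file name is a hash of the URL and the URL itself lives only in the index above, so the mapping isn't lost. The directory fan-out and function names are just my assumptions, not something I've tested at scale:

```python
import hashlib
from pathlib import Path

ROOT = Path("pages")  # assumed root folder for the scraped HTML

def path_for(url: str) -> Path:
    """Derive a filesystem-safe path from a URL by hashing it.

    The first hex characters of the digest are used as subdirectories,
    so no single folder ends up holding millions of files.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return ROOT / digest[:2] / digest[2:4] / f"{digest}.html"

def save_html(url: str, html: str) -> Path:
    path = path_for(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path  # store str(path), or just the digest, as html_ref in the index
```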