Best way to store scraped web pages?
I'm going to scrape the HTML from a large number of URLs and store it on my computer, for machine learning purposes (basically, I'm going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML for all of these web pages?
I want to be able to see which URLs I have already scraped, so that I don't scrape them again. For each piece of HTML (1 piece of HTML = all HTML extracted from one URL), I may also want to see which URL it came from (though this may turn out to be an unnecessary requirement). I also want to store the timestamp of when each page was created, so I can read the pages in chronological order (I can extract the timestamp when I download each page), and possibly other metadata. I expect the total size of the HTML to reach many GB, if not TB, and speed (both for reading and scraping) is a high priority.
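To make the metadata requirement concrete, this is roughly the kind of index I imagine keeping alongside the HTML itself. The SQLite schema and column names below are just my own guess at what would be needed, not a settled design:

```python
import sqlite3

# A small index that answers "have I scraped this URL already?" and
# "in what order were the pages created?" without touching the HTML itself.
con = sqlite3.connect("scrape_index.sqlite")
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url        TEXT PRIMARY KEY,   -- the page's URL
        created_ts REAL NOT NULL,      -- timestamp extracted at download time
        html_ref   TEXT NOT NULL       -- where the HTML is stored (file name, offset, ...)
    )
""")
con.execute("CREATE INDEX IF NOT EXISTS idx_created ON pages (created_ts)")

def already_scraped(url: str) -> bool:
    # Skip URLs that are already in the index.
    return con.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone() is not None

def urls_in_chronological_order() -> list[str]:
    # Read pages back in the order they were created.
    return [row[0] for row in con.execute("SELECT url FROM pages ORDER BY created_ts")]
```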
In an ideal world, I would be able to just use the URLs as file names, have one file for each piece of HTML, and store all files in one folder. But I'm not sure this is such a good idea, or even possible, for several reasons:
- URLs contain characters that are not allowed, or at least not suitable, in file names (e.g. '/'). It may be possible to hash each URL and let the hash be the file name, but then I can't tell which URL the HTML came from just by looking at the file name (see the sketch after this list).
- I'm not sure the operating system likes having millions of files in one folder, because of the file I/O overhead. Maybe grouping a number of pieces of HTML into each file would be a better approach.
- I'm not sure whether storing each piece of HTML code in a separate file is optimal when it comes to file reading.
- It's not obvious how to store the metadata for each piece of HTML code.
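To illustrate the hashing idea from the first bullet: something like the sketch below is what I had in mind, where the file name is a hash of the URL and the URL itself lives only in the index above, so the mapping isn't lost. The directory fan-out and function names are just my assumptions, not something I've tested at scale:

```python
import hashlib
from pathlib import Path

ROOT = Path("pages")  # assumed root folder for the scraped HTML

def path_for(url: str) -> Path:
    """Derive a filesystem-safe path from a URL by hashing it.

    The first hex characters of the digest are used as subdirectories,
    so no single folder ends up holding millions of files.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return ROOT / digest[:2] / digest[2:4] / f"{digest}.html"

def save_html(url: str, html: str) -> Path:
    path = path_for(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path  # store str(path), or just the digest, as html_ref in the index
```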