Parsing and storing a large amount of HTML data
I have a data chunk (~30k) in which HTML pages and PNGs are saved in folders, one per website; the folders are named after randomly generated hashes. My supervisor wants me to crunch through this data, extract some attributes from each HTML page, and store them in a DB for future use. The attributes to be extracted are the page title and the copyright section of the HTML. As I understand it, this data is unstructured because there is no relation, per se, between the folders for now. There is some inherent structure (the HTML itself), but each page is essentially disjoint from the rest, which I think qualifies it as unstructured. Please correct me if I am wrong here.
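For the extraction step, here is a minimal sketch of what I have in mind, using only Python's standard-library `html.parser` (the copyright heuristic is a guess on my part, not a settled spec):

```python
import re
from html.parser import HTMLParser

class AttributeExtractor(HTMLParser):
    """Collects the <title> text and any text node that looks like a copyright line."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.copyright = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        # Crude heuristic: keep text containing "copyright", the (c) sign, or the symbol.
        elif re.search(r"copyright|\u00a9|\(c\)", data, re.IGNORECASE):
            self.copyright = data.strip()

def extract_attributes(html_text):
    """Return the two attributes we need to store for one HTML page."""
    parser = AttributeExtractor()
    parser.feed(html_text)
    return {"title": parser.title.strip(), "copyright": parser.copyright}
```

For the real data I would probably switch to a more forgiving parser (the pages in the dump are not guaranteed to be well-formed), but the shape of the output would stay the same.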
My manager wants the data stored in an ELK stack. What "storing" means is still unclear at this point, but so far he wants the whole HTML file, the title, and the copyright extracted and stored for each single HTML file. This brings me to my first concern, which I need help with.
- Is it a good idea to store the whole HTML file in the DB? I am of the opinion that we should place the HTML files in centralized storage on some kind of file system and store the absolute path of each file against its entry in the DB (we are already doing the same thing for the PNGs, by the way).
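To make that concrete, the record I would store per page would look roughly like this (the field names, the `page.html` layout, and the hash-based id are my own invention, not an agreed schema):

```python
import hashlib
import os

def build_doc(folder, title, copyright_text):
    """Build the per-page record: store only the absolute path to the raw
    HTML on the shared file system, not the HTML body itself."""
    # Assumed layout: each hash-named folder contains a file called page.html.
    html_path = os.path.abspath(os.path.join(folder, "page.html"))
    return {
        "id": hashlib.sha1(html_path.encode()).hexdigest(),
        "html_path": html_path,   # pointer to the file, instead of the blob
        "title": title,
        "copyright": copyright_text,
    }
```

The same dict could then be indexed into Elasticsearch or inserted as a row into MySQL; only the pointer travels, so the DB stays small.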
I haven't worked with the ELK stack, and I thought this would be a good learning opportunity. From online tutorials, I have learned that it is essentially for parsing logs from different application servers and then storing and visualizing them in a presentable and searchable manner.
- If anyone can comment on whether ELK will work in my case, that would be very helpful.
So far the end objective is to crunch through this data, store the attributes, and, when required, search through them and use them as future needs dictate. For example, if a specific copyright text comes up very frequently, retrieve that text and use it to classify a certain pattern. This brings me to my third and last question.
- Will it help to store the data in a non-relational database and then query accordingly? In my opinion, an RDBMS like MySQL is a better contender, because it will be easy to search the tables for a specific type of title and then use it accordingly. The end goal is not visualization but having the data at hand to use whenever required.
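To show what I mean by the relational route, here is the frequency query I have in mind, using in-memory SQLite as a stand-in for MySQL (table and column names are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the proposed MySQL table of extracted attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, html_path TEXT, title TEXT, copyright TEXT)"
)
rows = [
    ("/data/ab12/page.html", "Home",  "(c) 2014 Acme Inc"),
    ("/data/cd34/page.html", "About", "(c) 2014 Acme Inc"),
    ("/data/ef56/page.html", "News",  "(c) 2013 Widget Co"),
]
conn.executemany(
    "INSERT INTO pages (html_path, title, copyright) VALUES (?, ?, ?)", rows
)

# Which copyright strings come up most frequently?
frequent = conn.execute(
    "SELECT copyright, COUNT(*) AS n FROM pages GROUP BY copyright ORDER BY n DESC"
).fetchall()
```

My understanding is that Elasticsearch can answer the same question with a terms aggregation on a keyword field, so this particular use case may not decide between the two by itself.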
Topic parsing bigdata data-mining
Category Data Science