Parsing and storing a large amount of HTML data
I have a data chunk (~30k) in which HTML pages and PNGs are saved in folders, one per website; the folders are named after randomly generated hashes. My supervisor wants me to crunch through this data, extract some attributes from each HTML page, and store them in a DB for future use. The attributes to be extracted are the page title and the copyright section of the HTML. As I understand it, this data is unstructured because there is no relation, per se, between the folders for now. There is some inherent structure (the HTML itself), but each page is essentially disjoint from the rest, which I think qualifies it as unstructured. Please correct me if I am wrong here.
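For the extraction step, here is a minimal sketch of what I have in mind, using only Python's standard-library `html.parser` (the copyright heuristic is a guess on my part, not a settled spec):

```python
import re
from html.parser import HTMLParser

class AttributeExtractor(HTMLParser):
    """Collects the <title> text and any text node that looks like a copyright line."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.copyright = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        # Crude heuristic: keep text containing "copyright", the (c) sign, or the symbol.
        elif re.search(r"copyright|\u00a9|\(c\)", data, re.IGNORECASE):
            self.copyright = data.strip()

def extract_attributes(html_text):
    """Return the two attributes we need to store for one HTML page."""
    parser = AttributeExtractor()
    parser.feed(html_text)
    return {"title": parser.title.strip(), "copyright": parser.copyright}
```

For the real data I would probably switch to a more forgiving parser (the pages in the dump are not guaranteed to be well-formed), but the shape of the output would stay the same.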
My manager wants the data stored in an ELK stack. What "storing" means is still unclear at this point, but so far he wants the whole HTML file, the title, and the copyright extracted and stored for each single HTML file. This brings me to my first concern, which I need help with.
- Is it a good idea to store the whole HTML file in the DB? I am of the opinion that we should place the HTML files in centralized storage on some kind of file system and store the absolute path of each file against its entry in the DB (we are already doing the same thing for the PNGs, by the way).
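To make that concrete, the record I would store per page would look roughly like this (the field names, the `page.html` layout, and the hash-based id are my own invention, not an agreed schema):

```python
import hashlib
import os

def build_doc(folder, title, copyright_text):
    """Build the per-page record: store only the absolute path to the raw
    HTML on the shared file system, not the HTML body itself."""
    # Assumed layout: each hash-named folder contains a file called page.html.
    html_path = os.path.abspath(os.path.join(folder, "page.html"))
    return {
        "id": hashlib.sha1(html_path.encode()).hexdigest(),
        "html_path": html_path,   # pointer to the file, instead of the blob
        "title": title,
        "copyright": copyright_text,
    }
```

The same dict could then be indexed into Elasticsearch or inserted as a row into MySQL; only the pointer travels, so the DB stays small.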
I haven't worked with the ELK stack, and I thought this would be a good learning opportunity. From online tutorials, I have learned that it is essentially for parsing logs from different application servers and then storing and visualizing them in a presentable and searchable manner.
- If anyone can comment on whether ELK will work in my case, that would be very helpful.
So far the end objective is to crunch through this data, store the attributes, and, when required, search through them and use them as future needs dictate. For example, if a specific copyright text comes up very frequently, retrieve that text and use it to classify a certain pattern. This brings me to my third and last question.
- Will it help to store the data in a non-relational database and then query accordingly? In my opinion, an RDBMS like MySQL is a better contender, because it will be easy to search the tables for a specific type of title and then use it accordingly. The end goal is not visualization but having the data at hand to use whenever required.
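To show what I mean by the relational route, here is the frequency query I have in mind, using in-memory SQLite as a stand-in for MySQL (table and column names are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the proposed MySQL table of extracted attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, html_path TEXT, title TEXT, copyright TEXT)"
)
rows = [
    ("/data/ab12/page.html", "Home",  "(c) 2014 Acme Inc"),
    ("/data/cd34/page.html", "About", "(c) 2014 Acme Inc"),
    ("/data/ef56/page.html", "News",  "(c) 2013 Widget Co"),
]
conn.executemany(
    "INSERT INTO pages (html_path, title, copyright) VALUES (?, ?, ?)", rows
)

# Which copyright strings come up most frequently?
frequent = conn.execute(
    "SELECT copyright, COUNT(*) AS n FROM pages GROUP BY copyright ORDER BY n DESC"
).fetchall()
```

My understanding is that Elasticsearch can answer the same question with a terms aggregation on a keyword field, so this particular use case may not decide between the two by itself.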
Topic parsing bigdata data-mining
Category Data Science