Machine learning for predicting HTML Elements on a web page?

My goal is to implement an assistant for crawling web data, aimed at users who don't understand anything about HTML or the DOM. I will show a web page to the user, and the user selects the data he is interested in on the page (or the data he is not interested in).

Example: if the user clicks on a cell inside a table, it is very likely he wants to extract all elements in that column. But he might be interested only in the table row, or only in that one cell. So the algorithm proposes three selectors, one for each possibility. The user can choose one of the proposed selectors, or click on another element to get a new proposal.

So much for the use case. The component I want to create:

  • takes as input a list of elements inside a DOM (the hierarchy of HTML elements) that the user wants to crawl

  • outputs a list of selectors, ranked by probability, that match 100% of the user's selection and possibly include other elements the user is probably interested in.

The problem here is the high variability of websites, and the fact that similar-looking results can be produced by very differently structured elements. Manually written prediction rules can therefore probably cover the most basic uses (like getting data from a table column), but will fail when it comes to finding article parts on a news page.

So, finally, my questions. I am thinking about the following processing:

  1. first generate candidate selectors that fit the user's selection

  2. discard the selectors that do not match 100% of the elements the user chose

  3. use a (machine learned?) model to predict which of the remaining selectors the user is most likely interested in
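
To make steps 1 and 2 concrete, here is a minimal sketch of what I have in mind. It is only an illustration under several assumptions: the page is well-formed enough to parse with Python's `xml.etree.ElementTree`, selectors are simple XPath-like paths, and candidates are produced by keeping or dropping the positional index at each step of the path to the first clicked element. The sample table and the function names `build_steps` and `candidate_selectors` are made up for this example. The final sort by match count is just a placeholder for the real ranking in step 3.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Made-up sample page for illustration only.
HTML = """
<html><body><table>
<tr><td>Alice</td><td>30</td></tr>
<tr><td>Bob</td><td>25</td></tr>
<tr><td>Carol</td><td>41</td></tr>
</table></body></html>
""".strip()


def build_steps(root, element):
    """Return [(tag, 1-based index among same-tag siblings), ...] from root down to element."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = element
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node) + 1))
        node = parent
    return list(reversed(steps))


def candidate_selectors(root, selected):
    """Step 1: generate candidates. Step 2: keep only those matching every selected element."""
    steps = build_steps(root, selected[0])
    best = {}  # matched element set -> shortest selector producing it
    for keep in product([True, False], repeat=len(steps)):
        path = "./" + "/".join(
            f"{tag}[{i}]" if k else tag for (tag, i), k in zip(steps, keep)
        )
        matched = root.findall(path)
        if not all(any(m is s for m in matched) for s in selected):
            continue  # does not cover 100% of the user's selection
        key = tuple(id(m) for m in matched)
        if key not in best or len(path) < len(best[key]):
            best[key] = path
    # Placeholder ordering: most specific (fewest matches) first.
    return sorted(best.values(), key=lambda p: len(root.findall(p)))


root = ET.fromstring(HTML)
cells = root.findall(".//td")
clicked = cells[1]  # the user clicks the "30" cell
for selector in candidate_selectors(root, [clicked]):
    print(selector, [e.text for e in root.findall(selector)])
```

If the user additionally clicks the "25" cell, the cell-only and row-only candidates no longer cover the whole selection and are dropped, leaving the column and the whole-table selectors.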

Do you think the above processing will be productive for this problem?

What algorithm could be most suitable for ranking the selectors in 3.?

Is there an idea on how to auto-magically come up with a list of selectors for the selected elements?

I hope I am making sense. I do not have a very strong background in data analysis and machine learning, so I am hoping for some pointers on where to look.

Topic hierarchical-data-format predictive-modeling data-mining machine-learning

Category Data Science


There are at least 2 distinct problems:

  1. Understanding the non-programmer's intention for the webpage's content
  2. Understanding the webpage's content

The 1st problem is not best solved with data science or machine learning; hard-coded rules will work better. The 2nd problem is mostly a software engineering problem. I would look into existing solutions for inspiration, for example scraper or import.io.

If you still want to sort and present the "selectors" to users, that is most frequently called a "learning to rank" problem. Learning to rank problems are common in information retrieval systems / search engines.
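
As a rough illustration of the pairwise flavour of learning to rank, here is a minimal perceptron-style sketch in plain Python. Everything in it is an assumption made for the example: the XPath-like selector strings, the two hand-picked features, and the single invented "session" in which a user preferred the column selector. A real system would need logged user choices and richer features, and would more likely use an established ranking library.

```python
def features(selector):
    """Two hypothetical features: does the last / second-to-last step use an index?"""
    steps = selector.split("/")
    return [float(steps[-1].endswith("]")), float(steps[-2].endswith("]"))]


def score(weights, selector):
    """Linear score; higher means 'more likely what the user wants'."""
    return sum(w * f for w, f in zip(weights, features(selector)))


def train_pairwise(sessions, epochs=20):
    """sessions: list of (chosen_selector, [rejected_selectors]).

    Perceptron update on feature differences: whenever a rejected selector
    scores at least as high as the chosen one, move the weights toward the
    chosen selector's features.
    """
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in sessions:
            for other in rejected:
                diff = [c - o for c, o in zip(features(chosen), features(other))]
                if sum(w * d for w, d in zip(weights, diff)) <= 0:
                    weights = [w + d for w, d in zip(weights, diff)]
    return weights


# Invented toy data: four candidate selectors for one clicked table cell.
candidates = [
    "./body/table/tr[1]/td[2]",  # only the clicked cell
    "./body/table/tr[1]/td",     # the whole row
    "./body/table/tr/td[2]",     # the whole column
    "./body/table/tr/td",        # every cell in the table
]
column = "./body/table/tr/td[2]"
sessions = [(column, [c for c in candidates if c != column])]

weights = train_pairwise(sessions)
ranked = sorted(candidates, key=lambda s: score(weights, s), reverse=True)
print(ranked)
```

The model only learns a weight per feature, so it can generalise the preference "index on the cell, no index on the row" to tables on other pages, but it cannot express anything the features do not capture.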
