Machine learning for predicting HTML elements on a web page?
My goal is to implement an assistant for crawling web data, aimed at users who know nothing about HTML or the DOM. I will show a web page to the user, and the user selects the data on the page they are interested in (or the data they are not interested in).
Example: If the user clicks on a cell inside a table, it is very likely that they want to extract all elements in that column. They might instead be interested only in the table row, or only in that one cell. So the algorithm proposes three selectors, one for each possibility. The user can either choose one of the proposed selectors or click on another element to get a new set of proposals.
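To make the table example concrete, here is a minimal sketch of the three proposals. It is only an illustration under several assumptions: the page is well-formed enough to parse with Python's `xml.etree.ElementTree`, the selectors are written in ElementTree's limited XPath subset rather than CSS, and the `propose_selectors` function and the sample table are made up for this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table, small enough to check by hand.
HTML = """
<table>
  <tr><td>Name</td><td>Price</td></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>
"""


def propose_selectors(table, cell):
    """Return (label, path) proposals for a clicked <td> inside `table`.

    The paths use ElementTree's XPath subset and are relative to `table`.
    """
    for row_idx, row in enumerate(table.findall("tr"), start=1):
        for col_idx, candidate in enumerate(row.findall("td"), start=1):
            if candidate is cell:
                return [
                    ("column", f"./tr/td[{col_idx}]"),           # same position in every row
                    ("row", f"./tr[{row_idx}]/td"),              # every cell of the clicked row
                    ("cell", f"./tr[{row_idx}]/td[{col_idx}]"),  # only the clicked cell
                ]
    return []  # the clicked element is not a cell of this table


table = ET.fromstring(HTML)
clicked = table.findall("tr")[1].findall("td")[1]  # simulate a click on "1.20"
proposals = propose_selectors(table, clicked)

for label, path in proposals:
    print(label, path, [e.text for e in table.findall(path)])
```

The order of the returned list is hard-coded here (column first); choosing that order per page and per click is exactly the part I would like a model to handle.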
So much for the use case. The component I want to create:
- takes as input a list of elements inside a DOM (the hierarchy of HTML elements) that the user wants to crawl
- outputs a list of selectors, ranked by probability, each of which matches every element the user selected and may also match other elements the user is probably interested in.
The problem here is the high variability of websites, and the fact that similar-looking results can be produced by very differently structured elements. Manually written prediction rules can therefore probably cover only the most basic uses (like getting data from a table column), and will fail when it comes to finding the parts of an article on a news page.
So, finally, my questions. I am thinking about the following processing:
1. first, generate candidate selectors from the user's selection
2. discard the selectors that do not match every element the user selected
3. use a (machine-learned?) model to predict which of the remaining selectors the user is most likely interested in
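The three steps above can be sketched roughly as follows. This is a toy illustration, not a working solution: it again assumes a page that parses with `xml.etree.ElementTree` and selectors in ElementTree's XPath subset, all function names (`generate`, `covers`, `score`, `propose`) are hypothetical, and the `score` function is a naive placeholder (the fraction of matched elements that the user actually selected) standing in for the model I am asking about.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page.
PAGE = """
<body>
  <div class="nav"><a class="link">Home</a><a class="link">About</a></div>
  <ul class="items">
    <li class="item"><a class="link">One</a></li>
    <li class="item"><a class="link">Two</a></li>
    <li class="item"><a class="link">Three</a></li>
  </ul>
</body>
"""


def path_to(root, elem):
    """Return the elements on the path from below `root` down to `elem`."""
    parent = {child: p for p in root.iter() for child in p}
    chain = []
    node = elem
    while node is not root:
        chain.append(node)
        node = parent[node]
    return chain[::-1]


def step(node, with_class):
    """One path step: the tag name, optionally constrained by its class."""
    cls = node.get("class")
    if with_class and cls:
        return f"{node.tag}[@class='{cls}']"
    return node.tag


def generate(root, selected):
    """Step 1: build candidate selectors from each selected element's path."""
    candidates = set()
    for elem in selected:
        chain = path_to(root, elem)
        for k in range(1, len(chain) + 1):       # last k steps of the path
            for with_class in (False, True):     # with and without class tests
                steps = [step(n, with_class) for n in chain[-k:]]
                candidates.add(".//" + "/".join(steps))
    return candidates


def covers(root, selector, selected):
    """Step 2: does the selector match every selected element?"""
    matched = {id(e) for e in root.findall(selector)}
    return all(id(e) in matched for e in selected)


def score(root, selector, selected):
    """Step 3 placeholder: share of matches that the user selected.

    A learned model would replace this heuristic.
    """
    return len(selected) / len(root.findall(selector))


def propose(root, selected):
    fitting = [s for s in generate(root, selected) if covers(root, s, selected)]
    scored = [(s, score(root, s, selected)) for s in fitting]
    # Highest score first; ties broken by shorter, then alphabetical, selector.
    return sorted(scored, key=lambda p: (-p[1], len(p[0]), p[0]))


root = ET.fromstring(PAGE)
links = root.findall(".//li/a")
selected = links[:2]  # simulate the user clicking "One" and "Two"
ranked = propose(root, selected)

for selector, value in ranked:
    print(f"{value:.2f}  {selector}")
```

In this sketch the selectors that match only the three list links rank above the ones that also match the navigation links. A pure "fewest extra matches" score like this would always prefer the narrowest selector, though, which is why I suspect something learned from user behaviour is needed for step 3.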
- Do you think the processing above is a productive approach to this problem?
- Which algorithm would be most suitable for ranking the selectors in step 3?
- Is there a way to automatically come up with a list of candidate selectors for the selected elements?
I hope I am making sense. I do not have a very strong background in data analysis and machine learning, so I am hoping for some pointers on where to look.