Extracting Data from Complex HTML with Python

Does anybody know a way of extracting data with Python from more convoluted website structures? For example, I tried to extract player data from the ATP profile pages, but it was so complicated that I gave up. I think the pages pull their data from some database through a script, and I suspect that even if I kept trying I wouldn't be able to get it.

I then started using a specialized tool called ParseHub, which lets you pull the data somewhat visually. It's pretty good software, but they make it slow on purpose so that you buy it, and it is not cheap.

Topic scraping python

Category Data Science


I ended up using BeautifulSoup to do the job. The final code is not that clean, as I'm kind of a newbie, but it does what I wanted. You can find the source code and the dataset I've extracted so far in this repo. You can also check out an article I wrote about this on my website: fanaro.com.br.
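
For anyone landing here, this is a minimal sketch of the kind of BeautifulSoup extraction described above. The markup, the class names, and the extract_profile helper are made up for illustration; the real ATP pages are structured differently, and this is not the code from the repo. It assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the actual structure of an ATP profile page.
SAMPLE_HTML = """
<div class="player-profile">
  <h1 class="player-name">Jane Doe</h1>
  <table class="stats">
    <tr><th>Rank</th><td>12</td></tr>
    <tr><th>Age</th><td>27</td></tr>
  </table>
</div>
"""


def extract_profile(html):
    """Return the player name plus every th/td pair of the stats table."""
    soup = BeautifulSoup(html, "html.parser")

    name_tag = soup.select_one("h1.player-name")
    profile = {"name": name_tag.get_text(strip=True) if name_tag else None}

    # Each table row is assumed to hold one label (th) and one value (td).
    for row in soup.select("table.stats tr"):
        header, value = row.find("th"), row.find("td")
        if header and value:
            key = header.get_text(strip=True).lower()
            profile[key] = value.get_text(strip=True)
    return profile


profile = extract_profile(SAMPLE_HTML)
print(profile)
```

If the values you want are not in the HTML you download at all, the page is probably filling them in with JavaScript, and a parser like this will not see them.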


Been there, done that, and it is still hard. For complex HTML sources, shallow feature analysis proved to work best, so a package like Dragnet is a good place to start.

Our final result was a processing chain (built with luigi) in which we could mix, match, and reorder the following text extraction tools per HTML source:

  • Shallow feature extraction (Dragnet)
  • Tag Stripping (Python stdlib)
  • Regex Replace (re module)
  • Html2Text
  • BeautifulSoup
  • Pass Through
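
To make the "mix, match and reorder per source" idea concrete, here is a rough sketch that uses only the standard library. It is not our luigi setup: a plain dict stands in for the task chain, only three of the steps from the list are shown, and the source names and the PIPELINES table are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Tag stripping with the stdlib parser."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def collapse_whitespace(text):
    """Regex replace: squeeze runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def pass_through(text):
    """Leave the input untouched."""
    return text


STEPS = {
    "strip_tags": strip_tags,
    "collapse_whitespace": collapse_whitespace,
    "pass_through": pass_through,
}

# Hypothetical per-source configuration: an ordered list of step names.
PIPELINES = {
    "news-site": ["strip_tags", "collapse_whitespace"],
    "plain-feed": ["pass_through"],
}


def extract_text(source, html):
    """Run the steps configured for this source, in order."""
    text = html
    for step_name in PIPELINES[source]:
        text = STEPS[step_name](text)
    return text


html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello   <b>world</b></p><script>var x = 1;</script></body></html>")
print(extract_text("news-site", html))
```

Reordering or swapping steps for a source is then a one-line change in the configuration, which is what makes the test-and-repeat loop cheap.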

The key is what you are planning to do with the text in the next step. For some tasks you can just carry the tags along in the text (boolean find); for others you cannot (classification between sites). Configure, test, repeat.

But a one-size-fits-all solution is really hard, and I am also not convinced that some of the services can do much beyond Dragnet.
