Scraping financial web data

I recently started working as a data scientist, and I am starting a web scraping and NLP project in Python. The idea is to create a program that searches for public information on the company's clients. This information can come from various sources: annual reports, income statements, articles, and so on. I will have to deal with two formats: HTML and PDF. For now I will focus on retrieving the revenue of the company. After a month of research and tests, I realized a few things:

- NLP techniques are too slow to be used on annual reports

The first step of the project will be the following:

Search for the annual report and scrape the HTML code: so far I have managed to get all the Google results, and I'm using BeautifulSoup to get the HTML. However, I can't reliably get the company's revenue because each website has its own HTML structure. I first decided to focus on extracting tables (the goal is to find the company's income statement), but I realized that HTML tables are often used for layout (even though that is bad practice). I can't rely on CSS selectors because I need to keep the approach as generic as possible. How can I achieve this?

Topic web-scraping nlp data-mining



I too would fall back on parsing either the HTML or the entities with regular expressions. In my experience, though, this gets inelegant quickly.

  • Do you have a reasonably clear idea of the relevant sources? If most of the relevant data comes from a limited number of pages, you could maintain a list of sources with a matching wrapper for each.
  • Then, within those relevant documents, I would look for the least complex, most valuable features to extract.
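As a rough illustration of the "list of sources with matching wrappers" idea, here is a minimal sketch. The host name `investor.example.com`, the helper `make_regex_wrapper` and the registry `WRAPPERS` are all hypothetical names made up for this example; a real project would register one wrapper per source it actually trusts.

```python
import re
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

Wrapper = Callable[[str], Optional[str]]


def make_regex_wrapper(pattern: str) -> Wrapper:
    """Build a wrapper that returns group 1 of the first match, or None."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def wrapper(text: str) -> Optional[str]:
        match = compiled.search(text)
        return match.group(1) if match else None

    return wrapper


# Hypothetical registry: host name -> wrapper that knows that site's wording.
WRAPPERS: Dict[str, Wrapper] = {
    "investor.example.com": make_regex_wrapper(r"revenues?\s\w+\s(\$\d+\.?\d\s\w+)"),
}


def extract_revenue(url: str, text: str) -> Optional[str]:
    """Dispatch to the wrapper registered for the URL's host, if there is one."""
    host = urlparse(url).netloc.lower()
    wrapper = WRAPPERS.get(host)
    return wrapper(text) if wrapper else None
```

Pages from hosts that are not in the registry simply return `None`, which makes it easy to log them and decide later whether they deserve a wrapper of their own.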

Example

For instance, if you were interested in Alphabet's quarterly reports, I would scrape this link. You're smart, you can figure out the next one.

A quick glance tells me that the first match for "revenue(s)" followed by a dollar amount gives the revenue for the quarter.

So something like this:

(?:revenue[s]?)(?:\s[\w]+\s)(\$[\d]+\.?\d\s[\w]+)

Testing that on the reports on the site, it seems to work for Q1, Q2 and Q3, while Q4 yields the annual revenue. Easy enough to fix.
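In Python, applying that pattern could look roughly like the sketch below. It assumes the page has already been reduced to plain text (for example with BeautifulSoup's `get_text()`); the function name `first_revenue` and the sample sentence are made up for illustration and are not taken from a real report.

```python
import re

# The pattern from above, compiled case-insensitively so "Revenues" also matches.
REVENUE_PATTERN = re.compile(
    r"(?:revenue[s]?)(?:\s[\w]+\s)(\$[\d]+\.?\d\s[\w]+)",
    re.IGNORECASE,
)


def first_revenue(page_text):
    """Return the first '$<amount> <unit>' found after 'revenue(s)', or None."""
    match = REVENUE_PATTERN.search(page_text)
    return match.group(1) if match else None


# Made-up sample text, only to show the call.
sample = "Example Corp reported revenues of $12.3 billion for the quarter."
print(first_revenue(sample))
```

Note that the pattern as written expects exactly one digit after the optional decimal point, so an amount such as "$12.34 billion" would not match without loosening that part of the expression.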

My experience is that these patterns hold for a while, and then change. No big deal, just add a couple of tests! For instance: is the result non-empty, and is it in a believable range?
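Those two checks could be sketched as follows. The scale words in `SCALES` and the default bounds (one million to one trillion dollars) are assumptions chosen for illustration, and `parse_amount` and `is_believable` are hypothetical helper names; adjust all of them to the companies you actually track.

```python
import re
from typing import Optional

# Assumed scale words; extend for the wording your sources use.
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def parse_amount(raw: Optional[str]) -> Optional[float]:
    """Turn a string such as '$12.3 billion' into a float, or None if unparseable."""
    if not raw:
        return None
    match = re.fullmatch(r"\$(\d+(?:\.\d+)?)\s(\w+)", raw.strip())
    if match is None:
        return None
    scale = SCALES.get(match.group(2).lower())
    if scale is None:
        return None
    return float(match.group(1)) * scale


def is_believable(raw: Optional[str], low: float = 1e6, high: float = 1e12) -> bool:
    """Check that the result is not empty and falls within [low, high] dollars."""
    amount = parse_amount(raw)
    return amount is not None and low <= amount <= high
```

Running `is_believable` on every scraped value gives an early warning when a site changes its wording and the pattern starts returning nothing, or something that is not a revenue figure.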
