How to scrape a table from a webpage?

I need to scrape a table from a webpage and put it into a pandas DataFrame, but I have not been able to do it. First, here is how the table is encoded in the HTML document:

<tbody>
<tr>
<th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td>
<td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<th colspan="7">Arizona</th>
</tr>
<tr>
<td>Pinal Energy, LLC</td>
<td>Maricopa, AZ</td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td>2012-07-01</td>
<td>2014-03</td>
</tr>
<tr>
<td colspan="2"><strong>Arizona Total</strong></td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td></td>
<td></td>
</tr>
<tr>

The body of the table is enclosed in <tbody>...</tbody>. Each <tr>...</tr> pair is one row of the table, and within each row every cell is given by a <td>...</td> pair, such as <td>50.0</td>.

Here are my questions:

1) How do I scrape it? I am using BeautifulSoup and requests for this, along with pandas. I tried the following:

r = requests.get(url)
bs = BeautifulSoup(r.text)
info = bs.findALL('tr','td')
  ....
  ....

But it is giving me this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-32d9483e2c59> in <module>()
      1 bs = BeautifulSoup(r.text)
----> 2 info = bs.findALL('tr','td')
      3 #print bs

TypeError: 'NoneType' object is not callable

2) I need to skip some rows based on the text in them. For example, I don't want to read any row in which the word 'Total' appears (as in <th colspan="2">United States Total<strong>**</strong></th>). How do I do that? It is not critical, since I can get rid of those rows later, but ideally I would skip them while reading the data.

I know this is a long post, but I would greatly appreciate any help. Please let me know if more information is needed.

Thanks much.

Topic scraping pandas python

Category Data Science


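The error comes from the method name: BeautifulSoup has no findALL; the method is find_all (findAll in older code). Because BeautifulSoup treats an unknown attribute as a search for a tag with that name, bs.findALL evaluates to None, and calling it raises the TypeError. The following will give you all the values under each <tr>: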

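from bs4 import BeautifulSoup

# data is the page's HTML text, e.g. r.text from the question
bs = BeautifulSoup(data, "lxml")
table_body = bs.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    # collect the stripped text of every <td> cell in this row
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)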
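
For the second part of the question (skipping the 'Total' rows) and for getting the result into pandas, the loop above could be extended roughly as follows. This is only a sketch: the html string is a small sample rebuilt from the markup in the question, the helper name table_rows is made up for illustration, and the column names are placeholders because the real header row is not shown. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup rebuilt from the question; for a real page this would be
# requests.get(url).text instead.
html = """
<table><tbody>
<tr><th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td><td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td><td></td><td></td></tr>
<tr><th colspan="7">Arizona</th></tr>
<tr><td>Pinal Energy, LLC</td><td>Maricopa, AZ</td><td>50.0</td>
<td>50.0</td><td>NA</td><td>2012-07-01</td><td>2014-03</td></tr>
<tr><td colspan="2"><strong>Arizona Total</strong></td><td>50.0</td>
<td>50.0</td><td>NA</td><td></td><td></td></tr>
</tbody></table>
"""


def table_rows(markup):
    """Return the cell text of each data row, skipping 'Total' rows
    and rows that contain only a <th> (the state-name rows)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find("tbody").find_all("tr"):
        if "Total" in tr.get_text():
            continue  # summary row such as "Arizona Total"
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # state-name rows have no <td> cells at all
            rows.append(cells)
    return rows


rows = table_rows(html)

# Placeholder column names (assumption): replace with the table's real headers.
columns = ["company", "location", "col_3", "col_4", "col_5", "date_1", "date_2"]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Every value comes back as a string (including "NA"), so numeric or date columns would still need converting afterwards, for example with pd.to_numeric(..., errors="coerce").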
