How to scrape a table from a webpage?
I need to scrape a table from a webpage and put it into a pandas DataFrame, but I have not been able to do it. Let me first show how the table is encoded in the HTML document.
<tbody>
<tr>
<th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td>
<td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<th colspan="7">Arizona</th>
</tr>
<tr>
<td>Pinal Energy, LLC</td>
<td>Maricopa, AZ</td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td>2012-07-01</td>
<td>2014-03</td>
</tr>
<tr>
<td colspan="2"><strong>Arizona Total</strong></td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td></td>
<td></td>
</tr>
<tr>
The body of the table is enclosed in <tbody>...</tbody>. Each <tr>...</tr> is a row of the table. Within each row, that is, within each <tr>...</tr> pair, each column is given by a cell such as <td>50.0</td>.
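As a rough sketch of how this row-and-cell structure could be walked with BeautifulSoup: the snippet below parses a small inline copy of the excerpt (wrapped in a `<table>` element, which is an assumption about the real page) and collects the text of every `th` and `td` cell, row by row. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of part of the excerpt, so the sketch runs without a network call.
# The surrounding <table> element is assumed; the real page may differ.
html = """
<table><tbody>
<tr><th colspan="7">Arizona</th></tr>
<tr><td>Pinal Energy, LLC</td><td>Maricopa, AZ</td><td>50.0</td><td>50.0</td>
<td>NA</td><td>2012-07-01</td><td>2014-03</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Each row is a list of cell texts; header cells (th) and data cells (td) are both kept.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

for row in rows:
    print(row)
```

The state header row comes back as a single-element list, while the data row has seven cells, so the rows are ragged until header and total rows are handled separately.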
Here are my questions:
1) How do I scrape it? I am using BeautifulSoup and requests for this purpose, as well as the pandas module. I tried the following:
r = requests.get(url)
bs = BeautifulSoup(r.text)
info = bs.findALL('tr','td')
....
....
But it is giving me this error:
TypeError                                 Traceback (most recent call last)
<ipython-input-24-32d9483e2c59> in <module>()
      1 bs = BeautifulSoup(r.text)
----> 2 info = bs.findALL('tr','td')
      3 #print bs
TypeError: 'NoneType' object is not callable
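One likely reading of this traceback, offered as a sketch rather than a confirmed diagnosis: BeautifulSoup treats an unknown attribute name as a tag lookup, so `bs.findALL` searches for a `<findALL>` tag, finds none, and evaluates to `None`; calling it then raises the `TypeError`. The method is spelled `find_all` (or `findAll` in older code). Separately, a second positional string argument is interpreted as a CSS class, so `find_all('tr', 'td')` would look for `<tr class="td">` rather than for cells inside rows. The inline HTML below is a stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real page would come from requests.get(url).text.
html = "<table><tr><td>50.0</td><td>NA</td></tr></table>"
bs = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as tag lookups, so this is None, not a method.
not_a_method = bs.findALL

# Correct spelling: find the rows first, then the cells inside each row.
cells = [td.get_text(strip=True)
         for tr in bs.find_all("tr")
         for td in tr.find_all("td")]

# A second positional string is matched against the class attribute,
# so this searches for <tr class="td"> and finds nothing here.
class_match = bs.find_all("tr", "td")

print(not_a_method, cells, class_match)
```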
2) I need to skip some of the rows based on the text in them. For example, I don't want to read in any row in which the word 'Total' appears (as in <th colspan="2">United States Total<strong>**</strong></th>). How do I do that? It is not extremely important, as I can get rid of those rows later, but skipping them while reading the data is ideally what I need.
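A minimal sketch of one way such filtering might look, assuming the rows are read with BeautifulSoup and then handed to pandas. It skips any row whose text contains 'Total' and, as an extra assumption not stated above, also skips rows with no `td` cells (such as the state header row). The inline HTML is a copy of the excerpt wrapped in an assumed `<table>` element, and the DataFrame is left with default integer column labels because the real column names are not shown.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline copy of the excerpt; the enclosing <table> is assumed.
html = """
<table><tbody>
<tr><th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td><td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td><td></td><td></td></tr>
<tr><th colspan="7">Arizona</th></tr>
<tr><td>Pinal Energy, LLC</td><td>Maricopa, AZ</td><td>50.0</td><td>50.0</td>
<td>NA</td><td>2012-07-01</td><td>2014-03</td></tr>
<tr><td colspan="2"><strong>Arizona Total</strong></td>
<td>50.0</td><td>50.0</td><td>NA</td><td></td><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    # Skip summary rows such as "United States Total" and "Arizona Total".
    if "Total" in tr.get_text():
        continue
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Rows without td cells (for example, the state header) are skipped too.
    if cells:
        records.append(cells)

df = pd.DataFrame(records)
print(df)
```

Matching on the substring 'Total' would also drop a row whose plant or city name happened to contain that word, so a stricter test may be needed on the full table.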
I know it is a long post, but if someone can help me with it, I would greatly appreciate it. Please let me know if more information is needed.
Thanks much.