Periodically executing a scraping script with Python
Here is my idea and my early work.
My target
- Fetch 1-hour-resolution air pollution data from the Chinese government's website continuously.
- The website's data, collected from monitoring sites across the country, are updated every hour.
My Code
Now, I can grab the useful information for a single hour. Here is my code:
Build the request URLs for the different pollutants (co, no2, pm10, etc.). Note the "&" that separates the city and token parameters:

    import urllib
    import json
    from bs4 import BeautifulSoup

    html_co = urllib.urlopen("http://www.pm25.in/api/querys/co.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_no2 = urllib.urlopen("http://www.pm25.in/api/querys/no2.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_pm10 = urllib.urlopen("http://www.pm25.in/api/querys/pm10.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
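Since the three requests differ only in the pollutant name, they could also be folded into one helper. A minimal sketch (using Python 3's `urllib` rather than the Python 2 `urllib.urlopen` above, and the same city/token parameters; the function names `build_url` and `fetch` are my own):

```python
# Building the query string with urlencode means the "&" separators
# between parameters cannot be forgotten.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://www.pm25.in/api/querys/{}.json"
PARAMS = {"city": "beijing", "token": "5j1znBVAsnSf5xQyNQyq"}

def build_url(pollutant):
    """URL for one pollutant's hourly data."""
    return BASE.format(pollutant) + "?" + urlencode(PARAMS)

def fetch(pollutant):
    """Download and parse one pollutant's records (makes a network call)."""
    body = urlopen(build_url(pollutant)).read().decode("utf-8")
    return json.loads(body)

# usage (one request per pollutant):
# records = {p: fetch(p) for p in ("co", "no2", "pm10")}
```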
Get the content of the html doc.
soup_co = BeautifulSoup(html_co) soup_no2 = BeautifulSoup(html_no2) soup_pm10 = BeautifulSoup(html_pm10)
Extract the useful information from the whole content:

    l = soup_co.p.get_text()
    co = json.loads(l)
    l = soup_no2.p.get_text()
    no2 = json.loads(l)
    l = soup_pm10.p.get_text()
    pm10 = json.loads(l)
Tidy the raw data into a neat pandas DataFrame:

    data = {"time": [], "station": [], "code": [], "co": [], "no2": [], "pm10": []}
    # the last entry of the response is excluded here
    for i in range(len(pm10) - 1):
        # 'station' is the monitoring station's name in Chinese
        data["station"].append(co[i]["position_name"])
        data["time"].append(co[i]["time_point"])
        data["co"].append(co[i]["co"])
        # 'code' is the monitoring station's index
        data["code"].append(co[i]["station_code"])
        data["no2"].append(no2[i]["no2"])
        data["pm10"].append(pm10[i]["pm10"])
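Once the dict is filled, it can be turned into a DataFrame and appended to a per-day CSV in one step. A sketch, assuming the `time_point` strings start with the date in `YYYY-MM-DD` form (the function name `save_hourly` is hypothetical):

```python
import os
import pandas as pd

def save_hourly(data, out_dir="."):
    """Append one hour of records to a CSV named after the calendar day."""
    df = pd.DataFrame(data, columns=["time", "station", "code", "co", "no2", "pm10"])
    # assumption: time values look like "2016-01-01T13:00:00Z"; keep the date part
    day = str(df["time"].iloc[0])[:10]
    path = os.path.join(out_dir, day + ".csv")
    # write the header only when the file is created for the first time
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)
    return path
```

Because the file name comes from the data's own timestamp, a new day automatically starts a new file, and repeated calls within the same day keep appending to the same one.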
My result
Some pre-explanation
- Ignore the Chinese characters in the table.
- I only grab one city's data (Beijing here); the index running from 0 to 11 shows that Beijing has 12 monitoring sites.
- The columns 'co'/'no2'/'pm10' hold the concentrations of these air pollutants.
My problem
Right now I can grab the web data manually with the code above, but I want the workflow below to run automatically every hour.
Hour i
- Execute the code:
  (1) grab hour i's air-pollutant data from the website;
  (2) save the data into a .csv named after the actual date (like 2016-01-01.csv).
One hour later (hour i+1)
- Execute the code:
  (1) grab hour i+1's air-pollutant data from the website;
  (2) save the data into a .csv named after the actual date:
  - if it's still the same day as hour i --> append to the same .csv (like 2016-01-01.csv);
  - if the day has passed --> create a new .csv (like 2016-01-02.csv).
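One simple way to get this behaviour is a long-running Python process that sleeps until a few minutes past each hour (so the site has had time to publish the new hour's data) and then calls the grab-and-save step. A sketch, where `grab_one_hour` is a hypothetical stand-in for the fetching and saving code above; an alternative is to let the operating system's scheduler (cron on Linux, Task Scheduler on Windows) run the script hourly instead:

```python
import time
from datetime import datetime

def next_run_delay(now=None, minute=5):
    """Seconds to sleep until `minute` past the next hour."""
    now = now or datetime.now()
    seconds_into_hour = now.minute * 60 + now.second
    delay = minute * 60 - seconds_into_hour
    if delay <= 0:
        delay += 3600  # target already passed this hour; wait for the next one
    return delay

def main():
    while True:
        time.sleep(next_run_delay())
        try:
            grab_one_hour()  # hypothetical: fetch + append to the day's .csv
        except Exception as exc:
            # a failed hour should not kill the collector; log and keep going
            print("hour skipped:", exc)

# to start the collector as a background process: main()
```

Run with something like `nohup python scraper.py &` (Linux) so it keeps running after you log out; a cron entry such as `5 * * * * python /path/to/scraper.py` achieves the same hourly cadence without the sleep loop.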
I haven't done this kind of thing before. Can somebody offer me some advice, so that I can get a useful data-scraping tool running in the background without having to worry about it?