Periodically executing a scraping script with Python

Here is my idea and my early work.

My target

  • Continuously fetch 1-hour-resolution air pollution data published by China's government.
  • The website's data, collected from monitoring stations across the country, is updated every hour.

My Code

Now, I can grab the useful information for a single hour. Here is my code:

  1. Request the API URLs for the different pollutants (CO, NO2, PM10, etc.)

    import urllib

    html_co = urllib.urlopen("http://www.pm25.in/api/querys/co.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_no2 = urllib.urlopen("http://www.pm25.in/api/querys/no2.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_pm10 = urllib.urlopen("http://www.pm25.in/api/querys/pm10.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    
  2. Parse the content of each response.

    from bs4 import BeautifulSoup

    soup_co = BeautifulSoup(html_co)
    soup_no2 = BeautifulSoup(html_no2)
    soup_pm10 = BeautifulSoup(html_pm10)
    
  3. Extract the useful information from the whole content.

    import json

    l = soup_co.p.get_text()
    co = json.loads(l)
    l = soup_no2.p.get_text()
    no2 = json.loads(l)
    l = soup_pm10.p.get_text()
    pm10 = json.loads(l)
    
  4. Pack the raw data into a tidy pandas.DataFrame.

    data = {"time":[],"station":[],"code":[],"co":[],"no2":[],"pm10":[]}
    for i in range(0,len(pm10)-1,1):
        ## 'station' is the monitor station's name in Chinese
        data["station"].append(co[i]["position_name"])
        data["time"].append(co[i]["time_point"])
        data["co"].append(co[i]["co"])
        ## 'code' is the monitor station's index
        data["code"].append(co[i]["station_code"])
        data["no2"].append(no2[i]["no2"])
        data["pm10"].append(pm10[i]["pm10"])
    

My result

Some pre-explanation

  • Ignore the Chinese characters in the table.
  • I only grab one city's data (Beijing here); indices 0-11 indicate that Beijing has 12 monitoring stations.
  • The columns 'co'/'no2'/'pm10' hold the concentrations of those air pollutants.

My problem

Right now I can grab the web data manually with the code above, but I want the workflow below to run automatically every hour.

Hour i

  • Execute the code

  • (1) Grab Hour i's air pollutant data from the website;

  • (2) Save the data into a .csv named after the actual date (like 2016-01-01.csv)

After one hour.

  • Execute the code

  • (1) Grab Hour i+1's air pollutant data from the website;

  • (2) Save the data into a .csv named after the actual date:
    if it is still the same day as Hour i --> append to the same .csv (like 2016-01-01.csv);
    if the day has rolled over --> create a new .csv (like 2016-01-02.csv).

I haven't done this kind of thing before. Can somebody offer me some advice, so that I can get a useful data scraping tool running in the background without having to worry about it?

Topics: scraping, dataset, python

Category: Data Science


I would wrap what you currently have in a function. For instance:

def write_excel():
    # your existing code

Then you can use BlockingScheduler from the apscheduler library to run your script on a recurring schedule. On Windows, if you change the extension from .py to .pyw and run the .pyw file, it will keep running in the background until you close it via Task Manager.

from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()
scheduler.add_job(write_excel, 'interval', hours=1)
scheduler.start()
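
For concreteness, here is a minimal sketch of how the function and the scheduler could fit together. `build_dataframe` is a hypothetical stand-in for steps 1-4 of the question, and the date-based file name follows the daily naming the question asks for:

    import datetime
    import os
    import pandas as pd
    from apscheduler.schedulers.blocking import BlockingScheduler

    def build_dataframe():
        # hypothetical placeholder for steps 1-4 of the question
        return pd.DataFrame({"time": [], "station": [], "code": [],
                             "co": [], "no2": [], "pm10": []})

    def write_excel():
        df = build_dataframe()
        # one file per calendar day, e.g. 2016-01-01.csv
        fname = datetime.datetime.now().strftime('%Y-%m-%d') + '.csv'
        # write the header only when the day's file does not exist yet
        df.to_csv(fname, mode='a', header=not os.path.exists(fname), index=False)

    scheduler = BlockingScheduler()
    scheduler.add_job(write_excel, 'interval', hours=1)   # fire once an hour
    scheduler.start()                                      # blocks until interrupted

Start the script once and it keeps calling write_excel every hour until you stop it.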

As Emre already put it nicely, you can use cron for the job.

Having said that, I would suggest you use one of these:

  • Luigi
  • Airflow

These are dependency-management and scheduling libraries for Python; they are easy to use and much more capable than vanilla cron.

Since I trigger hundreds of Airflow scripts every day in production, I can vouch for its ease of use and usefulness.

Here is a screenshot from my machine of a very basic Airflow script running:

[screenshot of a basic Airflow DAG run]

Here the tasks depend on each other. For example, the second task depends on the first one, so if the first task fails, the second will not run, and you don't have to fire-fight with the code/database every time.

So if your analytics task depends on the scripting task, Airflow makes sure both run in the intended order.

In addition, you can do the following (a minimal DAG sketch follows the list):

  1. Set the Airflow script to run every hour automatically for you
  2. Trigger emails whenever an error occurs (very helpful when the tasks run while you aren't watching them)
  3. Get a neat visualization of failed, running and completed tasks
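
To make that concrete, here is a minimal sketch of an hourly Airflow DAG with two dependent tasks. It assumes Airflow 1.x-style imports; `fetch_air_quality`, `run_analytics`, the DAG id and the email address are hypothetical placeholders, not from the question:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def fetch_air_quality():
        # placeholder for the scraping / CSV-writing code from the question
        pass

    def run_analytics():
        # placeholder for any downstream processing
        pass

    default_args = {
        "owner": "me",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        "email": ["you@example.com"],      # dummy address
        "email_on_failure": True,          # mail me when a task errors out
    }

    dag = DAG(
        "air_quality_hourly",
        default_args=default_args,
        start_date=datetime(2016, 1, 1),
        schedule_interval="@hourly",       # run once every hour
    )

    fetch = PythonOperator(task_id="fetch", python_callable=fetch_air_quality, dag=dag)
    analyze = PythonOperator(task_id="analyze", python_callable=run_analytics, dag=dag)

    fetch >> analyze                       # analyze only runs after fetch succeeds

The `fetch >> analyze` line is what encodes the dependency described above: if fetch fails, analyze is never started.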

Helpful reading:

My answer for DAG based script execution scheduler/planner with GUI; similar to airbnb's airflow or BODS or Informatica


If you are on a Unix-based OS, use a scheduler such as cron to run your script periodically; cron is the most common choice and there are plenty of tutorials for it. I guess this is the part you are unfamiliar with.

When cron executes your script, have the script derive the file name from the current time, like so: fname = datetime.datetime.now().strftime('%Y-%m-%d-%H.csv'). Open the appropriate file in append mode (with open(fname, 'a') as csv_file); if it does not exist, it will be created. That's all there is to it!
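
Here is a minimal sketch of the script cron could invoke, with a hypothetical fetch_rows() standing in for the scraping code from the question. Dropping the %H from the format string gives one file per day, matching what the question asks for:

    import csv
    import datetime

    def fetch_rows():
        # hypothetical placeholder: return the scraped records as lists of values
        return []

    fname = datetime.datetime.now().strftime('%Y-%m-%d') + '.csv'   # one file per day
    with open(fname, 'a') as csv_file:       # created on the first run of the day
        writer = csv.writer(csv_file)
        writer.writerows(fetch_rows())

    # Example crontab entry (paths are illustrative, edit to match your system),
    # running at minute 0 of every hour:
    #   0 * * * * /usr/bin/python /path/to/scrape_air_quality.py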
