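Contents

I. Purpose
II. Implementation
1. Analyzing the HTML and URL structure
2. Writing the crawler
2.1 Building the URLs
2.2 Fetching monthly averages
2.3 Daily averages by city
IV. Shortcomings
V. Supplementary material
1. Other air pollution data
2. City latitude and longitude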


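I. Purpose

Scrape monthly and daily average air pollution data since 2013 for 384 cities nationwide from the China Air Quality Online Monitoring and Analysis Platform (中国空气质量在线监测分析平台).
Data source: https://www.aqistudy.cn/historydata/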

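II. Implementation

1. Analyzing the HTML and URL structure

I started by testing the crawler on the monthly data: pick a city on the site and analyze the structure of its monthly-average page.
I chose Beijing. The page lists Beijing's air pollution data for every month since December 2013. Right-clicking any value in the table and inspecting it with Chrome DevTools shows the fully rendered HTML under Elements. There, the dates and pollutant values we need all sit in the table cells (td elements). (At this point I took it for granted that requesting the page would return this complete HTML and that the values could be extracted straight from it. It turned out not to be that simple.)

Having found the data in the HTML, I next checked whether the URLs of the pages to crawl could be composed programmatically. I need monthly and daily averages per city. Beijing's monthly URL is https://www.aqistudy.cn/historydata/monthdata.php?city=北京 and its daily URL is https://www.aqistudy.cn/historydata/daydata.php?city=北京&month=201312. Monthly data takes one page per city; daily data takes one page per city per month. So the only variable part of the monthly URL is the city name, and the daily URL varies only in city and month, both of which are easy to compose.
With the data located in the HTML and the URLs composable, I started writing the program.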
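
The URL composition described above can be sketched roughly as follows. The helper names build_month_url and build_day_url are made up for illustration; the city name is percent-encoded with urllib.parse.quote, which is what a browser does with a non-ASCII query value.

```python
from urllib.parse import quote

BASE = 'https://www.aqistudy.cn/historydata/'

def build_month_url(city):
    # One page per city holds all of its monthly averages
    return '%smonthdata.php?city=%s' % (BASE, quote(city))

def build_day_url(city, month):
    # One page per city and month ('YYYYMM') holds the daily averages
    return '%sdaydata.php?city=%s&month=%s' % (BASE, quote(city), month)

print(build_month_url('北京'))
print(build_day_url('北京', '201312'))
```

Only the city and the month change between requests, so looping over a city list and a month list is enough to produce every URL.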

2. Writing the crawler

2.1 Building the URLs

The list of months to crawl is easy to produce:

def get_month_set():
    # The data starts in December 2013
    month_set = ['201312']
    # Full years 2014 through 2018
    for year in [2014, 2015, 2016, 2017, 2018]:
        for month in ['01', '02', '03', '04', '05', '06',
                      '07', '08', '09', '10', '11', '12']:
            month_set.append('%s%s' % (year, month))
    # The first three months of 2019
    month_set.extend(['201901', '201902', '201903'])
    return month_set
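
As an alternative sketch, the same list can be generated from a start and end month instead of hard-coding the years. The name month_range is hypothetical and not part of the original program.

```python
def month_range(start, end):
    """Return every 'YYYYMM' string from start to end, inclusive."""
    year, month = int(start[:4]), int(start[4:])
    end_year, end_month = int(end[:4]), int(end[4:])
    months = []
    # Tuple comparison orders by year first, then by month
    while (year, month) <= (end_year, end_month):
        months.append('%04d%02d' % (year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

month_set = month_range('201312', '201903')
print(len(month_set), month_set[0], month_set[-1])  # 64 201312 201903
```

Extending the crawl to later months then only means changing the end argument.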
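
The city list takes a bit more work, but every city is listed at https://www.aqistudy.cn/historydata/. I wrote a Python program that fetches that page's HTML, extracts the city names from it with a regular expression, and writes them to a text file.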
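
The original extraction code is not shown, so the following is only a minimal sketch of the idea. It assumes that each city on the index page is a link of the form href="monthdata.php?city=NAME"; that link shape, the pattern CITY_LINK, and the function name extract_cities are assumptions for illustration.

```python
import re

# Assumed link shape: <a href="monthdata.php?city=北京">北京</a>
CITY_LINK = re.compile(r'href="monthdata\.php\?city=([^"&]+)"')

def extract_cities(html):
    """Return the city names in page order, without duplicates."""
    seen = set()
    cities = []
    for name in CITY_LINK.findall(html):
        if name not in seen:
            seen.add(name)
            cities.append(name)
    return cities

# A small hand-written sample standing in for the real page source
sample = '''
<li><a href="monthdata.php?city=北京">北京</a></li>
<li><a href="monthdata.php?city=上海">上海</a></li>
<li><a href="monthdata.php?city=北京">北京</a></li>
'''
print(extract_cities(sample))  # ['北京', '上海']
```

In a real run the html argument would be the text of the index page, and the resulting names would be written to cities.txt one per line, which is the format the crawler code reads.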

2.2 Fetching monthly averages

Taking Beijing again: its monthly URL is https://www.aqistudy.cn/historydata/monthdata.php?city=北京. As with the city list, I first fetched the HTML with requests and then searched it with BeautifulSoup.

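import requests
from bs4 import BeautifulSoup

city = '北京'
url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=%s' % (city)
html = requests.get(url)
# Name the parser explicitly instead of letting bs4 pick one
soup = BeautifulSoup(html.text, 'html.parser')
td_lists = soup.find_all('td')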
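
However, td_lists ended up with no usable values. Printing the whole response showed that the HTML does not contain the table we want; td markup appears only inside the body of a function. This means the data is written into the page dynamically by JavaScript, so a plain requests call cannot retrieve the complete HTML.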
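
One way to check this without reading the whole response by eye is to count the td tags that are real markup rather than text inside a script. The sketch below uses only the standard library; the class CellCounter, the function count_static_cells, and the two sample pages are made up for illustration.

```python
from html.parser import HTMLParser

class CellCounter(HTMLParser):
    """Count <td> start tags that are real markup.

    HTMLParser treats the content of <script> as raw text, so a <td>
    that only appears inside JavaScript is not counted.
    """
    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.cells += 1

def count_static_cells(html):
    parser = CellCounter()
    parser.feed(html)
    parser.close()
    return parser.cells

# The first page builds its rows in JavaScript, the second has them in the markup
dynamic_page = '<table><tbody></tbody></table><script>var row = "<td>x</td>";</script>'
static_page = '<table><tr><td>2013-12</td><td>84</td></tr></table>'
print(count_static_cells(dynamic_page), count_static_cells(static_page))  # 0 2
```

A count of zero on the raw response is a strong hint that the table is filled in by JavaScript after the page loads.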

function showTable(items) {
  items.forEach(function(item) {
    // $('.table tbody').append(`
    //   <tr>
    //     <td align="center"><a href="daydata.php?city=${city}&month=${item.time_point}">${item.time_point}</a></td>
    //     <td align="center">${item.aqi}</td>
    //     <td align="center">${item.min_aqi}~${item.max_aqi}</td>
    //     <td align="center"><span>${item.quality}</span></td>
    //     <td align="center">${item.pm2_5}</td>
    //     <td align="center">${item.pm10}</td>
    //     <td align="center">${item.so2}</td>
    //     <td align="center">${item.co}</td>
    //     <td align="center">${item.no2}</td>
    //     <td align="center">${item.o3}</td>
    //   </tr>
    // `);
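
After looking at a few options, I chose selenium.webdriver to load the page and obtain the complete HTML.
Put simply, selenium.webdriver wraps the browser's native API in a more object-oriented Selenium WebDriver API that can manipulate the elements in a page directly, and even the browser itself.
I use Chrome here; a chromedriver matching the local Chrome version can be downloaded from http://chromedriver.chromium.org/.

The code is below. Note that a wait is needed after driver.get(url): the browser needs time to load the page and return the full content, otherwise the page source is incomplete.
The code also uses the read_html function from pandas, which extracts the tables in a page directly and stores them as pandas DataFrame objects.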
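
A fixed sleep either wastes time or is too short on a slow load. Selenium offers explicit waits (WebDriverWait) for this; the library-independent idea behind them is to poll until a condition holds or a timeout expires, sketched below. The helper wait_until and the stand-in check are hypothetical and not part of the original program.

```python
import time

def wait_until(check, timeout=10.0, interval=0.5):
    """Call check() until it returns a truthy value or the timeout expires.

    Returns that truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for "does the page contain table rows yet?": succeeds on the 3rd call
attempts = []

def fake_check():
    attempts.append(1)
    return 'rows' if len(attempts) >= 3 else None

print(wait_until(fake_check, timeout=2.0, interval=0.01))  # rows
```

With a real driver, check could parse driver.page_source and return the table only once it has rows, so fast pages are not delayed and slow pages get more time.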

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Path to the chromedriver executable
DRIVER_PATH = 'C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe'

# Read the city names, one per line
def get_city_set():
    with open('./cities.txt', 'r') as f:
        reader = f.readlines()
    for i in range(len(reader)):
        reader[i] = reader[i].split('\n')[0]
    return reader

city_set = get_city_set()

# Run the browser without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')

count = 0

file_name = 'C:/Temp/data_out/全国城市空气污染月均数据.txt'
fp = open(file_name, 'w')
# Monthly averages
for city in city_set:
    # To test a single city, override it here, e.g. city = '南京'

    # Build the monthly-average URL for this city
    url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=%s' % (city)
    # Start the browser (Selenium 4 call style; Selenium 3 took the driver
    # path as the first argument together with chrome_options=)
    driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=chrome_options)
    # Load the page
    driver.get(url)
    # Wait for the browser to finish loading the page (important)
    time.sleep(1)
    # Take the first table in the page
    dfs = pd.read_html(driver.page_source, header=0)[0]
    # If the table is empty, reload and wait longer
    if len(dfs) == 0:
        driver.get(url)
        print('Please wait %s seconds.' % (3))
        time.sleep(3)
        dfs = pd.read_html(driver.page_source, header=0)[0]
    # Write the rows to the output file
    for j in range(0, len(dfs)):
        date = dfs.iloc[j, 0]
        aqi = dfs.iloc[j, 1]
        grade = dfs.iloc[j, 2]
        pm25 = dfs.iloc[j, 3]
        pm10 = dfs.iloc[j, 4]
        so2 = dfs.iloc[j, 5]
        co = dfs.iloc[j, 6]
        no2 = dfs.iloc[j, 7]
        o3 = dfs.iloc[j, 8]
        fp.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (city, date, aqi, grade, pm25, pm10, so2, co, no2, o3)))
    print('%d---%s---DONE' % (len(dfs), city))
    # Close the browser
    driver.quit()
    count += 1
    print('Finished crawling %s. Please check the output!' % (city))

fp.close()

The program starts running.
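
The row-writing loop in the crawler formats every line by hand. As a possible alternative, the standard csv module can write the same tab-separated layout and also add a header row. This is only a sketch: the helper write_rows is hypothetical, the column names are taken from the variable names in the crawler, and the numbers in the sample row are placeholders, not real measurements.

```python
import csv
import io

# Column names taken from the variable names used in the crawler
COLUMNS = ['city', 'date', 'aqi', 'grade', 'pm25', 'pm10', 'so2', 'co', 'no2', 'o3']

def write_rows(fp, city, rows):
    """Write one tab-separated line per row, prefixed with the city name."""
    writer = csv.writer(fp, delimiter='\t', lineterminator='\n')
    for row in rows:
        writer.writerow([city] + list(row))

# An in-memory buffer stands in for the output file
buf = io.StringIO()
csv.writer(buf, delimiter='\t', lineterminator='\n').writerow(COLUMNS)
# Placeholder values only, to show the layout
write_rows(buf, '北京', [('2013-12', 1, 2, 3, 4, 5, 6, 7, 8)])
print(buf.getvalue())
```

A header row makes the file easier to load back later, for example with pandas.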

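2.3 Daily averages by city

The daily crawler follows the same logic as the monthly one. The only differences are that the month is added when building the URL and that the output goes into a separate file per city. The code is as follows: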

import time
from urllib import parse

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Path to the chromedriver executable
DRIVER_PATH = 'C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe'

# Download the daily data for every city and month
def download_city(city_set, month_set):
    count = 0
    base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
    for city in city_set:
        # To test a single city, override it here, e.g. city = '成都'
        # chrome_options is the module-level object created in the main block
        # (Selenium 4 call style; Selenium 3 took the driver path as the
        # first argument together with chrome_options=)
        driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=chrome_options)
        # One output file per city
        file_name = 'C:/Temp/data_out/日均值_%s.txt' % (city)
        fp = open(file_name, 'w')
        for month in month_set:
            weburl = ('%s%s&month=%s' % (base_url, parse.quote(city), month))
            driver.get(weburl)
            time.sleep(1)
            dfs = pd.read_html(driver.page_source, header=0)[0]
            # If the table is empty, reload and wait longer
            if len(dfs) == 0:
                driver.get(weburl)
                print('Please wait %s seconds.' % (3))
                time.sleep(3)
                dfs = pd.read_html(driver.page_source, header=0)[0]
                # Still empty after the retry: skip this month
                if len(dfs) == 0:
                    print('%d---%d---%s---%s---DONE' % (count, len(dfs), city, month))
                    continue
            for j in range(0, len(dfs)):
                date = dfs.iloc[j, 0]
                aqi = dfs.iloc[j, 1]
                grade = dfs.iloc[j, 2]
                pm25 = dfs.iloc[j, 3]
                pm10 = dfs.iloc[j, 4]
                so2 = dfs.iloc[j, 5]
                co = dfs.iloc[j, 6]
                no2 = dfs.iloc[j, 7]
                o3 = dfs.iloc[j, 8]
                fp.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (city, date, aqi, grade, pm25, pm10, so2, co, no2, o3)))
            print('%d---%d---%s---%s---DONE' % (count, len(dfs), city, month))
        fp.close()
        driver.quit()
        print('Finished crawling %s. Please check the output!' % (city))
        count += 1

# Read the city names, one per line
def get_city_set():
    with open('./cities.txt', 'r') as f:
        reader = f.readlines()
    for i in range(len(reader)):
        reader[i] = reader[i].split('\n')[0]
    return reader

# Build the list of months to crawl
def get_month_set():
    month_set = ['201312']
    for year in [2014, 2015, 2016, 2017, 2018]:
        for month in ['01', '02', '03', '04', '05', '06',
                      '07', '08', '09', '10', '11', '12']:
            month_set.append('%s%s' % (year, month))
    month_set.extend(['201901', '201902', '201903'])
    return month_set

if __name__ == '__main__':
    # Months to crawl
    month_set = get_month_set()
    # Cities to crawl
    city_set = get_city_set()

    # Run the browser without a visible window
    chrome_options = Options()
    chrome_options.add_argument('--headless')

    # Download the daily data for every city and month
    download_city(city_set, month_set)

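IV. Shortcomings

Fetching all the data for a single city does not take long, but with close to 400 cities a single-threaded run takes a lot of time. The crawler still needs to be changed to run cities concurrently on multiple threads to save time.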
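
A minimal sketch of what that change could look like, using a thread pool from the standard library. The function crawl_all and the stand-in worker fake_crawl are hypothetical; in a real run the worker would open its own browser and write its own per-city file, as the daily crawler already does for each city.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_all(cities, crawl_city, max_workers=4):
    """Run crawl_city(city) for every city on a thread pool.

    Returns two dicts: results by city, and exceptions by city.
    """
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(crawl_city, city): city for city in cities}
        for future in as_completed(futures):
            city = futures[future]
            try:
                done[city] = future.result()
            except Exception as exc:
                # Keep going; failed cities can be retried afterwards
                failed[city] = exc
    return done, failed

# Stand-in worker: pretends one city fails to load
def fake_crawl(city):
    if city == '上海':
        raise RuntimeError('page did not load')
    return len(city)

done, failed = crawl_all(['北京', '上海', '南京'], fake_crawl, max_workers=2)
print(sorted(done), sorted(failed))
```

Each thread must use its own webdriver instance rather than sharing one, and max_workers should stay small, since every worker is a full headless browser.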

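V. Supplementary material

1. Other air pollution data

While looking for historical air pollution data, I also found another nationwide historical air quality dataset. That site compiles hourly historical data for many cities and monitoring stations across the country and is updated daily.

2. City latitude and longitude

I also collected the latitude and longitude of every city and district on the site. Most came from city coordinates already available online; some districts could not be found there, so I looked them up manually and filled them in. The complete file of cities and their coordinates is available via the link in the original post (extraction code: s4vj).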
