Project Showcase
The project took seven days to complete, and the page implements six features: on the left, the risk-level ratings for each region (top) and Baidu's daily hot searches (bottom); in the middle, the overall data display (top) and a real-time data map (bottom); on the right, the provincial ranking by cumulative confirmed cases (top) and a count of the provinces reporting new cases today (bottom). All of this data is collected by scrapers from three different websites. Below I walk through how to scrape the data behind each of these six features, with full source code you can copy into your own project to study.
Python web scrapers
Scraping NetEase News epidemic data with Python
The NetEase News epidemic page is at https://wp.m.163.com/163/page/news/virus_report/index.html?nw=1&anw=1. The numbers on this page come from a JSON API, which you can find under the Network tab of the F12 developer tools, and JSON is easy to process into a CSV file. The code is as follows:
Scrape the NetEase News epidemic data into a CSV file:
import json
import time  # only needed for the optional dated file name below
import pandas as pd
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}
# The JSON API behind the NetEase epidemic page (found via F12 -> Network)
url = 'https://c.m.163.com/ug/api/wuhan/app/data/list-total?'
res = requests.get(url, headers=headers)
data_json = json.loads(res.text)
data = data_json['data']
# areaTree[2] was China at the time of writing; its children are the provinces
data_province = data['areaTree'][2]['children']
print(data_province)
# Basic fields plus today's and cumulative figures for every province
free_data = pd.DataFrame(data_province)[['id', 'lastUpdateTime', 'name']]
today_data = pd.DataFrame([province['today'] for province in data_province])
total_data = pd.DataFrame([province['total'] for province in data_province])
# Prefix the column names so the 'today' and 'total' columns don't collide
today_data.columns = ("today_" + i for i in today_data.columns)
total_data.columns = ("total_" + i for i in total_data.columns)
China_data = pd.concat([free_data, today_data, total_data], axis=1)
# file_name = 'china_provinces_today' + '_' + time.strftime('%Y_%m_%d', time.localtime(time.time())) + '.csv'
file_name = 'C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/china_today_data' + '.csv'
China_data.to_csv(file_name, index=None, encoding='utf_8_sig')
print("Saved today's per-province epidemic data for China!")
The idea: fetch the JSON the endpoint returns, then use pandas to reshape it into just the fields we want.
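If you want to sanity-check the JSON structure yourself before parsing, here is a minimal sketch; only the keys 'data', 'areaTree', 'today' and 'total' are relied on by the script above, and anything else is whatever the API happens to return at the time you run it:
# Peek at the JSON before committing to a parse.
import json
import requests

url = 'https://c.m.163.com/ug/api/wuhan/app/data/list-total?'
headers = {'user-agent': 'Mozilla/5.0'}  # a minimal UA; the full one above works too
data_json = json.loads(requests.get(url, headers=headers).text)
print(list(data_json.keys()))             # look for 'data'
print(list(data_json['data'].keys()))     # look for 'areaTree'
node = data_json['data']['areaTree'][2]   # index 2 was China when this was written
print(node['name'], list(node.keys()))    # expect 'today', 'total' and 'children' among them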
Next, import the CSV data into MySQL:
import codecs
import csv
import pymysql

def get_conn():
    # Local MySQL; adjust the port, user, password and database to your setup
    conn = pymysql.connect(host='localhost', port=3307, user='root', passwd='123456', db='china_covid-19', charset='utf8')
    return conn

def insert(cur, sql, args):
    cur.execute(sql, args)

def read_csv_to_mysql(filename):
    with codecs.open(filename=filename, mode='r', encoding='utf-8') as f:
        reader = csv.reader(f)
        head = next(reader)  # skip the CSV header row
        conn = get_conn()
        cur = conn.cursor()
        # 18 placeholders, one per column of the china_today_data table
        sql = 'insert into china_today_data values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        # Clear out the old rows so each run replaces the whole table
        cur.execute("delete from china_today_data where 1=1")
        conn.commit()
        for item in reader:
            args = tuple(item)
            insert(cur, sql=sql, args=args)
        conn.commit()
        cur.close()
        conn.close()

if __name__ == '__main__':
    read_csv_to_mysql('C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/china_today_data.csv')
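As a side note, inserting row by row works but gets slow on bigger files; pymysql's executemany can send the whole CSV in one call. A minimal sketch, reusing get_conn() from the script above:
# Batch alternative to the row-by-row insert() loop.
def read_csv_to_mysql_batch(filename):
    with codecs.open(filename=filename, mode='r', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = [tuple(item) for item in reader]
    conn = get_conn()
    cur = conn.cursor()
    cur.execute("delete from china_today_data where 1=1")
    sql = 'insert into china_today_data values(' + ','.join(['%s'] * 18) + ')'
    cur.executemany(sql, rows)  # one call for all rows
    conn.commit()
    cur.close()
    conn.close()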
My MySQL table structure for the NetEase News epidemic data:
The china_today_data table has 18 fields, every one of type varchar, because this CSV-to-MySQL import passes every value through as a plain string.
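If you want to build a matching table, here is a minimal sketch that generates the 18-column varchar DDL straight from the CSV header, so the column names always match whatever the scraper wrote; the varchar(255) length is my assumption:
# Generate the CREATE TABLE statement from the CSV header.
import codecs
import csv

csv_path = 'C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/china_today_data.csv'
with codecs.open(csv_path, 'r', encoding='utf-8') as f:
    head = next(csv.reader(f))
cols = ', '.join('`{}` varchar(255)'.format(c) for c in head)  # length is an assumption
print('create table if not exists china_today_data ({});'.format(cols))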
Scraping the risk-level ratings for each region with Python
The risk-level ratings for each region are published at http://www.gd.gov.cn/gdywdt/zwzt/yqfk/content/post_3021711.html. This page's address stays fixed, so we can scrape it reliably. The code is as follows:
Scrape the risk-level data for each region:
# Scrape the list of medium- and high-risk areas
import requests
from bs4 import BeautifulSoup
import pandas as pd

def getHTML(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getContent(url):
    html = getHTML(url)
    soup = BeautifulSoup(html, 'html.parser')
    # Grab the article title plus every paragraph on the page
    paras_tmp = soup.select('.zw-title') + soup.select('p')
    paras = paras_tmp[0:]
    return paras

def saveFile(text):
    # datetimes = time.strftime("%Y-%m-%d", time.localtime(time.time()))
    fname = "C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/" + r"Webdata.txt"
    f = open(fname, 'w')
    for t in text:
        if len(t) > 0:
            f.writelines(t.get_text() + "\n\n")
    f.close()

def saveCSV(name, list):
    # Five columns, named by the `name` list passed in from main()
    test = pd.DataFrame(columns=name, data=list)
    # datetimes = time.strftime("%Y-%m-%d", time.localtime(time.time()))
    fname = "C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/" + r"list_of_grade_risk_areas.csv"
    test.to_csv(fname, encoding='utf-8')

def main():
    url = 'http://www.gd.gov.cn/gdywdt/zwzt/yqfk/content/mpost_3021711.html'
    text = getContent(url)
    saveFile(text)
    # Read the text file back in line by line
    fp = open('C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/Webdata.txt', 'r')
    lines = fp.readlines()
    ss2 = []
    for line in lines:
        # Strip the trailing newline and full-width spaces; without this,
        # stray '\n' characters would end up in the output
        line = line.strip('\n')
        line = line.strip('\u3000')
        ss = line.split('\n')
        if ss != ['']:  # skip blank lines
            ss2.append(ss)
    while [''] in ss2:  # drop any blank entries that slipped through
        ss2.remove([''])
    ss2.pop()  # drop the last line, which is not part of the data
    name = ['title', 'high_risk_areas', 'high_areas', 'low_risk_areas', 'low_areas']
    new_ss2 = []
    list = []
    for i in range(0, len(ss2)):
        new_ss2.append(ss2[i][0])
    list.append(new_ss2)  # a single row holding the five extracted fields
    saveCSV(name, list)
    fp.close()

main()
This scraper's approach: pull down everything the page renders as HTML, filter out the fields we want, write them to a txt file, then read the txt file back and re-save the data as a CSV.
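The commented-out datetimes lines in the code above hint at keeping one snapshot per day instead of overwriting the same files on every run; spelled out, that idea looks like this:
# Date-stamped output file names, as sketched by the commented-out `datetimes` lines.
import time

datetimes = time.strftime("%Y-%m-%d", time.localtime(time.time()))
fname = "C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/" + datetimes + r"_list_of_grade_risk_areas.csv"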
Next, import the CSV data into MySQL:
import codecs
import csv
import pymysql

def get_conn():
    conn = pymysql.connect(host='localhost', port=3307, user='root', passwd='123456', db='china_covid-19', charset='utf8')
    return conn

def insert(cur, sql, args):
    cur.execute(sql, args)

def read_csv_to_mysql(filename):
    with codecs.open(filename=filename, mode='r', encoding='utf-8') as f:
        reader = csv.reader(f)
        head = next(reader)  # skip the CSV header row
        conn = get_conn()
        cur = conn.cursor()
        # 6 placeholders: the CSV index column plus the five named fields
        sql = 'insert into list_of_grade_risk_areas values(%s,%s,%s,%s,%s,%s)'
        cur.execute("delete from list_of_grade_risk_areas where 1=1")  # clear old rows
        conn.commit()
        for item in reader:
            args = tuple(item)
            insert(cur, sql=sql, args=args)
        conn.commit()
        cur.close()
        conn.close()

if __name__ == '__main__':
    read_csv_to_mysql('C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/list_of_grade_risk_areas.csv')
My MySQL table structure for the risk-level ratings by region:
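For reference, here is a table definition matching the six insert placeholders: the CSV's unnamed index column plus the five fields named in the scraper, all varchar. The column name row_index and the varchar(255) lengths are my assumptions; match them to however you built your own table.
# A matching table, created via get_conn() from the import script above.
ddl = ("create table if not exists list_of_grade_risk_areas ("
       "`row_index` varchar(255), `title` varchar(255), "
       "`high_risk_areas` varchar(255), `high_areas` varchar(255), "
       "`low_risk_areas` varchar(255), `low_areas` varchar(255))")
conn = get_conn()
conn.cursor().execute(ddl)
conn.commit()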
That is the whole process of scraping the risk-level ratings for each region: collecting the data with the scraper, converting it to a CSV file, and storing it in MySQL. I hope it helps a little; let's keep working and improving together!
Scraping the Baidu hot-search list with Python
The Baidu hot-search list lives at http://top.baidu.com/buzz?b=1&fr=20811. This address stays fixed, so once the script is written it can keep collecting data undisturbed. The code is as follows:
Scrape the Baidu hot-search data into a CSV file:
# Scrape the Baidu hot-search list into a CSV file
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time  # only needed for the optional dated file name below

def get_html(url, headers):
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding
    return r.text

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]  # skip the table header row
    data = []
    for each_topic in all_topics:
        topic_times = each_topic.find('td', class_='last')    # search index
        topic_rank = each_topic.find('td', class_='first')    # rank
        topic_name = each_topic.find('td', class_='keyword')  # title
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
            topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
            topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
            # print('rank:{}, title:{}, heat:{}'.format(topic_rank, topic_name, topic_times))
            # Labels 排名/标题/热度 = rank/title/heat, kept in Chinese for the dashboard
            tplt = "排名:{0:^4} 标题:{1:{3}^15} 热度:{2:^8}"
            data.append(tplt.format(topic_rank, topic_name, topic_times, chr(12288)))
    # A single column named 'news'; each row is one formatted rank/title/heat string
    test = pd.DataFrame(columns=['news'], data=data)
    # datetimes = time.strftime("%Y-%m-%d", time.localtime(time.time()))
    # fname = "./" + datetimes + r"baidu_hot_search.csv"
    fname = "C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/" + r"baidu_hot_search.csv"
    test.to_csv(fname, encoding='utf-8')

def main():
    # URL of the Baidu hot-search ranking
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'}
    html = get_html(url, headers)
    get_pages(html)

if __name__ == '__main__':
    main()
The idea: fetch the page's HTML, use BeautifulSoup to pull the rank, title, and heat out of each table row, and write the rows into a CSV file, which can then be imported straight into MySQL so the scraped data becomes database records.
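One detail worth calling out: chr(12288) is the full-width CJK space (U+3000), used as the fill character so that padded Chinese titles stay aligned in fixed-width output. A quick illustration:
# chr(12288) pads with full-width spaces, keeping Chinese text aligned.
tplt = "排名:{0:^4} 标题:{1:{3}^15} 热度:{2:^8}"
print(tplt.format('1', '示例标题', '123456', chr(12288)))  # '示例标题' is a placeholder title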
Next, import the CSV data into MySQL:
# Import the CSV file into MySQL
import codecs
import csv
import pymysql

def get_conn():
    conn = pymysql.connect(host='localhost', port=3307, user='root', passwd='123456', db='china_covid-19', charset='utf8')
    return conn

def insert(cur, sql, args):
    cur.execute(sql, args)

def read_csv_to_mysql(filename):
    with codecs.open(filename=filename, mode='r', encoding='utf-8') as f:
        reader = csv.reader(f)
        head = next(reader)  # skip the CSV header row
        conn = get_conn()
        cur = conn.cursor()
        # 2 placeholders: the CSV index column plus the 'news' column
        sql = 'insert into baidu_hot_search values(%s,%s)'
        cur.execute("delete from baidu_hot_search where 1=1")  # clear old rows
        conn.commit()
        for item in reader:
            args = tuple(item)
            insert(cur, sql=sql, args=args)
        conn.commit()
        cur.close()
        conn.close()

if __name__ == '__main__':
    read_csv_to_mysql('C:/Users/DELL/Desktop/machine_learning/COVID-19_crawler/baidu_hot_search.csv')
My MySQL table structure for the Baidu hot-search list:
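Matching the two insert placeholders, the table needs just two columns: the CSV's unnamed index plus news. A sketch, again with row_index and the lengths as my assumptions:
# Two varchar columns to match 'insert into baidu_hot_search values(%s,%s)'.
ddl = ("create table if not exists baidu_hot_search ("
       "`row_index` varchar(255), `news` varchar(255))")
conn = get_conn()  # get_conn() from the import script above
conn.cursor().execute(ddl)
conn.commit()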
That covers writing a scraper for the Baidu hot-search list, saving the results to a CSV file, and loading the CSV into MySQL. Note that your MySQL tables must be structured the same way as mine for the import to work; otherwise you will need to adapt the code. I hope this cleared things up, and thank you for reading!
Summary:
This article showed how to scrape web data with Python and load it into a MySQL database. For the China COVID-19 dashboard itself, the backend is PHP, the frontend uses jQuery + ECharts for the big-screen display, and the two communicate over Ajax. The code above is enough to get you the data you need, and you could just as well build the backend in Java or something else. Data is the oil: once you have the oil, building the car is up to you! I hope this article helps a little. I'm 黑马Jack, and on this programming road let's cut through the thorns and keep moving forward together!
-- 黑马Jack