Graduation Project: Preliminary Data Crawling

The crawler workflow:

Define the goal: what information do we want to crawl?

Find the source: which website will we crawl, and what is its page structure?

Send requests: start the crawler and send HTTP requests.

Parse the data: extract the fields we need.

Store the data: save the results (a minimal sketch of all five steps follows this list).
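To make the five steps concrete, here is a minimal sketch using requests and BeautifulSoup (both introduced below). The target https://example.com and the output file name are placeholders, not part of the actual project:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'                    # step 2: the target site (placeholder)
res = requests.get(url)                        # step 3: send the request
soup = BeautifulSoup(res.text, 'html.parser')  # step 4: parse the HTML
title = soup.title.get_text()                  # step 4: extract one field
with open('./title.txt', 'w', encoding='utf-8') as fp:
    fp.write(title)                            # step 5: store the result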

Environment:

OS: Win10

Python: 3.7

IDE: PyCharm

Installing and using the requests library:

Fetching a page usually uses GET, with the parameters appended to the URL; submitting a form usually uses POST, where the parameters travel in the request body rather than the URL.
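A quick way to see the difference is the public echo service httpbin.org (used here only as an illustrative target):

import requests

r1 = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(r1.url)    # https://httpbin.org/get?q=python (GET parameters follow the URL)

r2 = requests.post('https://httpbin.org/post', data={'q': 'python'})
print(r2.url)    # https://httpbin.org/post (POST parameters stay in the request body)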

Install:

pip install requests

To check whether the install succeeded:

pip show requests

If it installed successfully, pip prints the package metadata, something like the following (the version and location are only illustrative and will differ on your machine):
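Name: requests
Version: 2.21.0
Summary: Python HTTP for Humans.
Location: c:\python37\lib\site-packages
Requires: certifi, chardet, idna, urllib3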

Using the GET method to request www.baidu.com:

import requests  # import the library

url = 'http://www.baidu.com'  # the URL to GET

res = requests.get(url)  # call get() to fetch the response

print(res.content)  # raw response body (bytes)
print(res.status_code)  # status code; 200 means OK
print(res.request.headers)  # the request headers that were sent
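Note that res.content is the raw response bytes, while res.text is the same body decoded to a string, with the encoding guessed by requests. If res.text comes out garbled, you can set the encoding yourself before reading it; a small sketch:

import requests

res = requests.get('http://www.baidu.com')
res.encoding = 'utf-8'   # override the guessed encoding (Baidu's page is UTF-8)
print(res.text)          # decoded HTML text instead of raw bytes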

Sometimes a site's anti-crawling measures block the crawler from the start. In that case we need to define a custom User-Agent header so the request looks like it comes from a browser rather than a Python script:

import requests

url = 'https://www.xicidaili.com/nn/'    # this site rejects the request because the default User-Agent identifies a Python client; we change it to a browser's

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
}     # the User-Agent now identifies a browser; with this the page can be crawled successfully

res = requests.get(url=url, headers=headers)
code = res.status_code
print(code)

if code == 200:
    with open('./test.html', 'w', encoding='utf-8') as fp:
        fp.write(res.text)  # save the page locally for inspection
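To confirm the custom header was actually sent, one line can be appended to the script above; res.request.headers is the same attribute printed in the first example:

print(res.request.headers['User-Agent'])  # now shows the Mozilla/5.0 string instead of python-requests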

Making a request with the POST method:

import requests

url = 'https://fanyi.baidu.com/?aldtype=16047#auto/zh'

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
}
data = {
    'kw': '你好'
}  # the form field; POST sends it in the request body, not the URL


res = requests.post(url=url, headers=headers, data=data)

print(res.status_code)


if res.status_code == 200:
    with open('./fanyi.html', 'w', encoding='utf-8') as fp:
        fp.write(res.text)
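The POST above just returns the translation page itself. Many tutorials instead query the suggestion endpoint, which answers with JSON for the submitted keyword; a hedged sketch, assuming the /sug endpoint still behaves this way:

import requests

url = 'https://fanyi.baidu.com/sug'   # suggestion endpoint (an assumption; it may change)
data = {'kw': '你好'}
res = requests.post(url=url, data=data)
if res.status_code == 200:
    print(res.json())                 # JSON body with translation suggestions for 'kw'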

Complete code:

Crawling the recommended articles from the CSDN homepage:

We use BeautifulSoup to parse the page, and the headers disguise the request with my own browser identity (User-Agent plus logged-in Cookie); the script prints each article's title, author name, and summary.

import requests
from bs4 import BeautifulSoup
url = 'https://www.csdn.net/'

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Cookie': 'Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_18807340900-1577241599852-974584!5744*1*caicai779369786; uuid_tt_dd=10_18807340900-1577241599852-974584; UserNick=cs_yougar; dc_session_id=10_1577241599852.972347; UserName=caicai779369786; UserToken=a69c341ae7324fc485d7e644984332ea; searchHistoryArray=%255B%2522xpath%2522%252C%2522%25E7%2588%25B1%25E5%25A5%2587%25E8%2589%25BA%25E8%25BD%25AC%25E7%25A0%2581%2522%252C%2522dos%2520%25E8%25A7%2586%25E9%25A2%2591%25E8%25BD%25AC%25E7%25A0%2581%2522%252C%2522%25E7%2588%25B1%25E5%25A5%2587%25E8%2589%25BA%25E8%25BD%25AC%25E6%258D%25A2mp4%25E6%25A0%25BC%25E5%25BC%258F%2522%252C%2522python%25E5%259F%25BA%25E7%25A1%2580%25E6%2595%2599%25E7%25A8%258B%2522%255D; BT=1577862438892; p_uid=U000000; UserInfo=a69c341ae7324fc485d7e644984332ea; AU=69D; UN=caicai779369786; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1579057054,1579057582,1579060359,1579231198; dc_tos=q48ela; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1579231198; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F103603408%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D'
}
res = requests.get(url=url, headers=headers)
code = res.status_code
print(code)
soup = BeautifulSoup(res.text, "html.parser")
names = soup.select('#feedlist_id > li > div > div.title > h2 > a')        # article titles
writers = soup.select('#feedlist_id > li > div > dl > dd > a')             # author links
summaries = soup.select('#feedlist_id > li > div > div.summary.oneline')   # one-line summaries
for name in names:
    print(name.get_text())
for writer in writers:
    print(writer.get_text())
for summary in summaries:
    print(summary.get_text())
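To finish the workflow's last step (storing the data), the three result lists from the script above can be zipped and written to a CSV file; a minimal sketch, assuming the three selectors return items in matching one-to-one order (csdn.csv is an illustrative file name):

import csv

with open('./csdn.csv', 'w', encoding='utf-8', newline='') as fp:
    out = csv.writer(fp)
    out.writerow(['title', 'author', 'summary'])
    for n, w, s in zip(names, writers, summaries):
        out.writerow([n.get_text().strip(), w.get_text().strip(), s.get_text().strip()])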

For convenience, I will also share the dataset I compiled: (under review, please wait)


Reposted from blog.csdn.net/caicai779369786/article/details/103921656