3. Learning Distributed Crawlers: Day 3

Basic usage of the requests library (third-party)

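Although the urllib module in Python's standard library already covers most of the functionality we normally need, its API is awkward to use. Requests bills itself as "HTTP for Humans", meaning it aims to be simpler to work with.
Requests is written in Python and built on urllib3. It is more convenient than urllib, saves a lot of work, and covers the needs of everyday HTTP testing.
Install: pip install requests
Sending a GET request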

import requests
# Add a header and query parameters
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
}
# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode is not needed
kw = {
    'wd': '中国'
}
# Send the GET request
response = requests.get('http://www.baidu.com/s', headers=headers, params=kw)
print(response)          # <Response [200]>
print(response.url)      # final URL, including the encoded query string
print(response.text)     # body decoded to str
print(response.content)  # body as raw bytes
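
To see what params does to the URL without sending anything, the request can be built and prepared locally. The following is a minimal sketch using requests.Request and its prepare() method; the variable name prepared is illustrative and not part of the original example.

```python
import requests

# Same query parameters as in the GET example
kw = {'wd': '中国'}

# Build the request and prepare it locally; nothing is sent over the network
prepared = requests.Request('GET', 'http://www.baidu.com/s', params=kw).prepare()

# The dict has been percent-encoded (UTF-8) and appended as the query string
print(prepared.url)  # http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
```

This is the same encoding that urllib.parse.urlencode would produce by hand, which is why no explicit urlencode call is needed.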

Sending a POST request

resp = requests.post(url, data=form_data, headers=headers)  # form_data: a dict of form fields
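
As a rough illustration of what passing a dict as data means, the sketch below prepares a POST locally and inspects the body and Content-Type that would be sent. The form fields and the httpbin URL are assumptions chosen for the example, not values from the original post.

```python
import requests

# Hypothetical form fields and endpoint, for illustration only
form = {'username': 'alice', 'password': 'secret'}
url = 'http://www.httpbin.org/post'

# Prepare the POST locally to inspect what would be sent
prepared = requests.Request('POST', url, data=form).prepare()

print(prepared.body)                     # username=alice&password=secret
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded

# To actually send it:
# resp = requests.post(url, data=form)
```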

Using a proxy
To use a proxy with requests, pass the proxies parameter to the request method (get, post).

# Without a proxy
# import requests
# url = 'http://www.httpbin.org/ip'
# resp = requests.get(url)
# print(resp.text)  # "origin": "111.29.161.238"

# With a proxy
import requests
proxy = {
    'http': 'http://182.101.207.11:8080',  # applies to http:// URLs only
}
url = 'http://www.httpbin.org/ip'
resp = requests.get(url, proxies=proxy)
print(resp.text)  # "origin": "182.101.207.11"
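
Public proxies like the one above tend to go offline without notice, and a proxies dict with only an 'http' key is not used for https:// URLs. A slightly more defensive sketch might look like the following; build_proxies and fetch_origin are hypothetical helper names, and the timeout value is an arbitrary assumption.

```python
import requests

def build_proxies(host, port):
    """Return a proxies dict that covers both http:// and https:// URLs."""
    proxy_url = f'http://{host}:{port}'
    # requests picks the proxy according to the scheme of the target URL
    return {'http': proxy_url, 'https': proxy_url}

def fetch_origin(proxies, timeout=5):
    """Ask httpbin which IP it sees; return None if the request fails."""
    try:
        resp = requests.get('http://www.httpbin.org/ip',
                            proxies=proxies, timeout=timeout)
        return resp.json().get('origin')
    except requests.exceptions.RequestException:
        # Covers proxy errors, timeouts and connection failures
        return None

proxies = build_proxies('182.101.207.11', 8080)
print(proxies)
```

Calling fetch_origin(proxies) then returns the IP address httpbin sees, or None if the proxy is unreachable.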

Handling cookies with requests
If a response sets cookies, they can be read from the response's cookies attribute.

# Reading cookies with requests
import requests
resp = requests.get('http://www.baidu.com')
print(resp.cookies)             # a RequestsCookieJar
print(resp.cookies.get_dict())  # the same cookies as a plain dict
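
The cookies attribute is a RequestsCookieJar. To get a feel for what get_dict() returns without hitting a real site, a jar can be filled by hand, as in this small sketch. The cookie name, value and domain are made up for illustration.

```python
import requests

# Build a cookie jar by hand; name, value and domain are made-up examples
jar = requests.cookies.RequestsCookieJar()
jar.set('session_token', 'abc123', domain='example.com', path='/')

# get_dict() flattens the jar into a plain {name: value} dict
cookie_dict = jar.get_dict()
print(cookie_dict)  # {'session_token': 'abc123'}

# A jar (or a plain dict) can be sent with a later request via the cookies parameter:
# resp = requests.get('http://example.com/', cookies=jar)
```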

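Sharing cookies with a requests session
requests can also carry cookies from one request to the next through its Session object. Note: this is not the session of web development (server-side state); it is simply a client-side object that represents one conversation.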

# Share cookies across requests with a requests Session
import requests
# Login page URL
post_url = 'https://i.meishi.cc/login_t.php?redirect=https%3A%2F%2Fi.meishi.cc%2Fcook.php%3Fid%3D14417636%26session_id%3Da04cb701c24a3c81e9891fadc13e56ae'
post_data = {
    'username': '157*****414',
    'password': '*****'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}
sess = requests.Session()  # create a session object
sess.post(post_url, data=post_data, headers=headers)  # after logging in, sess holds the cookies

# Personal page URL
url = 'https://i.meishi.cc/cook.php?id=14417636'
resp = sess.get(url, headers=headers)  # the session sends the stored cookies, so the personal page can be fetched
print(resp.text)
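
The mechanism can be observed offline: whatever the session has stored is attached to each later request prepared through it. In the sketch below the cookie is set by hand to stand in for what a login response would store; the domain, cookie and User-Agent string are all assumptions for illustration.

```python
import requests

sess = requests.Session()
# Headers set on the session are sent with every request made through it
sess.headers.update({'User-Agent': 'demo-crawler/0.1'})  # hypothetical UA string
# Simulate the cookie a login response would have stored (made-up values)
sess.cookies.set('token', 'abc123', domain='example.com', path='/')

# Prepare (but do not send) a later request through the same session
req = requests.Request('GET', 'https://example.com/profile')
prepared = sess.prepare_request(req)

print(prepared.headers['User-Agent'])  # demo-crawler/0.1
print(prepared.headers.get('Cookie'))  # token=abc123
```

Setting the headers once on the session also removes the need to pass headers=headers on every call.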

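Handling untrusted SSL certificates with requests
An SSL certificate is like a business license: it vouches for the site's identity.
To request a site whose certificate is not trusted, pass verify=False to the request method. This turns certificate verification off, so use it only for sites you already trust.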

import requests
url = 'https://inv-veri.chinatax.gov.cn/index.html'
resp = requests.get(url,verify=False)
print(resp.content.decode('utf-8'))
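
With verify=False, urllib3 emits an InsecureRequestWarning on each HTTPS request. One possible way to keep the output clean and avoid repeating verify=False everywhere is sketched below; make_unverified_session is a hypothetical helper name, and 'ca_bundle.pem' is a placeholder path.

```python
import requests
import urllib3

def make_unverified_session():
    """Return a Session that skips certificate verification (testing only)."""
    # Silence the InsecureRequestWarning emitted for unverified HTTPS requests
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    sess = requests.Session()
    sess.verify = False  # default for requests made through this session
    return sess

sess = make_unverified_session()
print(sess.verify)  # False

# Safer alternative when the site's CA certificate is available
# ('ca_bundle.pem' is a placeholder path):
# resp = requests.get(url, verify='ca_bundle.pem')
```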

Mr_Little_li
