In the previous post we used the urllib library to fetch page source; today we introduce the more user-friendly requests library.
import requests
'''
response=requests.get("https://baidu.com/")
print(response.text)#way 1 to get the page source
print(response.content.decode("utf-8"))#way 2 to get the page source
Note: response.content is the raw, undecoded response body, so it is of type bytes, while response.text is decoded with an encoding that requests guesses on its own; the guess can produce mojibake, in which case you should use response.content.decode("utf-8") instead.
Basic attributes:
print(response.url) #the final URL of the request
print(response.encoding) #the character encoding taken from the response headers
print(response.status_code) #the HTTP status code
'''
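The bytes/str distinction above can be seen offline; a minimal sketch that simulates what requests does internally, without any network request:

```python
# Simulate the content/text distinction without touching the network.
raw = "深圳".encode("utf-8")     # response.content holds raw bytes like these
guessed = raw.decode("latin-1")  # decoding with a wrong encoding guess gives mojibake
explicit = raw.decode("utf-8")   # explicit decoding recovers the original text

print(type(raw))   # <class 'bytes'>
print(explicit)    # 深圳
```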
#GET requests pass parameters with params, POST requests with data
Note: requests can be called like this:
1.requests.get("https://baidu.com/")
2.requests.post("https://baidu.com/")
Example 1: fetching the source of a Baidu search for "深圳" (GET request):
params={"wd":"深圳"}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"}
response=requests.get("https://baidu.com/s",params=params,headers=headers)
with open("baidu.html","w",encoding="utf-8") as fp:#save to a local file
    fp.write(response.content.decode("utf-8"))
print(response.url)
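How params get encoded into the URL can be checked without sending anything, by preparing the request first (requests.Request(...).prepare() builds the request but does not send it):

```python
import requests

# Build (but do not send) the request, then inspect the encoded URL.
req = requests.Request("GET", "https://baidu.com/s", params={"wd": "深圳"}).prepare()
print(req.url)  # https://baidu.com/s?wd=%E6%B7%B1%E5%9C%B3
```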
Example 2: fetching the source of a Lagou search for "python" (POST request):
import requests
data={"first":"true",
      "pn":"1",
      "kd":"python"}
headers={"Referer":"https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"}
response=requests.post("https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false",data=data,headers=headers)
print(response.json())#response.json() parses a JSON response body into a dict or a list.
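What response.json() returns can be illustrated with the standard library's json module, which does the same parsing. The body below is made up for illustration; the real positionAjax.json structure may differ:

```python
import json

# A made-up JSON body, shaped loosely like a job-search API response.
body = '{"success": true, "content": {"positionResult": {"totalCount": 100}}}'
data = json.loads(body)  # response.json() does essentially this to the body
print(data["content"]["positionResult"]["totalCount"])  # 100
```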
Using a proxy IP with requests
import requests
proxy={"http":"114.226.246.144:9999"}#the key must be the lowercase scheme name
response=requests.get("http://httpbin.org/ip",proxies=proxy)
print(response.text)
#In my own tests the proxy never worked; after confirming the code was correct, I concluded the free proxy was to blame, so try several proxy IPs or use a paid one.
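Since free proxies fail so often, it helps to try a whole list of them in turn. A hypothetical helper sketch (the function name and retry policy are my own, not part of requests):

```python
import requests

def get_via_proxies(url, proxy_list, timeout=5):
    """Try each proxy in turn; return the first successful body, or None."""
    for addr in proxy_list:
        proxies = {"http": addr, "https": addr}
        try:
            resp = requests.get(url, proxies=proxies, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue  # this proxy is dead or too slow; try the next one
    return None
```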
Handling cookies with requests
import requests
response=requests.get("https://baidu.com/")
print(response.cookies)#the cookies the server returned
print(response.cookies.get_dict())#the same cookies as a dict
To share cookies across multiple requests, use a Session:
url="http://www.renren.com/"
data={"email":"135*********","password":"***08***"}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"}
session=requests.Session()
session.post(url,data=data,headers=headers)
response=session.get("http://www.renren.com/973687886/profile")
with open("renren.html","w",encoding="utf-8") as fp:
    fp.write(response.text)
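How a Session keeps cookies between calls can be checked offline by putting one into its cookie jar by hand:

```python
import requests

session = requests.Session()
# Put a cookie into the jar manually; every later session.get/post sends it back.
session.cookies.set("token", "abc123")
print(session.cookies.get_dict())  # {'token': 'abc123'}
```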
Handling untrusted SSL certificates (when some pages cannot be loaded otherwise):
response=requests.get("http://*******",verify=False)
print(response.content.decode("utf-8"))
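Note that verify=False makes urllib3 (the library underneath requests) emit an InsecureRequestWarning on every request; it can be silenced, though only do this deliberately:

```python
import urllib3

# Suppress the InsecureRequestWarning that verify=False triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```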