Python crawler learning 17
-
Advanced usage part 2
-
Session maintenance
# Previously we learned to simulate page requests with the POST and GET methods. These two requests are independent of each other, as if two different browsers had opened two different pages.
# Because of this, when a crawler logs in to a website with POST and then tries to fetch the personal-information page with GET, it obviously won't get the information we want. How do we solve this?
# Method 1: pass the same cookies parameter in both requests.
# Method 2: use a Session object to maintain the session.
Example:
import requests

r0 = requests.get('https://www.httpbin.org/cookies/set/number/123456789')
# after setting the cookie and getting a successful response, request the site again
print(r0.text)
r1 = requests.get('https://www.httpbin.org/cookies')
# notice that the cookies field in the response is empty
print(r1.text)
Run result:
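Method 1 above (passing the same cookies in both requests) can be sketched without touching the network: requests lets us prepare a request and inspect the headers it would send. The cookie name and value below mirror the httpbin example; nothing is actually transmitted.

```python
import requests

# build the request, passing the cookies parameter explicitly
req = requests.Request('GET', 'https://www.httpbin.org/cookies',
                       cookies={'number': '123456789'})
prepared = req.prepare()

# the cookies dict is serialized into the Cookie request header
print(prepared.headers['Cookie'])  # → number=123456789
```

With Method 1 you would have to repeat this cookies argument on every single request, which is exactly the bookkeeping a Session removes.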
Using the Session object:
# session maintenance
import requests

s = requests.Session()
r0 = s.get('https://www.httpbin.org/cookies/set/number/123456789')
r1 = s.get('https://www.httpbin.org/cookies')
print(r0.text)
print(r1.text)
Run result:
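Why does the Session "remember"? Cookies stored on the session are merged into every request it prepares. A minimal offline sketch (no request is actually sent; the cookie value is the one from the example above):

```python
import requests

s = requests.Session()
# simulate a cookie the server would have set on a previous response
s.cookies.set('number', '123456789')

# prepare_request() applies the session's stored cookies to the new request
prepared = s.prepare_request(
    requests.Request('GET', 'https://www.httpbin.org/cookies'))
print(prepared.headers.get('Cookie'))  # → number=123456789
```

This is why the second `s.get()` in the example carries the cookie automatically: the session's cookie jar is merged in before the request goes out.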
-
SSL certificate verification
# Many websites now require the HTTPS protocol, but some of them may not have configured their HTTPS certificate properly, or their certificate may not be trusted by a CA. These sites then show an SSL certificate error.
# For example, when we visit this website: https://ssr2.scrape.center/
# we get the following prompt
Let's use the requests library to request such a website:
import requests

resp = requests.get('https://ssr2.scrape.center/')
print(resp.status_code)
# huh? What happened? Why won't it let us in?
Run result: no result... it raises an SSLError instead
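Rather than letting the SSLError crash the crawler, we can catch it. A minimal sketch; the timeout value and the fallback branch for other network failures are my own additions, not part of the original example:

```python
import requests
from requests.exceptions import SSLError, RequestException

outcome = None
try:
    resp = requests.get('https://ssr2.scrape.center/', timeout=10)
    outcome = resp.status_code
except SSLError:
    # certificate verification failed, as in the run above
    outcome = 'ssl-error'
except RequestException:
    # any other requests failure (no network, DNS error, timeout, ...)
    outcome = 'request-failed'

print(outcome)
```

Note that `SSLError` is a subclass of `RequestException`, so the more specific handler must come first.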
Set the verify parameter to skip verification
import requests

# when the verify parameter is True (the default), the certificate is verified
# automatically; when it is False, no verification is performed
resp = requests.get('https://ssr2.scrape.center/', verify=False)
print(resp.status_code)
In this way, the status code can be obtained:
But we find that the program still emits a warning, advising us to add certificate verification
Disable warnings to suppress the message
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
resp = requests.get('https://ssr2.scrape.center/', verify=False)
print(resp.status_code)
Run result:
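urllib3.disable_warnings() silences the warning globally for the whole program. If you only want it off in one section, the standard warnings module offers a scoped alternative; in this sketch we trigger the warning by hand instead of making a real unverified request:

```python
import warnings
from urllib3.exceptions import InsecureRequestWarning

# scoped alternative to urllib3.disable_warnings(): the filter only
# applies inside this with-block, then the previous state is restored
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('ignore', InsecureRequestWarning)
    # simulate the warning an unverified HTTPS request would emit
    warnings.warn('Unverified HTTPS request is being made', InsecureRequestWarning)

print(len(caught))  # → 0, the warning was filtered out
```

Outside the with-block, InsecureRequestWarning behaves normally again, which is friendlier than muting it for the entire crawler.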
Specify a certificate to avoid the warning
To specify a certificate, you need the .crt and .key files locally and must pass their paths; note that the private key must be decrypted, since requests does not support encrypted keys.
Don't look at me like that; I don't have the certificate files, so I'll skip this step.
# format
import requests

resp = requests.get('https://ssr2.scrape.center/',
                    cert=('path/**.crt', 'path/**.key'))
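A related point: besides cert for client certificates, the verify parameter also accepts the path to a CA bundle instead of True/False. The bundle requests uses by default is shipped by the certifi package (the printed path depends on your installation):

```python
import os
import certifi

# certifi.where() returns the CA bundle path requests uses when verify=True
bundle = certifi.where()
print(bundle)
print(os.path.exists(bundle))  # → True

# passing this path is equivalent to verify=True; point verify at your own
# bundle instead to trust a self-signed or internal CA
# resp = requests.get('https://ssr2.scrape.center/', verify=bundle)
```

This is the cleaner long-term fix for sites with an internal CA: add the CA to a bundle once instead of sprinkling verify=False everywhere.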
-
That's all for today, to be continued...