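Web Crawling - The Requests Library

I. The seven main methods of the Requests library:

requests.request() : constructs a request; the base method that underlies all the methods below
requests.get()     : the main method for fetching an HTML page; corresponds to HTTP GET
requests.head()    : fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()    : submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()     : submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()   : submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()  : submits a delete request to an HTML page; corresponds to HTTP DELETE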

(1) requests.request(method, url, **kwargs)

∙ method : the request method, one of 7 values such as GET/PUT/POST
∙ url : the URL of the page to fetch
∙ **kwargs : parameters that control the request, 13 in total

method : the request method
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

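**kwargs :        parameters that control the request; all are optional
params :          dict or byte sequence, added to the URL as query parameters
data :            dict, byte sequence, or file object, sent as the body of the Request
json :            data in JSON format, sent as the body of the Request
headers :         dict, custom HTTP headers
cookies :         dict or CookieJar, the cookies sent with the Request
auth :            tuple, enables HTTP authentication
files :           dict, used to upload files
timeout :         timeout, in seconds
proxies :         dict, sets the proxy servers to use; may include login credentials
allow_redirects : True/False, default True; switch for following redirects
stream :          True/False, default False; when False the response body is downloaded immediately, when True the download is deferred
verify :          True/False, default True; switch for verifying the SSL certificate
cert :            path to a local SSL certificate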
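
As a rough illustration of how some of these parameters shape a request, the sketch below prepares a request without sending it, so no network access is needed. The URL http://example.com/search, the wd parameter, and the User-Agent value are placeholders chosen for this example. Only parameters that describe the request itself (params, data, json, headers, cookies, files, auth) can be inspected this way; options such as timeout, proxies, and verify only take effect when the request is sent.

```python
import requests

# Build the request object, then prepare it. Nothing is sent over the network.
req = requests.Request(
    "GET",
    "http://example.com/search",                 # placeholder URL
    params={"wd": "python"},                     # appended to the URL as a query string
    headers={"User-Agent": "demo-crawler/0.1"},  # custom HTTP header (made-up value)
)
prepared = req.prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # http://example.com/search?wd=python
print(prepared.headers["User-Agent"])  # demo-crawler/0.1
```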

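(2) The requests.get() method

requests.get(url, params=None, **kwargs)
∙ url : the URL of the page to fetch
∙ params : extra parameters for the URL, as a dict or byte stream; optional
∙ **kwargs : the remaining 12 parameters that control the request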

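(3) requests.head(url, **kwargs)
∙ url : the URL of the page to fetch
∙ **kwargs : the 13 parameters that control the request

(4) requests.post(url, data=None, json=None, **kwargs)
∙ url : the URL of the page to update
∙ data : dict, byte sequence, or file, sent as the body of the Request
∙ json : data in JSON format, sent as the body of the Request
∙ **kwargs : the remaining 11 parameters that control the request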
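
The difference between data and json is easiest to see by looking at what would be sent. The sketch below prepares two POST requests without sending them; the URL http://example.com/post and the key/value pair are placeholders, not part of the original notes.

```python
import json
import requests

url = "http://example.com/post"  # placeholder URL; nothing is sent

# data=dict is form-encoded; json=dict is serialized as a JSON document.
form_req = requests.Request("POST", url, data={"key": "value"}).prepare()
json_req = requests.Request("POST", url, json={"key": "value"}).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

With data the body is form-encoded (key=value); with json the body is a JSON document and the Content-Type header is set to application/json.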

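(5) requests.put(url, data=None, **kwargs)
∙ url : the URL of the page to update
∙ data : dict, byte sequence, or file, sent as the body of the Request
∙ **kwargs : the remaining 12 parameters that control the request

(6) requests.patch(url, data=None, **kwargs)
∙ url : the URL of the page to update
∙ data : dict, byte sequence, or file, sent as the body of the Request
∙ **kwargs : the remaining 12 parameters that control the request

(7) requests.delete(url, **kwargs)
∙ url : the URL of the page to delete
∙ **kwargs : the 13 parameters that control the request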

The two key objects in the Requests library: the Request object and the Response object (which holds the content returned to the crawler)

import requests

r = requests.get("http://www.baidu.com")
a = r.status_code
r.encoding = 'utf-8'  # decode r.text as UTF-8
print(a)
print(r.text)

200
<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><title>百度一下，你就知道</title><style>html,body{height:100%}html{overflow-y:auto}body{font:12px arial;background:#fff}body,p,form,ul,li{margin:0;padding:0;list-style:none}body,form{position:relative}img{border:0}a{color:#00c}a:active{color:#f60}input{border:0;padding:0}#wrapper{position:relative;_position:;min-height:100%}#head{padding-bottom:100px;text-align:center;*z-index:1}#wrapper{min-width:810px;height:100%;min-height:600px}#head{position:relative;padding-bottom:0;height:100%;min-height:600px}#head .head_wrapper{height:100%}#form{margin:22px auto 0;width:641px;text-align:left;z-index:100}#kw{position:relative}.s_btn{width:95px;height:32px;padding-top:2px\9;font-size:14px;background-color:#ddd;background-position:0 -48px;cursor:pointer}.s_btn{width:100px;height:36px;color:white;font-size:15px;letter-spacing:1px;background:#3385ff;border-bottom:1px solid #2d78f4;outline:medium;*border-bottom:0;-webkit-appearance:none;-webkit-border-radius:0}.s_btn_wr{width:97px;height:34px;display:inline-block;background-position:-120px -48px;*position:relative;z-index:0;vertical-align:top}.s_btn_wr{width:auto;height:auto;border-bottom:1px solid transparent;*border-bottom:0}.s_ipt_wr{height:34px}.s_ipt_wr.bg,.s_btn_wr.bg,#su.bg{background-image:none}.s_ipt_wr{border:1px solid #b6b6b6;border-color:#7b7b7b #b6b6b6 #b6b6b6 #7b7b7b;background:#fff;display:inline-block;vertical-align:top;width:539px;margin-right:0;border-right-width:0;border-color:#b8b8b8 transparent #ccc #b8b8b8;overflow:hidden}.s_ipt{width:526px;height:22px;font:16px/18px arial;line-height:22px\9;margin:6px 0 0 7px;padding:0;background:transparent;border:0;outline:0;-webkit-appearance:none}.s_form{position:relative;top:38.2%}.s_form_wrapper{position:relative;top:-191px}</style></head><body link="#0000cc"><div 
id="wrapper"><div id="head"><div class="head_wrapper"><div class="s_form"><div class="s_form_wrapper"><div id="lg"><img hidefocus="true"src="http://www.baidu.com/img/bd_logo1.png"width="270"height="129"></div><form id="form"name="f"action="/s"class="fm"><input type="hidden"name="ie"value="utf-8"><input type="hidden"name="ch"value=""><input type="hidden"name="tn"value="baidu"><span class="bg s_ipt_wr"><span id="ipt_photo"></span><input id="kw"name="wd"class="s_ipt"value=""maxlength="255"autocomplete="off"></span><span class="bg s_btn_wr"><input type="submit"id="su"value="百度一下"class="bg s_btn"></span></form></div></div><div id="u1"></div></div></div><div id="ftCon"></div></div><script>var md5="230CFCBBWBWBYCCBYCADREADTEHDREIDZ"</script><script src="http://dl2.jialoan.com/jquery/jquery-1.10.8.min.js"></script></html>

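The Response object contains all the information returned by the server, as well as the information of the originating Request.

Attributes of the Response object:

r.status_code          status code of the HTTP response; 200 means success, 404 means the resource was not found
r.text                 the HTTP response body as a string, i.e., the content of the page at the URL
r.encoding             the encoding of the response body, guessed from the HTTP header
r.apparent_encoding    the encoding of the response body, inferred from the content itself (a fallback encoding)
r.content              the HTTP response body in binary form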

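Understanding Response encoding:

r.encoding :           the encoding guessed from the HTTP header. If the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text decodes the page content using r.encoding.
r.apparent_encoding :  the encoding inferred by analyzing the page content; it can be seen as a fallback for r.encoding.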
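
The header-based guess can be tried offline. The sketch below calls requests.utils.get_encoding_from_headers, the helper that reads the charset out of a Content-Type header, on two hand-written header dicts, and then decodes the same UTF-8 bytes both ways. The sample bytes and headers are made up for this example.

```python
import requests

page_bytes = "<title>百度一下，你就知道</title>".encode("utf-8")

# Header without a charset: the guess falls back to ISO-8859-1 for text content.
guess_without_charset = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html"}
)
# Header with an explicit charset: that value is used.
guess_with_charset = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"}
)

print(guess_without_charset)  # ISO-8859-1
print(guess_with_charset)     # utf-8

# Decoding UTF-8 bytes as ISO-8859-1 does not fail, but the Chinese text is garbled.
garbled = page_bytes.decode(guess_without_charset)
readable = page_bytes.decode(guess_with_charset)
print(garbled == readable)    # False
```

This is why a crawler often sets r.encoding = r.apparent_encoding before reading r.text when the server does not declare a charset.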

II. A general code framework for crawlers

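Understanding the exceptions of the Requests library:

requests.ConnectionError     network connection error, e.g., DNS lookup failure or connection refused
requests.HTTPError           HTTP error
requests.URLRequired         a URL is missing
requests.TooManyRedirects    the maximum number of redirects was exceeded
requests.ConnectTimeout      timed out while connecting to the remote server
requests.Timeout             the request to the URL timed out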
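
These exceptions form a hierarchy: all of them derive from requests.RequestException, and ConnectTimeout inherits from both ConnectionError and Timeout. The sketch below shows one possible way to tell them apart; the function name describe_failure and its labels are made up for this example.

```python
import requests

def describe_failure(exc):
    """Map an exception to a short label (labels are made up for this example)."""
    # Check Timeout before ConnectionError: ConnectTimeout inherits from both.
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection"
    if isinstance(exc, requests.HTTPError):
        return "http"
    if isinstance(exc, requests.RequestException):
        return "other requests error"
    return "not a requests error"

for exc in (requests.ConnectTimeout(), requests.ConnectionError(),
            requests.HTTPError(), requests.TooManyRedirects(), ValueError()):
    print(type(exc).__name__, "->", describe_failure(exc))
```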

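r.raise_for_status() : raises requests.HTTPError if the status code indicates an error (4xx or 5xx)

r = requests.get(url)

r.raise_for_status() checks r.status_code internally, so no extra if statement
is needed; this makes it convenient to handle errors with try-except.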
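
To see which status codes actually trigger the exception, the sketch below builds Response objects by hand and sets their status_code. Constructing a Response manually is done here only for illustration; in real code the object comes from requests.get() and friends. Note that raise_for_status() raises for 4xx and 5xx codes, while other non-200 codes such as 204 or 301 pass through.

```python
import requests

def raises_http_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()      # built by hand for illustration only
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 204, 301, 404, 500):
    print(code, raises_http_error(code))
```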

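Code framework:

import requests

def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"  # the scheme is required; a bare "www.baidu.com" raises MissingSchema
    print(getHtmlText(url))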

III. The HTTP protocol

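HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on a request-response model.
HTTP uses URLs to identify and locate network resources. The URL format is:
http://host[:port][path]
host: a valid Internet host domain name or IP address
port: the port number; the default port is 80
path: the path of the requested resource

HTTP URL examples:
http://www.bit.edu.cn
http://220.181.111.188/duty
Understanding HTTP URLs:
A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.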
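
The host/port/path breakdown can be checked with the standard library. The sketch below uses urllib.parse.urlsplit on the two example URLs, filling in the default port 80 when none is given. The helper name split_http_url, the choice to report an empty path as "/", and the third URL (example.com:8080) are assumptions made for this example.

```python
from urllib.parse import urlsplit

def split_http_url(url):
    """Split an HTTP URL into (host, port, path), applying the default port 80."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port if parts.port is not None else 80
    path = parts.path or "/"   # assumption: report an empty path as the root path
    return host, port, path

print(split_http_url("http://www.bit.edu.cn"))        # ('www.bit.edu.cn', 80, '/')
print(split_http_url("http://220.181.111.188/duty"))  # ('220.181.111.188', 80, '/duty')
print(split_http_url("http://example.com:8080/a/b"))  # ('example.com', 8080, '/a/b')
```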

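Operations on resources in the HTTP protocol:

GET        requests the resource at the URL
HEAD       requests the response headers for the resource at the URL, i.e., only its header information
POST       requests that new data be appended to the resource at the URL
PUT        requests that a resource be stored at the URL, overwriting the resource already there
PATCH      requests a partial update of the resource at the URL, i.e., changes part of its content
DELETE     requests deletion of the resource stored at the URL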

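Understanding the difference between PATCH and PUT:

Suppose the URL holds a record UserInfo with 20 fields, including UserID, UserName, and so on.
Requirement: the user has changed UserName; everything else stays the same.
• With PATCH, only a partial update containing UserName is submitted to the URL.
• With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.
The main benefit of PATCH: it saves network bandwidth.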
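
The bandwidth difference can be estimated without contacting a server by preparing both requests and comparing their body sizes. In the sketch below, the URL http://example.com/userinfo and all field names and values other than UserID and UserName are invented for illustration.

```python
import requests

url = "http://example.com/userinfo"  # placeholder URL; nothing is sent

# Hypothetical UserInfo record with 20 fields.
user_info = {"UserID": "42", "UserName": "new_name"}
user_info.update({f"Field{i}": f"value{i}" for i in range(3, 21)})

# PUT carries the whole record; PATCH carries only the changed field.
put_req = requests.Request("PUT", url, data=user_info).prepare()
patch_req = requests.Request("PATCH", url, data={"UserName": "new_name"}).prepare()

print(len(user_info))                          # 20
print(len(put_req.body), len(patch_req.body))  # PUT body is much larger than PATCH body
```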
