Python Web Scraping: The urllib Library

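1. Introduction to urllib

urllib is Python's built-in request library and is enough for simple page fetching. Note that Python 2 had two libraries for sending requests, urllib and urllib2, whereas Python 3 has only urllib. Since Python 3 is now the norm, knowing urllib is sufficient. Looking at the Python source, the urllib package contains five modules: request, error, parse, robotparser, and response. After going through some references I found that robotparser and response are rarely mentioned, so I have only looked into the other three.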

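2. The request module

As the name suggests, request is used to send requests, and we can set parameters to make a request look like it came from a browser. Note that request here is a submodule of urllib and should not be confused with the separate third-party library requests. Before writing this post I had intended to read the request module's source closely, but on opening it I found 2,700+ lines of code and gave up.

The request module sends requests mainly through urlopen() and Request(), and it also provides a number of Handler classes. The code in this section demonstrates the first two; specific usage is explained in the code comments.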
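The Handler classes mentioned above are not covered by the demos in this section. As a rough sketch only, here is one way a handler can be plugged in: an opener is built with request.build_opener() and an HTTPCookieProcessor so that cookies would be kept between requests. The names jar, opener, and handler_names, and the shortened User-Agent string, are illustrative choices rather than anything from the original post, and nothing is sent over the network here.

```python
from http import cookiejar
from urllib import request

# Illustrative names: "jar" and "opener" are not from the original post.
jar = cookiejar.CookieJar()

# build_opener() returns an OpenerDirector with the default handlers
# plus the ones passed in; HTTPCookieProcessor stores cookies in "jar".
opener = request.build_opener(request.HTTPCookieProcessor(jar))

# Headers listed here are added to every request made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# Inspect which handlers ended up in the opener (no request is sent).
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Calling opener.open(url) would then send a request through these handlers, and request.install_opener(opener) makes plain urlopen() use the same opener.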

urlopen() demo:

from urllib import request
from urllib import parse
from urllib import error
import socket

if __name__ == '__main__':
    '''
    def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                *, cafile=None, capath=None, cadefault=False, context=None):

    Parameters:
    url: the URL to request
    data: optional; if supplied, dict data must first be converted to bytes, and the request method changes from GET to POST
    timeout: optional; timeout in seconds; an exception is raised if the request times out
    The remaining parameters configure certificates and SSL; the defaults are fine
    '''
    # A simple request
    response_1 = request.urlopen(url="http://www.baidu.com")  # returns an HTTPResponse object
    print(response_1.read().decode("utf-8"))  # that completes a simple request
    print("Status code:", response_1.status)
    print("Response headers:", response_1.getheaders())
    print("---------------------------------- divider ----------------------------------")

    # A more complex request
    form = {"name": "Tom"}  # renamed from "dict" to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    try:
        response_2 = request.urlopen(url="http://www.httpbin.org/post", data=data, timeout=10)
        # Read the response inside the try block so it is never used when the request failed
        print(response_2.read().decode('utf-8'))
    except error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print("The request timed out")
        else:
            print("The request failed:", e.reason)
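The timeout handling above can be wrapped in a small reusable function. The sketch below is one possible shape, not part of urllib: fetch is a hypothetical helper name, and returning None on failure is an arbitrary design choice. It also catches socket.timeout separately, because a timeout that occurs while the body is being read is not wrapped in URLError.

```python
import socket
from urllib import error, request


def fetch(url, data=None, timeout=10):
    """Hypothetical helper: return the decoded body, or None on failure."""
    try:
        # The "with" block closes the response when we are done with it.
        with request.urlopen(url, data=data, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except error.URLError as e:
        # Connection-level problems, including connect timeouts, arrive here.
        if isinstance(e.reason, socket.timeout):
            print("The request timed out")
        else:
            print("The request failed:", e.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print("The request timed out while reading the response")
    return None
```

With the data value built above, fetch("http://www.httpbin.org/post", data=data) would perform the same POST request as the demo.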

Using Request to construct a request

from urllib import request, parse

if __name__ == '__main__':
    """
    Request is a class; its constructor arguments let you build a more fully featured request.

    def __init__(self, url,
                 data=None, headers={},
                 origin_req_host=None,
                 unverifiable=False,
                 method=None):

    url: the URL to request
    data: optional; if supplied, dict data must first be converted to bytes
    headers: optional; a dict. Setting User-Agent makes the request look like a browser, which helps get past basic anti-scraping checks
    origin_req_host: optional; the host name or IP address of the original request
    unverifiable: optional; whether the request is unverifiable
    method: optional; the request method, such as GET, POST, or PUT
    """
    form = {"name": "Tom"}  # renamed from "dict" to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
    }  # pose as the Chrome browser
    req = request.Request(url="http://www.httpbin.org/post", data=data, headers=headers, method="POST")
    response = request.urlopen(req)
    print(response.read().decode("utf-8"))
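To see what the data, headers, and method parameters actually do, a Request object can be inspected without sending it. This is only a sketch: the variable names and the shortened User-Agent string are illustrative, and the same httpbin URL from the demo is reused for all three objects purely for convenience.

```python
from urllib import parse, request

form = {"name": "Tom"}
data = parse.urlencode(form).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = "http://www.httpbin.org/post"

no_data_req = request.Request(url, headers=headers)
post_req = request.Request(url, data=data, headers=headers)
put_req = request.Request(url, data=data, headers=headers, method="PUT")

# Without data the method defaults to GET; with data it defaults to POST;
# an explicit method overrides both defaults.
print(no_data_req.get_method())  # GET
print(post_req.get_method())     # POST
print(put_req.get_method())      # PUT

# The body is the URL-encoded form as bytes.
print(post_req.data)             # b'name=Tom'

# Request stores header names capitalized, so the key is "User-agent".
print(post_req.get_header("User-agent"))
```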

3. The error module

The error module defines two main exception classes: URLError and HTTPError. HTTPError is a subclass of URLError.

from urllib import request, error

if __name__ == '__main__':
    try:
        # Try to open a website that does not exist
        response_1 = request.urlopen(
    except error.URLError as e:
        print(e.reason)

    try:
        # The request results in an HTTP error
        response_2 = request.urlopen("http://www.baidu.com/aaa.html")
    except error.HTTPError as e:
        print(e.reason)
        # 404 means the page does not exist; 500 means a server-side error
        print(e.code)
        print(e.headers)
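Because HTTPError is a subclass of URLError, code that handles both has to check for HTTPError first, otherwise the URLError branch swallows it. The sketch below illustrates this without touching the network by constructing the exceptions by hand; describe is a hypothetical helper, and the 404 error is simulated rather than returned by a real server.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Hypothetical helper: summarize a urllib exception in one line."""
    if isinstance(exc, error.HTTPError):  # check the subclass first
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        return f"URL error: {exc.reason}"
    raise TypeError("not a urllib exception")


# A simulated 404, built by hand instead of coming from a real response.
simulated = error.HTTPError(
    "http://www.baidu.com/aaa.html", 404, "Not Found", Message(), io.BytesIO(b"")
)

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(simulated))                          # HTTP error 404: Not Found
print(describe(error.URLError("no route to host")))  # URL error: no route to host
```

In a try statement the same rule applies: put except error.HTTPError before except error.URLError.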

4. The parse module

urlparse(): parses a URL string into its components

from urllib import parse

if __name__ == '__main__':
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#锚点"
    result = parse.urlparse(url=url)
    print(result)
    # Output:
    # ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='param1', query='ie=UTF-8&wd=python', fragment='锚点')

urlunparse(): the inverse of urlparse(). Pass it a list (or any iterable) of six items, in the same order as the fields of the urlparse() result.
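As a quick illustration of that round trip, the sketch below parses the URL from the example, rebuilds it with urlunparse(), and then assembles the same URL from a plain six-item list. The variable names are illustrative.

```python
from urllib import parse

url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#锚点"

# Parse, then reassemble: the result should match the original string.
parts = parse.urlparse(url)
rebuilt = parse.urlunparse(parts)
print(rebuilt == url)  # True

# The same URL built from a six-item list:
# scheme, netloc, path, params, query, fragment
from_list = parse.urlunparse(
    ["https", "www.baidu.com", "/s", "param1", "ie=UTF-8&wd=python", "锚点"]
)
print(from_list)
```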

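urlsplit() and urlunsplit(): essentially the same as the two functions above, except that params is not split out and stays part of path, so the result has five fields instead of six.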

from urllib import parse

if __name__ == '__main__':
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#锚点"
    result = parse.urlsplit(url=url)
    print(result)
    # Output:
    # SplitResult(scheme='https', netloc='www.baidu.com', path='/s;param1', query='ie=UTF-8&wd=python', fragment='锚点')

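The module's other functions, such as urljoin(), urlencode(), parse_qs(), and quote(), serve similar purposes: they all parse or build parts of a URL.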
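A brief sketch of a few of those other functions follows. The relative path "/about.html" is a made-up example, and the variable names are illustrative.

```python
from urllib import parse

# urljoin(): resolve a relative link against a base URL.
joined = parse.urljoin("https://www.baidu.com/s?wd=python", "/about.html")
print(joined)  # https://www.baidu.com/about.html

# urlencode() and parse_qs(): convert between a dict and a query string.
query = parse.urlencode({"ie": "UTF-8", "wd": "python"})
print(query)                  # ie=UTF-8&wd=python
print(parse.parse_qs(query))  # {'ie': ['UTF-8'], 'wd': ['python']}

# quote() and unquote(): percent-encode and decode non-ASCII text.
quoted = parse.quote("锚点")
print(quoted)
print(parse.unquote(quoted))
```

Note that parse_qs() returns a list for every key, because a query string may repeat the same parameter.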
