[Python Crawler] Using urllib (2)

Preface

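This is the second article in a series on web crawler basics. The previous article, [Python Crawler] Getting to Know Crawlers (1), introduced crawlers and the background knowledge they require. This article covers how to use urllib.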

What is urllib?

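urllib is part of Python's standard library, so it needs no installation and can be used directly. It provides the following features:
Sending web requests
Reading responses
Proxy and cookie configuration
Exception handling
URL parsing
Almost everything a crawler needs can be found in urllib. Learning this standard library also gives a deeper understanding of the more convenient requests library covered later.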

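My crawler environment is based on Python 3.x. Before going further, here is a short recap of how urllib differs between Python 2.x and Python 3.x.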

Python 2.x provides
urllib and urllib2
Both urllib and urllib2 are built in. HTTP requests are made mainly with urllib2, with urllib in a supporting role.

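Python 3.x provides
urllib (a single package split into urllib.request, urllib.error, urllib.parse, and urllib.robotparser)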

Changes:

import urllib2 in Python 2.x becomes import urllib.request and import urllib.error in Python 3.x.

import urllib in Python 2.x becomes import urllib.request, urllib.error, and urllib.parse in Python 3.x.

import urlparse in Python 2.x becomes import urllib.parse in Python 3.x.

urllib2.urlopen in Python 2.x becomes urllib.request.urlopen in Python 3.x.

urllib.urlencode in Python 2.x becomes urllib.parse.urlencode in Python 3.x.

urllib.quote in Python 2.x becomes urllib.parse.quote in Python 3.x.

cookielib.CookieJar in Python 2.x becomes http.cookiejar.CookieJar in Python 3.x.

urllib2.Request in Python 2.x becomes urllib.request.Request in Python 3.x.
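As a quick check of the mapping above, the following sketch imports each name from its Python 3.x location and records which module hosts it. It assumes a Python 3 interpreter and sends nothing over the network; the dictionary name py3_locations is only illustrative.

```python
# Python 3 locations of the names listed above.
from urllib.request import urlopen, Request
from urllib.parse import urlparse, urlencode, quote
from urllib.error import URLError, HTTPError
from http.cookiejar import CookieJar

# Map each name to the module that defines it in Python 3.
py3_locations = {
    obj.__name__: obj.__module__
    for obj in (urlopen, Request, urlparse, urlencode, quote,
                URLError, HTTPError, CookieJar)
}

for name, module in sorted(py3_locations.items()):
    print('%-10s -> %s' % (name, module))
```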

Details

urlopen

Syntax of urlopen:

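urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# url: the address to request (a URL string or a Request object)
# data: the request body as bytes, for example URL-encoded form data; headers are set on a Request object, not here
# cafile, capath, cadefault: deprecated since Python 3.6; use context instead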

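In practice urlopen is usually called with three arguments:
urlopen(url, data, timeout)

The first argument, url, is required. The second, data, is the data to send with the request, and the third, timeout, sets the timeout in seconds. The last two are optional.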
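To illustrate the role of data without touching the network, the sketch below builds Request objects and asks which HTTP method urllib would use for each. The URL is a placeholder chosen for illustration; no request is sent.

```python
from urllib import parse, request

url = 'http://httpbin.org/anything'  # placeholder; nothing is sent to it

# Without data, urllib issues a GET request.
get_req = request.Request(url)

# With data (which must be bytes), the request becomes a POST.
body = parse.urlencode({'word': 'hello'}).encode('utf-8')
post_req = request.Request(url, data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'word=hello'
```

The same rule applies when data is passed straight to urlopen: the presence of a body is what switches the method from GET to POST.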

Usage

# Request: GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# Request: POST
# HTTP testing service: http://httpbin.org/
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# Timeout setting (in seconds)
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

The code above covers the two most common and important request methods, GET and POST. Each is described in more detail below.

GET
The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

POST
To send a POST request, pass the data argument as bytes.

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in type
form_data = {
    'name': 'Germey'
}
# The request body must be bytes, so encode the URL-encoded form
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

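Handling request headers

To make a crawler behave like a browser, the request has to be disguised as one. The usual approach is to inspect the requests a real browser sends and then copy its request headers. The User-Agent header is the one that identifies the browser. The following example shows how to add headers.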

from urllib import request, parse

url = 'http://httpbin.org/post'
# Headers copied from a browser request; User-Agent identifies the client
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
# Pass the headers dictionary when constructing the Request
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
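Headers can also be attached one at a time after the Request has been created. The following is a minimal sketch using Request.add_header; the header values are placeholders, and nothing is sent over the network.

```python
from urllib import request

req = request.Request('http://httpbin.org/get')  # the request is only built, not sent

# add_header sets one header at a time on an existing Request.
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')
req.add_header('Accept-Language', 'en-US,en;q=0.8')

# Request stores header names capitalized ("User-agent"),
# so they have to be looked up in that form.
print(req.get_header('User-agent'))
print(req.has_header('Accept-language'))
print(req.header_items())
```

This form is handy when headers depend on runtime conditions, for example when the User-Agent is picked from a list for each request.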

Having covered requests, let's look at the response.

# Response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
# Status code and response headers
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)  # status code
print(response.getheaders())  # all response headers
print(response.getheader('Server'))  # one specific response header
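The same response attributes can be tried without depending on an external site. The sketch below starts a throwaway HTTP server on localhost and reads the status, a header, and the body from it. The HelloHandler class and its response text are made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b'hello urllib'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet

# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler ignores any proxy configured in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader('Content-Type')
    text = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(status, content_type, text)
```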

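The comments in the code above describe how to read each part of the response. To handle more demanding scenarios such as proxies and cookies, we need handlers.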

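Handler: dealing with more complex pages

Advantages of using a proxy:
The server no longer sees every request coming from the same client, and our real address is not exposed. The following example adds a proxy.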

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

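However, sending frequent requests through a single proxy can still get the client flagged as a crawler. In that case a pool of proxy IPs makes the crawler more robust.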
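One simple way to use such a pool is to rotate through it in round-robin order and build a fresh opener for each request. The sketch below assumes a hard-coded list of hypothetical local proxy addresses; PROXY_POOL and next_opener are illustrative names, and a real pool would also need health checks and removal of dead proxies.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you actually control.
PROXY_POOL = [
    'http://127.0.0.1:9743',
    'http://127.0.0.1:9744',
    'http://127.0.0.1:9745',
]
_rotation = itertools.cycle(PROXY_POOL)

def next_opener():
    """Return (proxy, opener) for the next proxy in round-robin order."""
    proxy = next(_rotation)
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return proxy, urllib.request.build_opener(handler)

# Each call hands out the next proxy; no request is sent here.
used = [next_opener()[0] for _ in range(4)]
print(used)
```

A caller would then do something like proxy, opener = next_opener() followed by opener.open(url, timeout=5), and move on to the next proxy when the request fails.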

Cookie

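A cookie is data that a website stores on the user's local device to identify the user and track the session (often encoded or encrypted). Some sites, for example, only show certain pages after login, so fetching them before logging in is not allowed. With urllib we can store the cookie obtained at login and then fetch the other pages with it.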

import http.cookiejar, urllib.request
# Keep the cookies in a variable
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# Save the cookies to a text file
import http.cookiejar, urllib.request
filename = "cookie.txt"
# Several file formats are available; pick one
## Format 1: Mozilla/Netscape cookies.txt
# cookie = http.cookiejar.MozillaCookieJar(filename)
## Format 2: libwww-perl (LWP), which matches the loader used below
cookie = http.cookiejar.LWPCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# Write the jar to disk; without this call no file is created
cookie.save(ignore_discard=True, ignore_expires=True)

# Load the file with the matching jar class
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
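The save and load steps can be tried offline as well. The sketch below builds a session cookie by hand, writes it with LWPCookieJar, and reads it back into a fresh jar. The cookie name, value, and domain are invented for illustration, and the make_cookie helper is hypothetical; in a real crawl the server sets the cookies.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar

def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets it)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

# Save: discard=True marks a session cookie, which is skipped
# unless ignore_discard=True is passed.
jar = LWPCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar of the same class.
restored = LWPCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in restored]
print(pairs)
```

The ignore_discard flag matters in practice: login cookies are often session cookies, and without it they are silently left out of the saved file.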

Exception handling

Exception handling catches errors so that the program keeps running reliably. The following examples show how to use it.

# Request a page that does not exist and print the reason for the failure
from urllib import request, error
try:
    response = request.urlopen('http://www.baibai.com/index.htm')
except error.URLError as e:
    print(e.reason)

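The code above requests a page that does not exist and prints the reason for the failure. The next example catches more specific exceptions.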

# Catch specific exceptions
from urllib import request, error

try:
    response = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')
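The order of the except clauses matters because HTTPError is a subclass of URLError. The sketch below shows one way to classify these exceptions in a helper. The describe function is a hypothetical name, and the exceptions are constructed by hand for illustration rather than raised by a real request.

```python
import socket
from urllib import error

def describe(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        # For timeouts, reason holds a socket.timeout instance.
        if isinstance(exc.reason, socket.timeout):
            return 'timed out'
        return 'failed to reach server: %s' % exc.reason
    return 'unexpected error: %r' % exc

# Exceptions built by hand for illustration; urlopen normally raises them.
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
timed_out = error.URLError(socket.timeout('timed out'))
refused = error.URLError('connection refused')

for exc in (not_found, timed_out, refused):
    print(describe(exc))
```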

URL parsing

urlparse: splits a URL into its components

from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
print(result)
## ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')
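The parsed result exposes each component as an attribute, and the query string can be decoded further with parse_qs. The sketch below uses a hypothetical URL that fills in every component so that each field has something to show.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL with port, params, query string, and fragment.
result = urlparse('https://www.example.com:8080/search/list;type=a?q=python&page=2#top')

print(result.scheme)    # https
print(result.netloc)    # www.example.com:8080
print(result.hostname)  # www.example.com
print(result.port)      # 8080
print(result.path)      # /search/list
print(result.params)    # type=a
print(result.fragment)  # top

# parse_qs turns the query string into a dict of lists.
query = parse_qs(result.query)
print(query)            # {'q': ['python'], 'page': ['2']}
```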

urlunparse: assembles a URL from its components; the inverse of urlparse

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=7','comment']
print(urlunparse(data))
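Because urlunparse reverses urlparse, the two can be combined to change a single component of an existing URL. A minimal sketch, relying on the fact that the parse result is a named tuple and therefore supports _replace:

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=7#comment'
parts = urlparse(url)

# Unparsing the parsed result gives back the original URL.
round_trip = urlunparse(parts)

# _replace returns a modified copy of the named tuple.
changed = urlunparse(parts._replace(query='a=8', fragment=''))

print(round_trip)  # http://www.baidu.com/index.html;user?a=7#comment
print(changed)     # http://www.baidu.com/index.html;user?a=8
```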

urlencode: converts a dictionary into the query string of a GET request

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
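urlencode also takes care of escaping, which matters as soon as values contain spaces, repeated keys, or non-ASCII text. The sketch below compares it with quote and quote_plus from the same module; the sample values are arbitrary.

```python
from urllib.parse import urlencode, quote, quote_plus, parse_qs

# urlencode escapes each key and value; spaces become '+'.
simple = urlencode({'q': 'hello world', 'lang': 'zh-CN'})
print(simple)      # q=hello+world&lang=zh-CN

# doseq=True expands a list value into repeated keys.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
print(repeated)    # tag=a&tag=b

# quote escapes a single string for use in a path; '/' is kept by default.
path = quote('/wiki/web crawler')
print(path)        # /wiki/web%20crawler

# quote_plus escapes '/' as well and writes spaces as '+'.
value = quote_plus('/wiki/web crawler')
print(value)       # %2Fwiki%2Fweb+crawler

# Non-ASCII text is percent-encoded as UTF-8, and parse_qs reverses it.
encoded = urlencode({'wd': '爬虫'})
decoded = parse_qs(encoded)
print(encoded, decoded)
```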

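Closing notes:
There is a lot of material here, and a general understanding is enough for now. After all, the more convenient requests library is still waiting for us.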
