Python Web Crawlers (Part 1): Using Urllib

Introduction to Crawlers

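A web crawler is a program that fetches web pages. Fetching is its basic operation, but the term usually covers two parts: fetching pages and parsing the data in them.
A crawler discovers pages by following links. Starting from one page of a site (usually the home page), it reads the page content, extracts the links it contains, and follows those links to the next pages, repeating the process until every page of the site has been fetched.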
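
The link-following loop described above can be sketched as a breadth-first traversal. This is only a minimal sketch, not a production crawler: `PAGES` is a hypothetical in-memory stand-in for a website so the example runs without network access, and `LinkParser` and `crawl` are names invented here for illustration.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML body (stands in for real HTTP fetches).
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/">home</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch):
    """Visit pages breadth-first from `start`; `fetch` maps a URL to HTML or None."""
    seen = {start}          # URLs already queued, so no page is fetched twice
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:    # unknown page: skip it
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", PAGES.get)
print(visited)
```

The `seen` set is what stops the loop on sites whose pages link back to each other. A real crawler would replace `PAGES.get` with a function that downloads the page.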

Why Crawlers Matter

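Crawlers can be used for many tasks, for example:

Crawling static pages
Analyzing and pushing valuable data
Batch downloading of resources
Monitoring various kinds of data
Statistical prediction for social computing
Building corpora for machine translation
Building training sets for machine learning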

The Urllib Library

Urllib provides the basic operations needed for crawling in Python. It is enough for simple data requests and for fetching page content.

A Simple Request

import urllib.request

response=urllib.request.urlopen('https://www.baidu.com')
print(response.read().decode('utf-8'))
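
The request above assumes the page is UTF-8 and never closes the response. As a hedged variation, the sketch below wraps the same call in a helper that sets a timeout, closes the response with `with`, and decodes using the charset the server declares, falling back to UTF-8. `fetch_text` and its parameters are names invented for this illustration.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch `url` and decode the body using the charset the response declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() reads the charset from the Content-Type header, if any
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset, errors="replace")


# Example call (requires network access):
# print(fetch_text('https://www.baidu.com')[:200])
```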

A Request with Parameters

import urllib.request as url_req
import urllib.parse as url_pa

data=url_pa.urlencode({'query':'ai'})
url='https://www.sogou.com/web?'

request = url_req.Request(url)
# The parameters must be passed as bytes; supplying data turns the request into a POST
response = url_req.urlopen(request,data.encode('utf-8'))
print(response.read().decode('utf-8'))
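
Whether the parameters end up in the URL or in the request body depends on how they are handed to `Request`. The sketch below builds both forms without sending anything, so the difference can be inspected offline. Whether a given site accepts the POST form is an assumption that has to be checked against that site.

```python
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'query': 'ai'})   # 'query=ai'

# GET: the encoded parameters become part of the URL.
get_req = urllib.request.Request('https://www.sogou.com/web?' + params)

# POST: the encoded parameters travel in the request body as bytes.
post_req = urllib.request.Request('https://www.sogou.com/web',
                                  data=params.encode('utf-8'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```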

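Requests with Custom Headers

import os
import time
import urllib.request

headers = {
    'Host': 'www.budejie.com',
    'Referer': 'http://www.budejie.com',
    # HTTP header names use hyphens: 'User-Agent', not 'User_agent'
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

# Make sure the output directory exists before writing into it
os.makedirs('get', exist_ok=True)

for i in range(1, 6):
    req = urllib.request.Request('http://www.budejie.com/' + str(i), headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    with open('get/' + str(i) + '.html', 'w', encoding='utf-8') as f:
        f.write(html)
    # Pause between requests to avoid hammering the server
    time.sleep(3)
    print('Page %d, length %d' % (i, len(html)))
    # Use the page just fetched as the Referer of the next request
    headers['Referer'] = 'http://www.budejie.com/' + str(i)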
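
Header names matter here: HTTP uses hyphens, so a key spelled `User_agent` is sent as a different, unrelated header. The sketch below shows how a `Request` stores headers, without sending anything. The short `Mozilla/5.0` value is a placeholder chosen for this illustration, not a realistic browser string.

```python
import urllib.request

req = urllib.request.Request('http://www.budejie.com/1')
req.add_header('User-Agent', 'Mozilla/5.0')   # shortened placeholder UA string
req.add_header('Referer', 'http://www.budejie.com')

# Request stores names via str.capitalize(), so 'User-Agent' is kept as 'User-agent'.
print(req.header_items())
print(req.get_header('User-agent'))
print(req.has_header('User_agent'))   # the underscore spelling is a different name
```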

Sending Requests Through a Proxy

import urllib.request

# ProxyHandler maps a URL scheme to a proxy address, normally written as host:port
proxy={'http':'119.29.12.129'}
proxy_handler=urllib.request.ProxyHandler(proxy)
# While we are at it, turn on debug-level logging for HTTP traffic
http_handler=urllib.request.HTTPHandler(debuglevel=1)
opener=urllib.request.build_opener(proxy_handler, http_handler)
# Only an 'http' proxy is configured, so request an http:// URL
req = urllib.request.Request('http://www.baidu.com')
response=opener.open(req)
print(len(response.read().decode('utf-8')))
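
Proxies are looked up by URL scheme, so an `'http'` entry is not used for `https://` URLs. The sketch below builds an opener and checks its configuration without making a request. The address `127.0.0.1:8080` is a hypothetical placeholder, not a working proxy.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you actually control.
PROXY = {'http': 'http://127.0.0.1:8080'}

proxy_handler = urllib.request.ProxyHandler(PROXY)
opener = urllib.request.build_opener(proxy_handler)

# Optional: make every later urllib.request.urlopen() call use this opener.
# urllib.request.install_opener(opener)

# Confirm that the opener really carries the proxy handler.
has_proxy = any(isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers)
print(proxy_handler.proxies, has_proxy)
```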

Handling Request Exceptions

Two kinds of exceptions are common when making requests: URLError and HTTPError.

urllib.error.URLError

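URLError is usually related to the environment. Typical causes are:

No network connection; the machine is offline
The server cannot be reached
The server does not exist, or the domain name cannot be resolved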

urllib.error.HTTPError

HTTPError is a subclass of URLError. It is raised when the request does receive a response from the server, but the status code indicates an error.
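
The relationship between the two classes can be checked directly. In the sketch below both exceptions are constructed by hand, so no network is needed. The `example.com` URL and the messages are made up for illustration.

```python
import urllib.error

# HTTPError derives from URLError, so an `except URLError` clause also catches it.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))

# Build the exceptions by hand (no network) to inspect their attributes.
http_err = urllib.error.HTTPError('http://example.com/missing', 404, 'Not Found', {}, None)
url_err = urllib.error.URLError('name resolution failed')

print(http_err.code, http_err.reason)   # HTTPError carries a status code and a reason
print(url_err.reason)                   # URLError carries only a reason
print(hasattr(url_err, 'code'))
```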

Exception Handling Logic

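Because the two classes are parent and child, the subclass should be caught first, as is conventional. Otherwise the URLError clause would also catch every HTTPError. The handling logic looks like this: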

import urllib.request
import urllib.error
req = urllib.request.Request('http://www.douyu.com/Jack_Cui.html')
try:
    print(urllib.request.urlopen(req).read().decode('utf-8'))
except urllib.error.HTTPError as e:
    print(e.reason)
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('ok')
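
The same pattern can be wrapped in a reusable helper. This is a sketch under stated assumptions: `try_fetch`, its `open_func` parameter and `fake_404` are names invented here. Passing the opener in as a parameter is a design choice that lets the error paths be exercised without network access.

```python
import urllib.error
import urllib.request


def try_fetch(url, open_func=urllib.request.urlopen):
    """Return (ok, detail); `open_func` is injectable so the logic can be tested offline."""
    try:
        with open_func(url) as response:
            return True, response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status code.
        return False, 'HTTP %d: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable response at all (DNS failure, refused connection, ...).
        return False, 'URL error: %s' % e.reason


def fake_404(url):
    """Hypothetical stand-in for urlopen that always fails with a 404."""
    raise urllib.error.HTTPError(url, 404, 'Not Found', {}, None)


print(try_fetch('http://example.com/missing', open_func=fake_404))
```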