Web Scraping Basics: The urllib Module

Basic usage of urllib2 (urllib.request) and the urllib module

Note: urllib2 exists only in Python 2. In Python 3 it was split into urllib.request and urllib.error, so use urllib.request instead.

1.1 In Python 3, urlopen sends a GET request by default

import urllib.request

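# urlopen(): url is required; passing data makes it a POST request, omitting it makes it a GET request
response = urllib.request.urlopen("https://www.baidu.com/")
# Read the entire body returned by the server (as bytes)
response_data = response.read()
# Decode the bytes and print the returned data
print(response_data.decode("utf-8"))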
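
The example above hard-codes "utf-8" when decoding. As a minimal sketch, the charset can instead be taken from the response's Content-Type header, with UTF-8 as a fallback. The helper names decode_body and fetch_text are hypothetical and not part of urllib; the timeout value is also an assumption.

```python
from urllib.request import urlopen


def decode_body(body, charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the charset the server declared."""
    # errors="replace" keeps the call from raising on stray bytes
    return body.decode(charset or fallback, errors="replace")


def fetch_text(url, timeout=10):
    """Send a GET request and return (status code, decoded body)."""
    # The with-statement closes the connection when the block exits
    with urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when no charset is declared
        charset = response.headers.get_content_charset()
        return response.status, decode_body(response.read(), charset)
```

Calling fetch_text("https://www.baidu.com/") would then return the status code together with the page text, without the caller having to guess the encoding.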

The following code sends a POST request:

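import urllib.request
from urllib import parse

# POST data must be URL-encoded form fields converted to bytes, not a query string with a path prefix
form = {"ie": "utf-8", "f": "8", "rsv_bp": "0", "rsv_idx": "1", "tn": "baidu", "wd": "尚硅谷"}
data = parse.urlencode(form).encode("utf-8")
# urlopen(): url is required; passing data makes it a POST request, omitting it makes it a GET request
response = urllib.request.urlopen("http://www.baidu.com/", data=data)
# Read the entire body returned by the server (as bytes)
response_data = response.read()
# Decode the bytes and print the returned data
print(response_data.decode("utf-8"))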
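
Whether data is present is what decides the HTTP method. This can be checked without touching the network by building Request objects and asking them which method they would use. The sketch below assumes a single form field, wd, chosen only for illustration.

```python
from urllib import parse
from urllib.request import Request

url = "http://www.baidu.com/"

# No data: the request will be sent as GET
get_request = Request(url)

# With data: the form is URL-encoded, converted to bytes, and sent as POST
form_bytes = parse.urlencode({"wd": "尚硅谷"}).encode("utf-8")
post_request = Request(url, data=form_bytes)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'wd=%E5%B0%9A%E7%A1%85%E8%B0%B7'
```

Note that urlencode() percent-encodes the non-ASCII value, which is what a form submission is expected to carry.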

1.2. Request: wrapping request header information

User-Agent: the first step in getting past anti-scraping checks

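In the first example, the argument to urlopen() is simply a URL.

To do anything more complex, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL to visit is passed as an argument to the Request instance.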

from urllib.request import Request,urlopen

header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = Request("http://www.baidu.com",headers=header)
# Pass the Request object to urlopen(); no data is supplied, so this is a GET request
response = urlopen(request)
# Read the entire body returned by the server (as bytes)
response_data = response.read()
# Decode the bytes and print the returned data
print(response_data.decode("utf-8"))

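1.3 Adding a random User-Agent

The second step in getting past anti-scraping checks

The goal is to imitate different clients, so that the server treats the requests as coming from different users and is less likely to block the IP.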

import random
from urllib.request import Request,urlopen

url = "http://www.atguigu.com"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

user_agent = random.choice(ua_list)
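# Wrap the request information
request = Request(url)
# Request.add_header() can also be used to add or modify a specific header
request.add_header("User-Agent", user_agent)
# When reading it back, only the first letter is capitalized; the rest is lowercase
request_info = request.get_header("User-agent")
print("request_info==",request_info)
# Open the connection and request the data
response = urlopen(request)
# Read the entire returned body
html = response.read()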
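
The steps above can be wrapped in a small helper so that every request gets a freshly chosen User-Agent. This is a sketch only: build_request and UA_POOL are hypothetical names, and the two complete User-Agent strings are borrowed from other examples in this tutorial rather than from the truncated list above.

```python
import random
from urllib.request import Request

# Illustrative pool; in practice this would hold many more complete strings
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
]


def build_request(url, ua_pool=UA_POOL):
    """Return a Request carrying a randomly chosen User-Agent header."""
    request = Request(url)
    request.add_header("User-Agent", random.choice(ua_pool))
    return request


request = build_request("http://www.atguigu.com")
# add_header() stores the name as "User-agent", so that spelling is used for lookup
print(request.get_header("User-agent"))
```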

1.4 URL-encoding parameters before requesting the path

from urllib import parse

if __name__ == "__main__":
    kw = input("Enter the name of the Tieba forum to scrape: ")
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    kw = {"kw": kw}
    # Convert the dict to a URL-encoded query string
    kw = parse.urlencode(kw)
    url = "https://tieba.baidu.com/f?"
    url = url + kw
    print(url)
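
The script above reads a start page and an end page but does not yet use them. One way to turn them into URLs is sketched below. It assumes that Tieba paginates with a pn query parameter that advances by 50 per page; that parameter and the helper name build_page_urls are assumptions, not something the code above confirms.

```python
from urllib import parse


def build_page_urls(kw, start_page, end_page, page_size=50):
    """Build one URL per page, from start_page to end_page inclusive."""
    base = "https://tieba.baidu.com/f?"
    urls = []
    for page in range(start_page, end_page + 1):
        # Assumed scheme: page 1 -> pn=0, page 2 -> pn=50, and so on
        query = parse.urlencode({"kw": kw, "pn": (page - 1) * page_size})
        urls.append(base + query)
    return urls


for page_url in build_page_urls("python", 1, 3):
    print(page_url)
```

Encoding kw and pn together in one urlencode() call keeps the separator and the escaping consistent for every page.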

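1.5 Accessing sites with CA certificate verification from code

When a site's SSL certificate cannot be verified, urllib.request raises an SSLError (wrapped in a URLError) on access.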

from urllib.request import Request,urlopen
# 1. Import Python's SSL module
import ssl
# 2. Create a context that skips verification of the SSL certificate
context = ssl._create_unverified_context()
# 3. The URL to request
url = "https://www.12306.cn/mormhweb/"
# 4. Request headers that imitate a browser
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# 5. Pass in the URL and the request headers
request = Request(url, headers = headers)
# 6. Send the request, skipping SSL verification
response = urlopen(request,context=context)

print(response.read().decode("utf-8"))
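
ssl._create_unverified_context() is a private function, as its leading underscore indicates. A comparable context can be built from public ssl attributes, as in the sketch below; make_unverified_context is a hypothetical helper name. Disabling verification removes protection against impersonated servers, so this is only suitable for sites you already trust.

```python
import ssl


def make_unverified_context():
    """Return an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


context = make_unverified_context()
print(context.verify_mode == ssl.CERT_NONE)
```

The resulting object can be passed to urlopen() through its context parameter in the same way as in the example above.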

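1.6. ProxyHandler (proxy settings)

Using proxy IPs (private proxies are recommended) is the third major technique in the contest between scrapers and anti-scraping measures, and usually the most effective one.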

· Xici free proxy IPs: http://www.xicidaili.com/

· Kuaidaili free proxies: https://www.kuaidaili.com/free/inha/

· Quanwang proxy IPs: http://http.zhiliandaili.com/

from urllib.request import build_opener, ProxyHandler, Request

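# Create the handler for a private (authenticated) proxy.
# username, password, and proxy-host are placeholders for your own proxy's credentials and address
proxy_handler = ProxyHandler({"http":"username:password@proxy-host:16819"})

# Create an opener object
opener = build_opener(proxy_handler)

# Create a Request object with the target URL
request = Request("http://www.baidu.com/")

# Send the request and get the response
response = opener.open(request)
# Print the response body
print(response.read().decode("utf-8"))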
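
Private proxy credentials sometimes contain characters such as "@" or ":" that would be misread as separators inside the proxy address. A minimal sketch of one way to handle this is to percent-encode the username and password before assembling the address. The helper build_proxy_url and the sample values are hypothetical, and the sketch assumes the proxy handler decodes the credentials again before sending them.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port):
    """Assemble a proxy address with percent-encoded credentials."""
    # safe="" makes quote() escape every reserved character, including "/" and ":"
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return "http://{}:{}@{}:{}".format(user, secret, host, port)


# Placeholder credentials and a local address, for illustration only
proxy_url = build_proxy_url("user", "p@ss:word", "127.0.0.1", 8080)
print(proxy_url)  # http://user:p%40ss%3Aword@127.0.0.1:8080
```

The resulting string can be used as the value in the dict given to ProxyHandler, for example {"http": proxy_url}.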

Choosing a proxy IP at random:

from urllib.request import ProxyHandler,Request,build_opener
import random

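# Each entry is a dict: the key is the scheme (http) and the value is the proxy address
# Even when no proxy server is used, a dict (an empty one) must still be passed to ProxyHandler
proxy_list = [
    {"http" : "122.72.18.35:80"},
    {"http" : "122.72.18.34:80"},
    {"http" : "203.174.112.13:3128"},
]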

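# Pick a proxy at random
proxy = random.choice(proxy_list)
print(proxy)
# Build a proxy handler from the chosen proxy
http_proxy_handler = ProxyHandler(proxy)

# A handler built from an empty dict uses no proxy
null_proxy_handler = ProxyHandler({})

proxy_switch = True
if proxy_switch:
    # Build an opener that goes through the proxy
    opener = build_opener(http_proxy_handler)
else:
    opener = build_opener(null_proxy_handler)
# Build the request object
request = Request("http://www.baidu.com/")
# Send the request and get the response
response = opener.open(request)

# Print the returned data (raw bytes)
print(response.read())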
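
Free proxies are often unreachable, so a single random choice may fail. The sketch below extends the idea by trying the proxies in random order until one succeeds. The names open_via_proxy and fetch_with_failover, the 5-second timeout, and the choice to treat any OSError as a dead proxy are all assumptions made for illustration.

```python
import random
from urllib.request import ProxyHandler, build_opener


def open_via_proxy(url, proxy, timeout):
    """Fetch url through one proxy dict such as {"http": "host:port"}."""
    opener = build_opener(ProxyHandler(proxy))
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(url, proxy_list, fetch=open_via_proxy, timeout=5):
    """Try the proxies in random order and return the first successful body."""
    last_error = None
    # random.sample() returns a shuffled copy and leaves proxy_list untouched
    for proxy in random.sample(proxy_list, len(proxy_list)):
        try:
            return fetch(url, proxy, timeout)
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            last_error = error
    raise RuntimeError("all proxies failed") from last_error
```

The fetch parameter exists so that the retry logic can be exercised with a stand-in function, without depending on any real proxy being online.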