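Basic Usage of the Python urllib Module

urllib: URL handling modules

urllib is a package that collects several modules for working with URLs.
It includes:
urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files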

urllib.request defines functions and classes that help with opening URLs (mostly HTTP)

Main functions of urllib.request

urllib.request.urlopen

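urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: either a string or a Request object
- data: an object specifying additional data to send to the server, or None. Currently HTTP requests are the only ones that use data. Supported object types include bytes, file-like objects, and iterables of bytes-like objects
- timeout: the timeout in seconds; if it is not given, the global default timeout is used. It only applies to HTTP, HTTPS, and FTP connections
- cafile, capath: specify a set of trusted CA certificates for HTTPS requests. Both are deprecated since version 3.6 in favor of context and were removed in Python 3.13. Use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates
- cadefault: this parameter is ignored
- context: if given, it must be an ssl.SSLContext instance describing the various SSL options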
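The function returns an object that can be used as a context manager and has methods such as:
- geturl() - returns the URL of the resource retrieved
- info() - returns the meta-information of the page, such as headers, in the form of an email.message_from_string() instance
- getcode() - returns the HTTP status code of the response
For HTTP and HTTPS URLs, this function returns a slightly modified http.client.HTTPResponse object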

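from urllib.request import urlopen

# The returned object is a context manager, so the connection is closed on exit
with urlopen("http://python.org/") as response:
    print(response.geturl())   # URL of the resource retrieved
    print(response.info())     # response headers
    print(response.getcode())  # HTTP status code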
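
The timeout and context parameters described above can be combined as in the sketch below. The helper name fetch and the 10-second default are assumptions made for illustration; they are not part of urllib.

```python
import ssl
from urllib.request import urlopen

# Context that trusts the system's CA certificates and verifies host names
context = ssl.create_default_context()


def fetch(url, timeout=10):
    """Hypothetical helper: open url with a timeout and an explicit SSL context.

    Returns a (status_code, body_bytes) tuple.
    """
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.getcode(), response.read()
```

Calling fetch("https://www.python.org/") needs network access, so no output is shown here.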

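urllib.request also provides the following main classes

Request: this class is an abstraction of a URL request

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- url: a string containing a valid URL
- data: an object specifying additional data to send to the server, or None
- headers: the headers to send with the request; must be a dictionary
- origin_req_host: the host name or IP address of the original requester
- unverifiable: whether the request is unverifiable, that is, whether the user had no option to approve the URL (for example, an image fetched automatically while rendering a page). Defaults to False
- method: the request method to use, such as GET, POST, or PUT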

from urllib import request

url = "http://python.org/"
# Headers sent along with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Connection': 'keep-alive',
}
req = request.Request(url, headers=headers)
# Read the raw bytes, then decode them to text
with request.urlopen(req) as response:
    page = response.read().decode('utf-8')
print(page)
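
The data and method parameters of Request can be sketched without sending anything over the network. The URL http://example.com/search, the form fields, and the demo-client user agent are placeholders assumed for illustration. urllib.parse.urlencode is used here to turn a dictionary into a form-encoded string before converting it to bytes.

```python
from urllib import parse, request

# Hypothetical form fields, encoded to bytes as the data parameter requires
form = {"q": "urllib", "page": "1"}
data = parse.urlencode(form).encode("utf-8")

# With data and no explicit method, the request method becomes POST
post_req = request.Request(
    "http://example.com/search",
    data=data,
    headers={"User-Agent": "demo-client/0.1"},
)
print(post_req.get_method())  # POST
print(post_req.data)          # b'q=urllib&page=1'

# The method parameter overrides the default choice
put_req = request.Request("http://example.com/search", data=data, method="PUT")
print(put_req.get_method())   # PUT

# Without data, the default method is GET
get_req = request.Request("http://example.com/search")
print(get_req.get_method())   # GET
```

Passing any of these objects to request.urlopen() would actually send the request.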

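urllib.error defines the exception classes raised by urllib.request

Exception classes it contains

urllib.error.URLError

The exception raised when urllib.request runs into a problem
reason attribute: information about the cause of the error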

from urllib import request
from urllib import error

url = (r"http://zheshigeshenmewangzhan.org/")
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Connection': 'keep-alive',
}
try:
    req = request.Request(url, headers=headers)
    request.urlopen(req)
except error.URLError as e:
    print(e.reason)

Output (the exact message varies by platform): [Errno 8] nodename nor servname provided, or not known

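urllib.error.HTTPError

A subclass of URLError. When URLError and HTTPError are caught together, the except clause for HTTPError should therefore come first; otherwise the URLError clause would catch it
It is better suited to handling HTTP errors
code attribute: the HTTP status code
reason attribute: the error message
headers attribute: the HTTP response headers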

from urllib import request
from urllib import error

if __name__ == "__main__":
    url = (r"http://www.sina.com/asd")
    try:
        response = request.urlopen(url)
    except error.HTTPError as e:
        print(e.reason)
        print(e.code)
        print(e.headers)

Output:
Not Found
404
Server: nginx
Date: Wed, 08 Aug 2018 14:44:57 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Age: 0
Via: http/1.1 ctc.ningbo.ha2ts4.74 (ApacheTrafficServer/6.2.1 [cMsSf ]), http/1.1 ctc.xiamen.ha2ts4.34 (ApacheTrafficServer/6.2.1 [cMsSf ])
X-Via-Edge: 15337394975367cf664713cd64cde3e31ef1b
X-Cache: MISS.34
X-Via-CDN: f=edge,s=ctc.xiamen.ha2ts4.35.nb.sinaedge.com,c=113.100.246.124;f=Edge,s=ctc.xiamen.ha2ts4.34,c=222.76.214.35;f=edge,s=ctc.ningbo.ha2ts4.71.nb.sinaedge.com,c=222.76.214.34;f=Edge,s=ctc.ningbo.ha2ts4.74,c=115.238.190.71
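
The ordering rule for the two exception classes can be sketched as follows. The function name fetch_status and its opener parameter are assumptions made for illustration; opener only exists so the error paths can be exercised without network access.

```python
from urllib import error, request


def fetch_status(url, opener=request.urlopen):
    """Hypothetical helper: describe the outcome of opening url."""
    try:
        with opener(url, timeout=10) as response:
            return f"ok {response.getcode()}"
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return f"http error {e.code}: {e.reason}"
    except error.URLError as e:
        # Reached for non-HTTP failures such as DNS or connection errors
        return f"url error: {e.reason}"


def fake_not_found(url, timeout=None):
    """Stand-in opener that simulates a 404 response."""
    raise error.HTTPError(url, 404, "Not Found", {}, None)


print(fetch_status("http://example.com/missing", fake_not_found))
# http error 404: Not Found
```

If the two except clauses were swapped, the URLError clause would catch the HTTPError and the status code would never be reported.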

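urllib.parse parses URLs into components

It covers two main features: URL parsing and URL quoting

Main functions

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Parses a URL into six components and returns a 6-item named tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment
urlstring: the URL to parse
scheme: the default scheme (http and so on), used only when the URL does not specify one
allow_fragments: if false, fragment identifiers are not recognized; instead they are parsed as part of the path, parameters, or query component, and fragment is empty in the return value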

>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
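
The scheme and allow_fragments parameters can be illustrated with a short sketch. The www.example.com URLs are placeholders assumed for illustration.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the default passed via scheme= is used
o = urlparse("//www.example.com/index.html#top", scheme="https")
print(o.scheme)    # https
print(o.netloc)    # www.example.com
print(o.fragment)  # top

# With allow_fragments=False the '#top' part stays in the path
p = urlparse("http://www.example.com/index.html#top", allow_fragments=False)
print(p.path)      # /index.html#top
print(p.fragment)  # (empty string)

# The result is a 6-item named tuple and can be unpacked like a tuple
scheme, netloc, path, params, query, fragment = p
```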

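urllib.parse.quote(string, safe='/', encoding=None, errors=None)

According to the standard, a URL may contain only a subset of ASCII characters (letters, digits, and some symbols). Other characters, such as Chinese characters, do not conform to the URL standard and therefore have to be encoded
quote never encodes letters, digits, and the characters _.-~ ; every other character is percent-encoded unless it is listed in safe
quote_plus (another function) goes one step further than quote: it also encodes / and replaces spaces with plus signs
safe: characters that should not be encoded; defaults to /
encoding: the encoding to use; defaults to utf-8
errors: defaults to 'strict', meaning that unsupported characters raise a UnicodeEncodeError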

from urllib.parse import quote

if __name__ == "__main__":
    # '*' is listed in safe, so it is left as-is; '?' and '=' are encoded
    print(quote("www.baidu.com?a=*", safe="*"))

Output:
www.baidu.com%3Fa%3D*
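
The difference between quote and quote_plus, and the effect of safe, can be seen side by side in the sketch below. The sample path is made up for illustration.

```python
from urllib.parse import quote, quote_plus

text = "/path with space/中文"

# Default safe='/' keeps slashes; spaces become %20 and non-ASCII text
# is encoded as UTF-8 bytes
quoted = quote(text)
print(quoted)       # /path%20with%20space/%E4%B8%AD%E6%96%87

# quote_plus encodes '/' as well and turns spaces into '+'
plus = quote_plus(text)
print(plus)         # %2Fpath+with+space%2F%E4%B8%AD%E6%96%87

# An empty safe string makes quote encode '/' too, but spaces stay %20
strict_quoted = quote(text, safe="")
print(strict_quoted)  # %2Fpath%20with%20space%2F%E4%B8%AD%E6%96%87
```

quote_plus is typically the one to reach for when building query strings, while quote suits the path part of a URL.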

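urllib.parse.unquote(string, encoding='utf-8', errors='replace')

Where there is encoding, there is naturally decoding
errors: defaults to 'replace', meaning that invalid sequences are replaced by a placeholder character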

from urllib.parse import unquote,quote

if __name__ == "__main__":
    print(quote("www.baidu.com?a=*"))
    print(unquote("www.baidu.com%3Fa%3D%2A"))

Output:
www.baidu.com%3Fa%3D%2A
www.baidu.com?a=*
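
The errors parameter of unquote only matters when the percent-encoded bytes are not valid in the chosen encoding. A minimal sketch, using %FF as an example of a byte that is never valid UTF-8:

```python
from urllib.parse import quote, unquote

# Valid UTF-8 sequences decode back to the original text
decoded = unquote("%E4%B8%AD%E6%96%87")
print(decoded)  # 中文

# With the default errors='replace', an invalid byte becomes U+FFFD
replaced = unquote("%FF")
print(replaced == "\ufffd")  # True

# With errors='strict', the same input raises UnicodeDecodeError
try:
    unquote("%FF", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True

# quote followed by unquote returns the original string
original = "www.baidu.com?a=*"
round_trip = unquote(quote(original))
```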

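urllib.robotparser parses robots.txt files

This module provides a single class, RobotFileParser, which answers questions about whether a particular user agent may fetch a URL on the website that published the robots.txt file

class urllib.robotparser.RobotFileParser(url='')

set_url(url) sets the URL referring to a robots.txt file
read() reads the robots.txt URL and feeds it to the parser
can_fetch(useragent, url) returns True if the given user agent is allowed to fetch url according to the rules in the parsed robots.txt file, and False otherwise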

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.douban.com/robots.txt')
rp.read()
url = 'https://www.douban.com'
user_agent = 'Wandoujia Spider'
can = rp.can_fetch(user_agent, url)
print(rp)
print(can)

Output:
User-agent: Wandoujia Spider
Disallow: /
False
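
The same check can be tried without network access by feeding rules to the parser directly through its parse() method, which takes the lines of a robots.txt file. The rules, the bot names, and the example.com URLs below are made up for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content, split into lines for parse()
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# BadBot is blocked from the whole site
blocked = rp.can_fetch("BadBot", "https://example.com/")
print(blocked)      # False

# Any other agent falls under the '*' entry
allowed = rp.can_fetch("OtherBot", "https://example.com/index.html")
print(allowed)      # True

private = rp.can_fetch("OtherBot", "https://example.com/private/data")
print(private)      # False
```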
