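Copyright notice: this is an original article by the blogger and may not be reposted without permission from the blogger's girlfriend. https://blog.csdn.net/qq_26442553/article/details/85223800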
The urllib.parse module parses, splits, and joins URLs. Its most commonly used functions are urllib.parse.urlparse, urllib.parse.urlunparse, urllib.parse.urljoin, and urllib.parse.urlencode.
1. Using urlparse()
urlparse() splits a URL into six parts and returns them as a ParseResult, a named tuple of six strings: scheme, netloc, path, params, query, and fragment. Its signature is:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
1.1. urlparse() with only the urlstring argument
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
'''Result:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com',
path='/index.html', params='user', query='id=5', fragment='comment')
'''
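As the output shows, scheme is the protocol, netloc is the network location (the domain, plus the port if there is one), path is the path on the server, params holds the parameters that follow ';', query is the query string that follows '?', and fragment is the anchor that follows '#'.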
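Because the return value is a named tuple, each part can be read by attribute name or by index. The short sketch below reuses the URL from the example above; the port number 8080 is added purely for illustration, and the `hostname` and `port` attributes come from the standard library's result object rather than from the original text.

```python
from urllib.parse import urlparse

# Port 8080 is a made-up value used only to show the port attribute.
result = urlparse('http://www.baidu.com:8080/index.html;user?id=5#comment')

# Attribute access and index access refer to the same fields.
scheme_by_name = result.scheme
scheme_by_index = result[0]
print(scheme_by_name, scheme_by_index)

# netloc keeps host and port together; hostname and port split them apart.
print(result.netloc)
print(result.hostname)
print(result.port)

# The result unpacks like an ordinary 6-tuple.
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)
```

Attribute access is usually easier to read than indexing, and it keeps working if you later switch to a related function such as `urlsplit`, whose result has a different number of fields.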
1.2. The scheme argument of urlparse(): a default protocol
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
'''The input URL has no scheme, so it is parsed with the default scheme https:
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')'''
If the input URL already includes a scheme, that scheme is used. Below, https is passed as the default, but the URL is still parsed as http:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
'''Result:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
'''
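One detail worth noting in the first example of this subsection: because the input has no leading `//`, the host ends up in `path` and `netloc` stays empty. A minimal sketch, assuming you control the input string, is to prefix it with `//` so the host is recognised as the network location. The variable names are illustrative only.

```python
from urllib.parse import urlparse

# No scheme and no '//': the host is treated as part of the path.
without_slashes = urlparse('www.baidu.com/index.html', scheme='https')
print(repr(without_slashes.netloc), without_slashes.path)

# With a leading '//', the text up to the next '/' is parsed as netloc.
with_slashes = urlparse('//www.baidu.com/index.html', scheme='https')
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)
```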
1.3. The allow_fragments argument of urlparse()
# Demo 1: with allow_fragments=False, the '#...' text stays in the query
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
'''Result:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
'''
# Demo 2: with no query present, the '#...' text stays in the path
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
'''Result:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
'''
2. urlunparse: the inverse of urlparse
# 1. Parse a URL with urlparse
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/s?wd=urlparse&rsv_spt=1&rsv_iqid=0x953bd4980021e01a&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&oq=urrparse&rsv_t=45167nYI8NDE6%2Bb1WvuUFOa44byBJFoinf0m87edhrxTkQZS9Miqh5laqUbkoGFI5ACl&inputT=3153&rsv_pq=8065196e001fc0c7&rsv_sug3=23&bs=urrparse')
print(type(result), result)
'''Parse result:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=urlparse&rsv_spt=1&rsv_iqid=0x953bd4980021e01a&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&oq=urrparse&rsv_t=45167nYI8NDE6%2Bb1WvuUFOa44byBJFoinf0m87edhrxTkQZS9Miqh5laqUbkoGFI5ACl&inputT=3153&rsv_pq=8065196e001fc0c7&rsv_sug3=23&bs=urrparse', fragment='')
'''
# 2. Rebuild the URL from the six parsed components with urlunparse
from urllib.parse import urlunparse
data = ['https', 'www.baidu.com', '/s', '', 'wd=urlparse&rsv_spt=1&rsv_iqid=0x953bd4980021e01a&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&oq=urrparse&rsv_t=45167nYI8NDE6%2Bb1WvuUFOa44byBJFoinf0m87edhrxTkQZS9Miqh5laqUbkoGFI5ACl&inputT=3153&rsv_pq=8065196e001fc0c7&rsv_sug3=23&bs=urrparse', '']
print(urlunparse(data))
'''urlunparse result:
https://www.baidu.com/s?wd=urlparse&rsv_spt=1&rsv_iqid=0x953bd4980021e01a&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&oq=urrparse&rsv_t=45167nYI8NDE6%2Bb1WvuUFOa44byBJFoinf0m87edhrxTkQZS9Miqh5laqUbkoGFI5ACl&inputT=3153&rsv_pq=8065196e001fc0c7&rsv_sug3=23&bs=urrparse
'''
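The round trip above can be checked directly, and since ParseResult is a named tuple, its `_replace` method offers one way to change a single component before rebuilding. The sketch below uses a shortened version of the search URL; the replacement query string is made up for illustration.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/s?wd=urlparse&ie=utf-8'
parts = urlparse(url)

# Reassembling the parsed parts gives back the original string here.
round_trip = urlunparse(parts)
print(round_trip == url)

# _replace returns a modified copy; the original tuple is left unchanged.
changed = parts._replace(query='wd=urljoin&ie=utf-8')
new_url = urlunparse(changed)
print(new_url)

# urlunparse expects exactly six items; a shorter sequence raises ValueError.
try:
    urlunparse(['https', 'www.baidu.com', '/s'])
    too_short_failed = False
except ValueError:
    too_short_failed = True
print(too_short_failed)
```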
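3. urljoin: combining a base URL with another URL
The first argument is the base URL and the second is the URL to resolve against it. Components present in the second URL take precedence; any scheme, netloc, or path it lacks is filled in from the base URL. In the examples below, the base URL's own query and fragment are not carried over.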
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
'''Result:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
'''
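The examples above mostly join a bare domain with a file name or a full URL. A common case when following links on a page is a relative path, so here is a small sketch of how urljoin resolves a few such forms. The base path `/a/b/c.html` and the file name `d.html` are made-up values for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b/c.html'

# A bare file name replaces the last segment of the base path.
sibling = urljoin(base, 'd.html')
# '../' climbs one directory before appending.
parent = urljoin(base, '../d.html')
# A leading '/' replaces the whole base path.
rooted = urljoin(base, '/d.html')
# A leading '//' keeps only the scheme of the base URL.
other_host = urljoin(base, '//cuiqingcai.com/FAQ.html')

for joined in (sibling, parent, rooted, other_host):
    print(joined)
```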
4. urlencode: converting a dict into GET request parameters
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
'''Result:
http://www.baidu.com?name=germey&age=22
'''
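urlencode also escapes characters that are not safe in a query string, and `parse_qs` from the same module can turn a query string back into a dict. The sketch below is only an illustration: the `wd` and `tag` keys and their values are made up, and `doseq=True` is used so that a list value becomes repeated keys.

```python
from urllib.parse import urlencode, parse_qs, urlparse

params = {'name': 'germey', 'wd': 'url parse', 'tag': ['a', 'b']}

# doseq=True expands a list value into repeated keys; spaces become '+'.
query = urlencode(params, doseq=True)
print(query)

base_url = 'http://www.baidu.com?'
url = base_url + query

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(urlparse(url).query)
print(decoded)
```

Note that numbers such as the `age` value in the original example come back from `parse_qs` as strings, so any type conversion has to be done by the caller.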