Python Web Crawling: Network Requests

Using Common Functions in the urllib Library

Installation
urllib is part of Python's standard library, so it needs no separate installation.
Usage

#!/usr/bin/python3
# -*- coding:utf-8 -*-
# @Time    : 2018-11-10 21:25
# @Author  : Manu
# @Site    : 
# @File    : urllib_lib.py
# @Software: PyCharm

from urllib import request
from urllib import parse

# urlopen
# Open a URL and read the response body (returned as bytes)
response = request.urlopen('http://www.baidu.com')
text = response.read()
print(text)

# urlretrieve
# Save a web page to a local file; returns a (filename, headers) tuple
csdnhtml = request.urlretrieve('https://blog.csdn.net/github_39655029', 'csdn.html')
print(csdnhtml)

# urlencode
# Convert a dictionary into URL-encoded data, as a browser does when it sends a request
data = {'team':'Spurs', 'Coach':'波波', 'age':'69'}
qs = parse.urlencode(data)
print(qs)

# parse_qs
# Parse a URL-encoded query string back into a dictionary
qs = 'age=69&team=Spurs&Coach=%E6%B3%A2%E6%B3%A2'
print(parse.parse_qs(qs))

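# urlparse & urlsplit
# Both split a URL into its components; the only difference is that urlparse's result has an extra params attribute
url = 'https://blog.csdn.net/github_39655029'
split_result = parse.urlsplit(url)
parse_result = parse.urlparse(url)
print(split_result)
print(parse_result)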

# The request.Request class
# Attach extra request headers, such as User-Agent
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
req = request.Request(url, headers=headers)
res = request.urlopen(req)
print(res.read().decode('utf-8'))

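# ProxyHandler
# Route requests through a proxy server so the target site is less likely to identify and block the crawler's IP
# Commonly used proxy providers
# Xici free proxy IPs: http://www.xicidaili.com
# Kuaidaili: http://www.kuaidaili.com
# Dailiyun: http://www.dailiyun.com
# How it works: the request goes to the proxy server first, the proxy then requests the target site, and once it has the target site's data it forwards that data back to us
# The dictionary key must match the scheme of the URL being requested ('http' for the URL below)
handler = request.ProxyHandler({'http':'223.145.212.16:8118'})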
opener = request.build_opener(handler)
req = request.Request('http://httpbin.org/ip')
resq = opener.open(req)
print(resq.read())
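
The URL helpers in the script above can be tried without any network access. The following is a minimal sketch rather than part of the original script: the host example.com, the ;type=a path parameter and the #top fragment are made up so that the difference between urlsplit and urlparse becomes visible.

```python
from urllib import parse

# Same dictionary as in the script above
params = {'team': 'Spurs', 'Coach': '波波', 'age': '69'}

# urlencode percent-encodes non-ASCII values as UTF-8 bytes
query = parse.urlencode(params)
print(query)

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse.parse_qs(query)
print(decoded)

# Hypothetical URL with a path parameter (";type=a") to show the difference
url = 'https://example.com/path;type=a?' + query + '#top'
split_result = parse.urlsplit(url)   # leaves ";type=a" inside the path
parsed_result = parse.urlparse(url)  # moves "type=a" into .params

print(split_result.path)
print(parsed_result.path, parsed_result.params)
```

Because parse_qs returns lists, a single value is read as decoded['team'][0]; this is what allows a query string to carry the same key more than once.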

Cookie

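Format: Set-Cookie: NAME=VALUE; Expires=DATE; Path=PATH; Domain=DOMAIN_NAME; Secure
- NAME: the cookie's name;
- VALUE: the cookie's value;
- Expires: when the cookie expires (Max-Age can be used instead to give a lifetime in seconds);
- Path: the URL path the cookie applies to;
- Domain: the domain the cookie applies to;
- Secure: whether the cookie is sent only over HTTPS;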
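
To see how these fields look once parsed, the standard library's http.cookies.SimpleCookie can read a header value of this shape. This is only an illustrative sketch: the cookie name sessionid, its value and the domain example.com are invented for the example.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value; Max-Age is used instead of Expires
header_value = 'sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure'

cookie = SimpleCookie()
cookie.load(header_value)

# Each cookie is stored as a Morsel: .value holds VALUE,
# and the attributes are read with lowercase keys
morsel = cookie['sessionid']
print(morsel.value)
print(morsel['max-age'])
print(morsel['path'])
print(morsel['domain'])
print(morsel['secure'])
```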
Usage

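import os
from urllib import request
from http.cookiejar import MozillaCookieJar
# Saving and loading cookies
cookiejar = MozillaCookieJar('cookie.txt')
# Load previously saved cookies; skipped on the first run, when cookie.txt does not exist yet
if os.path.exists('cookie.txt'):
    cookiejar.load(ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

# Send a request through the opener, then save the cookies it received
resp = opener.open('https://www.baidu.com')
cookiejar.save(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)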
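
The save and load cycle can also be tried offline by putting a hand-built cookie into the jar. The sketch below is an assumption-laden illustration, not the author's code: the helper make_cookie, the cookie token=abc123 and the domain example.com are all hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Build a cookie by hand; all values are made up for illustration."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # expires one hour from now
        discard=False,
        comment=None, comment_url=None,
        rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')

    # Save: write the jar to disk in the Mozilla cookies.txt format
    jar = MozillaCookieJar(path)
    jar.set_cookie(make_cookie('token', 'abc123', 'example.com'))
    jar.save(ignore_discard=True)

    # Load: a fresh jar reads the same file back
    restored = MozillaCookieJar(path)
    restored.load(ignore_discard=True)
    loaded = {c.name: c.value for c in restored}

print(loaded)
```

Passing ignore_discard=True matters mostly for session cookies, which have no expiry time and would otherwise be skipped on save and load.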

Basic Usage of the requests Library

Installation
Install it from the command line with pip:

pip install requests

Usage

import requests
# Send a GET request
kw = {'kw':'村雨'}
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
response = requests.get('http://www.baidu.com/s', params=kw, headers=headers)
# response.content is the raw body exactly as received: undecoded bytes
print(response.content.decode('utf-8'))
# response.text is response.content decoded to str using the encoding requests infers (response.encoding)
print(response.text)
# Status code
print(response.status_code)
# Character encoding inferred from the response headers
print(response.encoding)
# Full URL of the request
print(response.url)

with open('cunyu.html', 'w', encoding='utf-8') as cy:
    cy.write(response.content.decode('utf-8'))
# Inspect the cookies
print(response.cookies)

# Using a proxy
# proxy = {'http':'218.25.131.121:33859'}
# res = requests.get('http://www.baidu.com', headers=headers, proxies=proxy)
# print(res.text)

# Use a Session to share cookies across multiple requests
session = requests.Session()
session.post('http://www.baidu.com', headers=headers, data=kw)
res = session.get('http://www.baidu.com')
print('demo' + res.text)
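
The difference between response.content (bytes) and response.text (str) comes down to which encoding is used for decoding. The sketch below reproduces that without a network call; the variable raw is a stand-in for response.content, and the choice of GBK as the page's real encoding is an assumption made for illustration.

```python
# Stand-in for response.content: bytes of a page that is actually GBK-encoded
raw = '村雨'.encode('gbk')

# Decoding with the correct encoding recovers the original text
right = raw.decode('gbk')

# Decoding with the wrong encoding yields garbage; errors='replace'
# substitutes undecodable bytes instead of raising UnicodeDecodeError
wrong = raw.decode('utf-8', errors='replace')

print(right)
print(wrong)
print(right == wrong)
```

With requests, the same correction can be made by assigning the right codec name to response.encoding before reading response.text.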

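Summary

This post covered the network-request side of web crawling: how to use the urllib and requests libraries, and how cookies are structured, saved and reused. If you have better ideas or suggestions, feel free to leave a comment.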
