urllib is Python's built-in HTTP request library. It contains the following modules:
urllib.request — sends requests
urllib.error — exception handling
urllib.parse — URL parsing
urllib.robotparser — robots.txt parsing
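As a quick look at the parse module from the list above, urllib.parse can split a URL into components and encode a query string without any network access (a minimal sketch; the URLs and values are just examples):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into scheme, host, path, query, etc.
parts = urlparse('https://python.org/path?x=1')
print(parts.scheme)  # https
print(parts.netloc)  # python.org

# Encode a dict into a query string (the same call used later for POST data)
query = urlencode({'name': 'hello'})
print(query)  # name=hello
```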
- request
urllib.request.urlopen(url, data, timeout)
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
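The error module pairs with the request code above: urlopen raises urllib.error.HTTPError when the server returns an error status, and urllib.error.URLError for network-level failures. A sketch, assuming no real server is needed (the .invalid TLD is reserved, so the DNS lookup always fails):

```python
import urllib.request
import urllib.error

err = None
try:
    # .invalid is a reserved TLD, so this hostname never resolves
    urllib.request.urlopen('http://example.invalid/', timeout=5)
except urllib.error.HTTPError as e:
    # Server responded with an error status code (4xx/5xx)
    err = e
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    # Network-level failure: DNS lookup, refused connection, timeout
    err = e
    print('URL error:', e.reason)
```

Note that HTTPError is a subclass of URLError, so the more specific handler must come first.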
Passing the data parameter (POST request):
params = {'name': 'hello'}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen('http://baidu.com', data=data)
print(response.read())
When the data parameter is supplied, the request is sent as a POST; without data it is a GET request.
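This GET-vs-POST switch can be observed without sending anything: a Request object reports which method urlopen would use, depending on whether data is present (the URL below is a placeholder and is never contacted):

```python
import urllib.request
import urllib.parse

url = 'http://baidu.com'  # placeholder; no request is actually sent

# No data -> urlopen would issue a GET
get_req = urllib.request.Request(url)
print(get_req.get_method())  # GET

# With data -> urlopen would issue a POST
data = urllib.parse.urlencode({'name': 'hello'}).encode('utf8')
post_req = urllib.request.Request(url, data=data)
print(post_req.get_method())  # POST
```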
Adding request headers:
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}
url = 'http://www.baidu.com'
req = urllib.request.Request(url=url, headers=header)
req.add_header('key', 'value')
response = urllib.request.urlopen(req)
data = response.read().decode('utf-8')
print(data)
f = open('baidu.txt', 'a+', encoding='utf-8')
f.write(data)
f.close()
Using a proxy:
Crawling pages on the same site from a single IP will, sooner or later, get that IP blocked by the site's server, so requests can be routed through a proxy.
url = 'http://www.baidu.com'
proxy_addr = '12.17.171.129:8080'
proxy = urllib.request.ProxyHandler({'http': proxy_addr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read().decode('utf8')
print(data)
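The last module from the list above, urllib.robotparser, decides whether a crawler is allowed to fetch a URL. Normally set_url() and read() download a site's real robots.txt; this sketch feeds hand-written rules to parse() instead, so it runs offline (the rules and URLs are made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the lines of a robots.txt file directly
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# can_fetch(useragent, url) checks the rules for a given crawler
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False
```

A polite crawler checks can_fetch() before requesting each page.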