Python urllib module

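urllib is Python's built-in HTTP request library.

It includes the following modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL-parsing module

urllib.robotparser: the robots.txt-parsing module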
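As a rough illustration of the modules listed above that do not send requests, the sketch below uses urllib.parse to split a URL apart and to build a query string. The example.com URL and its query values are made up for illustration, and nothing here touches the network.

```python
from urllib.parse import urlparse, urlencode, parse_qs

# Hypothetical URL, used only for illustration
url = "https://example.com/search?q=urllib&page=2"

parts = urlparse(url)
print(parts.scheme)   # https
print(parts.netloc)   # example.com
print(parts.path)     # /search

# parse_qs turns the query string into a dict of lists
query = parse_qs(parts.query)
print(query)          # {'q': ['urllib'], 'page': ['2']}

# urlencode goes the other way: dict -> query string
encoded = urlencode({"q": "urllib", "page": 2})
print(encoded)        # q=urllib&page=2
```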

  1. request

urllib.request.urlopen(url, data, timeout)

import urllib.request

# Build a Request object, then pass it to urlopen
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
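The timeout argument and the urllib.error module fit together naturally. The following is a minimal sketch, not a definitive pattern: fetch_text is a hypothetical helper name, and the 10-second default is an arbitrary assumption.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The context manager closes the response when the block exits
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable answer, e.g. a DNS failure or a refused connection
        print("URL error:", e.reason)
    return None
```

Calling fetch_text('https://python.org') should return the page text when the site is reachable. HTTPError is caught first because it is a subclass of URLError.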

   Adding request data and headers

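import urllib.parse
import urllib.request

# Named params rather than dict, so the built-in type is not shadowed
params = {"name": "hello"}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen('http://baidu.com', data=data)
print(response.read())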

When the data argument is passed, the request is sent as a POST; without data, it is sent as a GET.
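The GET/POST rule can be checked without sending anything, because a Request object reports its own method through get_method(). A small sketch, reusing the URL and form field from the snippet above:

```python
import urllib.parse
import urllib.request

url = "http://baidu.com"

# No data: urllib treats the request as GET
get_req = urllib.request.Request(url)
print(get_req.get_method())   # GET

# With data (which must be bytes): the request becomes POST
payload = urllib.parse.urlencode({"name": "hello"}).encode("utf-8")
post_req = urllib.request.Request(url, data=payload)
print(post_req.get_method())  # POST
print(payload)                # b'name=hello'

# For a GET request, parameters go into the URL's query string instead
get_with_params = url + "?" + urllib.parse.urlencode({"name": "hello"})
print(get_with_params)        # http://baidu.com?name=hello
```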

import urllib.request

header = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}
url = 'http://www.baidu.com'

# Headers can be passed to the constructor or added afterwards
req = urllib.request.Request(url=url, headers=header)
req.add_header("key", "value")
response = urllib.request.urlopen(req)
data = response.read().decode('utf-8')
print(data)

# Append the page to a file; the with block closes the file automatically
with open('baidu.txt', 'a+', encoding='utf-8') as f:
    f.write(data)
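One detail worth knowing when inspecting headers set this way: Request normalizes header names with str.capitalize(), so lookups have to use that form. The sketch below shows this without sending a request; the shortened User-Agent value is just a stand-in.

```python
import urllib.request

req = urllib.request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0"},  # shortened stand-in value
)
req.add_header("key", "value")

# Names are stored as "User-agent" and "Key", not as originally typed
print(req.has_header("User-agent"))   # True
print(req.has_header("User-Agent"))   # False
print(req.get_header("Key"))          # value
print(req.header_items())
```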

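Using a proxy:

If you keep crawling pages on the same website from a single IP address, the site's server will eventually block that address.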

import urllib.request

url = 'http://www.baidu.com'
proxy_addr = '12.17.171.129:8080'

# Route http requests through the proxy
proxy = urllib.request.ProxyHandler({'http': proxy_addr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
# Install the opener globally so that urlopen uses it from now on
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read().decode('utf8')
print(data)
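Since install_opener changes the behaviour of every later urlopen call in the process, one possible alternative is to keep the opener local and call its open method directly. This is a sketch under assumptions: fetch_via_proxy is a hypothetical helper, the proxy address is the placeholder from the example above and is unlikely to be a live proxy, and the 10-second timeout is arbitrary.

```python
import urllib.request

proxy_addr = "12.17.171.129:8080"  # placeholder address, probably not a working proxy

proxy_handler = urllib.request.ProxyHandler({"http": proxy_addr})
# build_opener adds the default handlers (including HTTPHandler) by itself
opener = urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch url through the local opener; requires a reachable proxy."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a working proxy address, fetch_via_proxy('http://www.baidu.com') would go through the proxy while plain urlopen calls elsewhere stay unaffected.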


Reposted from blog.csdn.net/henku449141932/article/details/81188577