Web Scraping with urllib

1. Getting to know urllib

The urllib library contains the following modules:

  • urllib.request — opens and reads URLs
  • urllib.error — exceptions raised by urllib.request
  • urllib.parse — parses URLs (a quick sketch follows this list)
  • urllib.robotparser — parses robots.txt files
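
As a quick illustration of urllib.parse (a minimal sketch; the URL is only an example):

from urllib import parse

# Split a URL into its components
parts = parse.urlparse('http://httpbin.org/get?name=abc&lang=en')
print(parts.netloc)   # httpbin.org
print(parts.query)    # name=abc&lang=en

# Turn the query string into a dict
print(parse.parse_qs(parts.query))  # {'name': ['abc'], 'lang': ['en']}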

2. Scraping with urllib

2.1 A simple GET

Quick and dirty, and easy to get blocked:

from urllib import request

# Fetch the page directly
url = "http://httpbin.org/"
string = request.urlopen(url).read().decode('utf8')
print(string)

Add headers to disguise the request as a normal browser:

url = "http://httpbin.org/"
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
req = request.Request(url, headers=headers, method='GET')
string = request.urlopen(req).read().decode('utf8')
print(string)

Both approaches return the page's HTML source.
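
Requests fail in practice, so it is worth catching urllib.error exceptions as well. A minimal sketch (httpbin's /status/404 endpoint just forces an error response):

from urllib import request, error

try:
    resp = request.urlopen('http://httpbin.org/status/404', timeout=10)
    print(resp.read().decode('utf8'))
except error.HTTPError as e:    # the server answered with an error status
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:     # network-level failure: DNS, refused connection, timeout
    print('URL error:', e.reason)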

2.2 A slightly fancier POST

Attach a data payload to send form fields:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'Host': 'httpbin.org',
    'Referer': 'http://httpbin.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
form = {
    'name': 'abc',
    'password': '123'
}
# URL-encode the form fields and convert to bytes, as urlopen expects
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
string = request.urlopen(req).read().decode('utf8')
print(string)

httpbin returns a JSON document describing the request:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "abc",
    "password": "123"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "21",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
  },
  "json": null,
  "origin": "202.119.46.99",
  "url": "http://httpbin.org/post"
}

The json module can pull fields out of the response:

import json
j = json.loads(string)
print(j['form']['name'])
# Output:
abc
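
The same parse.urlencode also builds query strings for GET requests; a small sketch reusing the parameters above:

from urllib import request, parse

params = parse.urlencode({'name': 'abc', 'password': '123'})
url = 'http://httpbin.org/get?' + params
string = request.urlopen(url).read().decode('utf8')
print(string)  # httpbin echoes the parameters back under "args"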

2.3 Using cookies

Step 1: obtain cookies

import http.cookiejar
from urllib import request
cookie = http.cookiejar.CookieJar()            # in-memory cookie store
handler = request.HTTPCookieProcessor(cookie)  # handler that captures cookies from responses
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

The cookies that come back:

BAIDUID=112C1EAFD************1B0E3DC9:FG=1
BIDUPSID=112C************F31C931B0E3DC9
H_PS_PSSID=
PSTM=15*****188
delPer=0
BDSVRTM=0
BD_HOME=0
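
The point of the CookieJar is that, within the same opener, cookies captured from one response are sent back automatically on later requests; a brief sketch:

import http.cookiejar
from urllib import request

cookie = http.cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))

opener.open('http://www.baidu.com')              # first request: the server sets cookies
response = opener.open('http://www.baidu.com')   # second request: cookies are sent back
print(response.getcode())                        # 200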

Step 2: save the cookies locally
(1) The MozillaCookieJar way

import http.cookiejar
from urllib import request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# ignore_discard keeps session cookies; ignore_expires keeps already-expired ones
cookie.save(ignore_discard=True, ignore_expires=True)

(2) The LWPCookieJar way

import http.cookiejar
from urllib import request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

The two classes write different file formats: MozillaCookieJar uses the Netscape cookies.txt layout, while LWPCookieJar writes libwww-perl Set-Cookie3 entries.

Step 3: use the saved cookies
Load the file with the same CookieJar class that saved it:

import http.cookiejar
from urllib import request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

This returns the page's HTML source.
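
To let the plain request.urlopen carry these cookies as well, the opener can be installed globally with request.install_opener (a sketch, assuming cookie1.txt was saved as above):

import http.cookiejar
from urllib import request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
request.install_opener(request.build_opener(request.HTTPCookieProcessor(cookie)))

# urlopen now routes through the cookie-aware opener
print(request.urlopen('http://www.baidu.com').getcode())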
