爬虫简介
概述
近年来,随着网络应用的逐渐扩展和深入,如何高效的获取网上数据成为了无数公司和个人的追求,在大数据时代,谁掌握了更多的数据,谁就可以获得更高的利益,而网络爬虫是其中最为常用的一种从网上爬取数据的手段。
网络爬虫,即Web Spider,是一个很形象的名字。如果把互联网比喻成一个蜘蛛网,那么Spider就是在网上爬来爬去的蜘蛛。网络蜘蛛是通过网页的链接地址来寻找网页的。从网站某一个页面(通常是首页)开始,读取网页的内容,找到在网页中的其它链接地址,然后通过这些链接地址寻找下一个网页,这样一直循环下去,直到把这个网站所有的网页都抓取完为止。如果把整个互联网当成一个网站,那么网络蜘蛛就可以用这个原理把互联网上所有的网页都抓取下来。
爬虫的价值
互联网中最有价值的便是数据,比如天猫商城的商品信息,链家网的租房信息,雪球网的证券投资信息等等,这些数据都代表了各个行业的真金白银,可以说,谁掌握了行业内的第一手数据,谁就成了整个行业的主宰,如果把整个互联网的数据比喻为一座宝藏,那我们的爬虫课程就是来教大家如何来高效地挖掘这些宝藏,掌握了爬虫技能, 你就成了所有互联网信息公司幕后的老板,换言之,它们都在免费为你提供有价值的数据。
爬虫的基本流程
预备知识
requests模块
基本语法
requests模块支持的请求:
1
2
3
4
5
6
7
|
import
requests
requests.get(
"http://httpbin.org/get"
)
requests.post(
"http://httpbin.org/post"
)
requests.put(
"http://httpbin.org/put"
)
requests.delete(
"http://httpbin.org/delete"
)
requests.head(
"http://httpbin.org/get"
)
requests.options(
"http://httpbin.org/get"
)
|
get请求
1 基本请求
1
2
3
4
5
|
import
requests
response
=
requests.get(
'https://www.jd.com/'
,)
with
open
(
"jd.html"
,
"wb"
) as f:
f.write(response.content)
|
2 含参数请求
1
2
3
|
import
requests
response
=
requests.get(
'https://s.taobao.com/search?q=手机'
)
response
=
requests.get(
'https://s.taobao.com/search'
,params
=
{
"q"
:
"美女"
})
|
3 含请求头请求
1
2
3
4
5
6
|
import
requests
response
=
requests.get(
'https://dig.chouti.com/'
,
headers
=
{
'User-Agent'
:
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
,
}
)
|
4 含cookies请求
1
2
3
4
5
6
7
8
|
import
uuid
import
requests
url
=
'http://httpbin.org/cookies'
cookies
=
dict
(sbid
=
str
(uuid.uuid4()))
res
=
requests.get(url, cookies
=
cookies)
print
(res.json())
|
5 request.session()
1
2
3
4
5
6
7
8
9
|
import
requests
# res=requests.get("https://www.zhihu.com/explore")
# print(res.cookies.get_dict())
session
=
requests.session()
res1
=
session.get(
"https://www.zhihu.com/explore"
)
print
(session.cookies.get_dict())
res2
=
session.get(
"https://www.zhihu.com/question/30565354/answer/463324517"
,cookies
=
{
"abs"
:
"123"
}
|
post请求
1 data参数
requests.post()用法与requests.get()完全一致,特殊的是requests.post()多了一个data参数,用来存放请求体数据
1
|
response
=
requests.post(
"http://httpbin.org/post"
,params
=
{
"a"
:
"10"
}, data
=
{
"name"
:
"yuan"
})
|
2 发送json数据
1
2
3
4
5
6
|
import
requests<br>
res1
=
requests.post(url
=
'http://httpbin.org/post'
, data
=
{
'name'
:
'yuan'
})
#没有指定请求头,#默认的请求头:application/x-www-form-urlencoed
print
(res1.json())
res2
=
requests.post(url
=
'http://httpbin.org/post'
,json
=
{
'age'
:
"22"
,})
#默认的请求头:application/json)
print
(res2.json())
|
response对象
(1) 常见属性
1
2
3
4
5
6
7
8
9
10
11
12
13
|
import
requests
respone
=
requests.get(
'https://sh.lianjia.com/ershoufang/'
)
# respone属性
print
(respone.text)
print
(respone.content)
print
(respone.status_code)
print
(respone.headers)
print
(respone.cookies)
print
(respone.cookies.get_dict())
print
(respone.cookies.items())
print
(respone.url)
print
(respone.history)
print
(respone.encoding)
|
(2) 编码问题
1
2
3
4
5
|
import
requests
response
=
requests.get(
'http://www.autohome.com/news'
)
#response.encoding='gbk' #汽车之家网站返回的页面内容为gb2312编码的,而requests的默认编码为ISO-8859-1,如果不设置成gbk则中文乱码
with
open
(
"res.html"
,
"w"
) as f:
f.write(response.text)
|
(3) 下载二进制文件(图片,视频,音频)
1
2
3
4
5
6
|
import
requests
response
=
requests.get(
'http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg'
)
with
open
(
"res.png"
,
"wb"
) as f:
# f.write(response.content) # 比如下载视频时,如果视频100G,用response.content然后一下子写到文件中是不合理的
for
line
in
response.iter_content():
f.write(line)
|
(4) 解析json数据
1
2
3
4
5
6
7
|
import
requests
import
json
response
=
requests.get(
'http://httpbin.org/get'
)
res1
=
json.loads(response.text)
#太麻烦
res2
=
response.json()
#直接获取json数据
print
(res1
=
=
res2)
|
(5) Redirection and History
默认情况下,除了 HEAD, Requests 会自动处理所有重定向。可以使用响应对象的 history
方法来追踪重定向。Response.history
是一个 Response
对象的列表,为了完成请求而创建了这些对象。这个对象列表按照从最老到最近的请求进行排序。
1
2
3
4
5
6
7
|
>>> r
=
requests.get(
'http://github.com'
)
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [
301
]>]
|
另外,还可以通过 allow_redirects
参数禁用重定向处理:
1
2
3
4
5
|
>>> r
=
requests.get(
'http://github.com'
, allow_redirects
=
False
)
>>> r.status_code
301
>>> r.history
[]
|
应用案例
1、模拟GitHub登录,获取登录信息
import requests import re #请求1: r1=requests.get('https://github.com/login') r1_cookie=r1.cookies.get_dict() #拿到初始cookie(未被授权) authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] #从页面中拿到CSRF TOKEN print("authenticity_token",authenticity_token) #第二次请求:带着初始cookie和TOKEN发送POST请求给登录页面,带上账号密码 data={ 'commit':'Sign in', 'utf8':'✓', 'authenticity_token':authenticity_token, 'login':'[email protected]', 'password':'yuanchenqi0316' } #请求2: r2=requests.post('https://github.com/session', data=data, cookies=r1_cookie, # allow_redirects=False ) print(r2.status_code) #200 print(r2.url) #看到的是跳转后的页面:https://github.com/ print(r2.history) #看到的是跳转前的response:[<Response [302]>] print(r2.history[0].text) #看到的是跳转前的response.text with open("result.html","wb") as f: f.write(r2.content)
2、爬取豆瓣电影信息
import requests import re import json import time from concurrent.futures import ThreadPoolExecutor pool=ThreadPoolExecutor(50) def getPage(url): response=requests.get(url) return response.text def parsePage(res): com=re.compile('<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>' '.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',re.S) iter_result=com.finditer(res) return iter_result def gen_movie_info(iter_result): for i in iter_result: yield { "id":i.group("id"), "title":i.group("title"), "rating_num":i.group("rating_num"), "comment_num":i.group("comment_num"), } def stored(gen): with open("move_info.txt","a",encoding="utf8") as f: for line in gen: data=json.dumps(line,ensure_ascii=False) f.write(data+"\n") def spider_movie_info(url): res=getPage(url) iter_result=parsePage(res) gen=gen_movie_info(iter_result) stored(gen) def main(num): url='https://movie.douban.com/top250?start=%s&filter='%num pool.submit(spider_movie_info,url) #spider_movie_info(url) if __name__ == '__main__': before=time.time() count=0 for i in range(10): main(count) count+=25 after=time.time() print("总共耗费时间:",after-before)
概述
近年来,随着网络应用的逐渐扩展和深入,如何高效的获取网上数据成为了无数公司和个人的追求,在大数据时代,谁掌握了更多的数据,谁就可以获得更高的利益,而网络爬虫是其中最为常用的一种从网上爬取数据的手段。
网络爬虫,即Web Spider,是一个很形象的名字。如果把互联网比喻成一个蜘蛛网,那么Spider就是在网上爬来爬去的蜘蛛。网络蜘蛛是通过网页的链接地址来寻找网页的。从网站某一个页面(通常是首页)开始,读取网页的内容,找到在网页中的其它链接地址,然后通过这些链接地址寻找下一个网页,这样一直循环下去,直到把这个网站所有的网页都抓取完为止。如果把整个互联网当成一个网站,那么网络蜘蛛就可以用这个原理把互联网上所有的网页都抓取下来。
爬虫的价值
互联网中最有价值的便是数据,比如天猫商城的商品信息,链家网的租房信息,雪球网的证券投资信息等等,这些数据都代表了各个行业的真金白银,可以说,谁掌握了行业内的第一手数据,谁就成了整个行业的主宰,如果把整个互联网的数据比喻为一座宝藏,那我们的爬虫课程就是来教大家如何来高效地挖掘这些宝藏,掌握了爬虫技能, 你就成了所有互联网信息公司幕后的老板,换言之,它们都在免费为你提供有价值的数据。
爬虫的基本流程
预备知识
requests模块
基本语法
requests模块支持的请求:
1
2
3
4
5
6
7
|
import
requests
requests.get(
"http://httpbin.org/get"
)
requests.post(
"http://httpbin.org/post"
)
requests.put(
"http://httpbin.org/put"
)
requests.delete(
"http://httpbin.org/delete"
)
requests.head(
"http://httpbin.org/get"
)
requests.options(
"http://httpbin.org/get"
)
|
get请求
1 基本请求
1
2
3
4
5
|
import
requests
response
=
requests.get(
'https://www.jd.com/'
,)
with
open
(
"jd.html"
,
"wb"
) as f:
f.write(response.content)
|
2 含参数请求
1
2
3
|
import
requests
response
=
requests.get(
'https://s.taobao.com/search?q=手机'
)
response
=
requests.get(
'https://s.taobao.com/search'
,params
=
{
"q"
:
"美女"
})
|
3 含请求头请求
1
2
3
4
5
6
|
import
requests
response
=
requests.get(
'https://dig.chouti.com/'
,
headers
=
{
'User-Agent'
:
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
,
}
)
|
4 含cookies请求
1
2
3
4
5
6
7
8
|
import
uuid
import
requests
url
=
'http://httpbin.org/cookies'
cookies
=
dict
(sbid
=
str
(uuid.uuid4()))
res
=
requests.get(url, cookies
=
cookies)
print
(res.json())
|
5 request.session()
1
2
3
4
5
6
7
8
9
|
import
requests
# res=requests.get("https://www.zhihu.com/explore")
# print(res.cookies.get_dict())
session
=
requests.session()
res1
=
session.get(
"https://www.zhihu.com/explore"
)
print
(session.cookies.get_dict())
res2
=
session.get(
"https://www.zhihu.com/question/30565354/answer/463324517"
,cookies
=
{
"abs"
:
"123"
}
|
post请求
1 data参数
requests.post()用法与requests.get()完全一致,特殊的是requests.post()多了一个data参数,用来存放请求体数据
1
|
response
=
requests.post(
"http://httpbin.org/post"
,params
=
{
"a"
:
"10"
}, data
=
{
"name"
:
"yuan"
})
|
2 发送json数据
1
2
3
4
5
6
|
import
requests<br>
res1
=
requests.post(url
=
'http://httpbin.org/post'
, data
=
{
'name'
:
'yuan'
})
#没有指定请求头,#默认的请求头:application/x-www-form-urlencoed
print
(res1.json())
res2
=
requests.post(url
=
'http://httpbin.org/post'
,json
=
{
'age'
:
"22"
,})
#默认的请求头:application/json)
print
(res2.json())
|
response对象
(1) 常见属性
1
2
3
4
5
6
7
8
9
10
11
12
13
|
import
requests
respone
=
requests.get(
'https://sh.lianjia.com/ershoufang/'
)
# respone属性
print
(respone.text)
print
(respone.content)
print
(respone.status_code)
print
(respone.headers)
print
(respone.cookies)
print
(respone.cookies.get_dict())
print
(respone.cookies.items())
print
(respone.url)
print
(respone.history)
print
(respone.encoding)
|
(2) 编码问题
1
2
3
4
5
|
import
requests
response
=
requests.get(
'http://www.autohome.com/news'
)
#response.encoding='gbk' #汽车之家网站返回的页面内容为gb2312编码的,而requests的默认编码为ISO-8859-1,如果不设置成gbk则中文乱码
with
open
(
"res.html"
,
"w"
) as f:
f.write(response.text)
|
(3) 下载二进制文件(图片,视频,音频)
1
2
3
4
5
6
|
import
requests
response
=
requests.get(
'http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg'
)
with
open
(
"res.png"
,
"wb"
) as f:
# f.write(response.content) # 比如下载视频时,如果视频100G,用response.content然后一下子写到文件中是不合理的
for
line
in
response.iter_content():
f.write(line)
|
(4) 解析json数据
1
2
3
4
5
6
7
|
import
requests
import
json
response
=
requests.get(
'http://httpbin.org/get'
)
res1
=
json.loads(response.text)
#太麻烦
res2
=
response.json()
#直接获取json数据
print
(res1
=
=
res2)
|
(5) Redirection and History
默认情况下,除了 HEAD, Requests 会自动处理所有重定向。可以使用响应对象的 history
方法来追踪重定向。Response.history
是一个 Response
对象的列表,为了完成请求而创建了这些对象。这个对象列表按照从最老到最近的请求进行排序。
1
2
3
4
5
6
7
|
>>> r
=
requests.get(
'http://github.com'
)
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [
301
]>]
|
另外,还可以通过 allow_redirects
参数禁用重定向处理:
1
2
3
4
5
|
>>> r
=
requests.get(
'http://github.com'
, allow_redirects
=
False
)
>>> r.status_code
301
>>> r.history
[]
|
应用案例
1、模拟GitHub登录,获取登录信息
import requests import re #请求1: r1=requests.get('https://github.com/login') r1_cookie=r1.cookies.get_dict() #拿到初始cookie(未被授权) authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] #从页面中拿到CSRF TOKEN print("authenticity_token",authenticity_token) #第二次请求:带着初始cookie和TOKEN发送POST请求给登录页面,带上账号密码 data={ 'commit':'Sign in', 'utf8':'✓', 'authenticity_token':authenticity_token, 'login':'[email protected]', 'password':'yuanchenqi0316' } #请求2: r2=requests.post('https://github.com/session', data=data, cookies=r1_cookie, # allow_redirects=False ) print(r2.status_code) #200 print(r2.url) #看到的是跳转后的页面:https://github.com/ print(r2.history) #看到的是跳转前的response:[<Response [302]>] print(r2.history[0].text) #看到的是跳转前的response.text with open("result.html","wb") as f: f.write(r2.content)
2、爬取豆瓣电影信息
import requests import re import json import time from concurrent.futures import ThreadPoolExecutor pool=ThreadPoolExecutor(50) def getPage(url): response=requests.get(url) return response.text def parsePage(res): com=re.compile('<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>' '.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',re.S) iter_result=com.finditer(res) return iter_result def gen_movie_info(iter_result): for i in iter_result: yield { "id":i.group("id"), "title":i.group("title"), "rating_num":i.group("rating_num"), "comment_num":i.group("comment_num"), } def stored(gen): with open("move_info.txt","a",encoding="utf8") as f: for line in gen: data=json.dumps(line,ensure_ascii=False) f.write(data+"\n") def spider_movie_info(url): res=getPage(url) iter_result=parsePage(res) gen=gen_movie_info(iter_result) stored(gen) def main(num): url='https://movie.douban.com/top250?start=%s&filter='%num pool.submit(spider_movie_info,url) #spider_movie_info(url) if __name__ == '__main__': before=time.time() count=0 for i in range(10): main(count) count+=25 after=time.time() print("总共耗费时间:",after-before)