Coroutines: Making Concurrent Requests

In discussions of Python coroutines, the keywords that come up most often are:

  • blocking
  • non-blocking
  • synchronous
  • asynchronous
  • concurrency
  • parallelism
  • coroutine
  • asyncio
  • aiohttp

For the underlying concepts, I think the two blog posts below explain things well, so rather than restate them here I'll just give the links (plus the aiohttp docs):
http://python.jobbole.com/87310/
http://python.jobbole.com/88291/
https://aiohttp.readthedocs.io/en/stable/client_quickstart.html  # the aiohttp documentation

Treat what follows as a set of exercises.

  • Define a coroutine function and make a single request to qq.com (Tencent's portal)
import asyncio
import aiohttp


async def get_page(url):
	# the session manages a connection pool; async with closes it cleanly
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			# qq.com serves GBK-family text, so decode it as GB18030
			page = await resp.text(encoding='GB18030')
			print(page)


url = 'https://www.qq.com'
loop = asyncio.get_event_loop()
loop.run_until_complete(get_page(url))
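
As an aside: on Python 3.7+ the same exercise is usually driven with asyncio.run(), which creates and closes the event loop for you. A minimal sketch of the equivalent program:

import asyncio
import aiohttp


async def get_page(url):
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			# qq.com serves GBK-family text, so decode it as GB18030
			return await resp.text(encoding='GB18030')


async def main():
	print(await get_page('https://www.qq.com'))

asyncio.run(main())  # Python 3.7+: sets up and tears down the loop itself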
  • Add a callback function to the coroutine and extract the qq.com page title
import asyncio
import aiohttp
import re


async def get_page(url):
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			page = await resp.text(encoding='GB18030')
			return page


def callback(future):
	# future.result() is whatever get_page() returned
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, future.result())
	print(item)


url = 'https://www.qq.com'
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(get_page(url))
task.add_done_callback(callback)
loop.run_until_complete(task)
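
The callback isn't strictly required here: run_until_complete() hands back the coroutine's return value, so for a single task the title could just as well be extracted inline. A minimal sketch:

import asyncio
import aiohttp
import re


async def get_page(url):
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			return await resp.text(encoding='GB18030')


loop = asyncio.get_event_loop()
# run_until_complete() returns the coroutine's result directly
page = loop.run_until_complete(get_page('https://www.qq.com'))
print(re.findall('<title>(.*?)</title>', page))

Callbacks pay off once many tasks are in flight, as in the next exercise.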
  • Make 100 concurrent requests to qq.com; average elapsed time: about 2.349 seconds
import asyncio
import aiohttp
import re
import time


async def get_page(url):
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			page = await resp.text(encoding='GB18030')
			return page


def callback(future):
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, future.result())
	print(item)


url = 'https://www.qq.com'
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(get_page(url)) for _ in range(100)]
for task in tasks:
	task.add_done_callback(callback)
start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
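
Note that each of the 100 tasks above opens its own ClientSession. The aiohttp docs recommend sharing one session across requests so its connection pool gets reused; a sketch of the same exercise with a shared session and asyncio.gather() (I haven't timed this variant):

import asyncio
import aiohttp
import re
import time


async def get_page(session, url):
	# reuse the caller's session so its connection pool is shared
	async with session.get(url) as resp:
		return await resp.text(encoding='GB18030')


async def main(url):
	async with aiohttp.ClientSession() as session:
		# gather() runs the coroutines concurrently and preserves result order
		pages = await asyncio.gather(*(get_page(session, url) for _ in range(100)))
		for page in pages:
			print(re.findall('<title>(.*?)</title>', page))


loop = asyncio.get_event_loop()
start = time.time()
loop.run_until_complete(main('https://www.qq.com'))
print(time.time() - start)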


  • The above is the asynchronous approach. Now make the same 100 requests to qq.com synchronously; average elapsed time: 12.222 seconds
import requests
import re
import time

def get_page(url):
	resp = requests.get(url)
	return resp.text

def parse(page):
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, page)
	print(item)


url = 'https://www.qq.com'
start = time.time()
for _ in range(100):
	page = get_page(url)
	parse(page)
print(time.time() - start)
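
For a slightly fairer synchronous baseline, requests.Session reuses the underlying TCP connection across the 100 requests instead of reconnecting every time; the requests are still serial, so don't expect a dramatic change. A sketch:

import requests
import re
import time

url = 'https://www.qq.com'
start = time.time()
# the Session keeps the connection alive between requests
with requests.Session() as session:
	for _ in range(100):
		page = session.get(url).text
		print(re.findall('<title>(.*?)</title>', page))
print(time.time() - start)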


  • Next, try 100 threads instead; average elapsed time: 1.976 seconds
import requests
import re
import threading
import time


def get_page(url):
	resp = requests.get(url)
	return resp.text

def parse(page):
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, page)
	print(item)

def main(url):
	page = get_page(url)
	parse(page)


url = 'https://www.qq.com'
ts = [threading.Thread(target=main, args=(url,)) for _ in range(100)]
start = time.time()
for t in ts:
	t.start()
for t in ts:
	t.join()
print(time.time() - start)
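
Starting one bare Thread per request works at this scale, but the standard library's concurrent.futures.ThreadPoolExecutor expresses the same idea with a bounded pool and less bookkeeping. A sketch:

import requests
import re
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_title(url):
	page = requests.get(url).text
	return re.findall('<title>(.*?)</title>', page)


url = 'https://www.qq.com'
start = time.time()
# 100 workers, one request each; map() yields results in submission order
with ThreadPoolExecutor(max_workers=100) as pool:
	for item in pool.map(fetch_title, [url] * 100):
		print(item)
print(time.time() - start)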


So far, synchronous multithreading and asynchronous coroutines look about equally efficient.
Next, let's request a different site: Douban movie pages, which respond much more slowly than the qq.com homepage. I'll also extract more fields this time; the code is adapted from an earlier post of mine: https://blog.csdn.net/eighttoes/article/details/86159500

  • First, 30 requests using asynchronous coroutines (Douban blocks crawlers, so I'm limiting it to 30 for now); average elapsed time: 1.199 seconds
import asyncio
import aiohttp
import uagent  # helper module that supplies a User-Agent string: https://blog.csdn.net/eighttoes/article/details/82996377
import re
import redis
import time

async def get_page(url):
	headers = {'User-Agent': uagent.get_ua()}
	async with aiohttp.ClientSession() as session:
		async with session.get(url, headers=headers) as resp:
			page = await resp.text()
			return page

# A callback that just extracts a few fields for display; don't read too much into it
def callback(future):
	page = future.result()
	pattern1 = '<span property="v:itemreviewed">(.*?)</span>'
	item1 = re.findall(pattern1, page)
	pattern2 = 'rel="v:directedBy">(.*?)</a>'
	item2 = re.findall(pattern2, page)
	pattern3 = '<span property="v:initialReleaseDate" content="(.*?)">'
	item3 = re.findall(pattern3, page)
	pattern4 = '<span class="pl">制片国家/地区:</span>(.*?)<br/>'
	item4 = re.findall(pattern4, page)
	pattern5 = '<span class="pl">片长:</span> <span property="v:runtime" content=".*">(.*?)</span><br/>'
	item5 = re.findall(pattern5, page)
	pattern6 = '<strong class="ll rating_num" property="v:average">(.*?)</strong>'
	item6 = re.findall(pattern6, page)
	pattern7 = '<span property="v:genre">(.*?)</span>'
	item7 = re.findall(pattern7, page)
	pattern8 = '<a href="collections" class="rating_people"><span property="v:votes">(.*?)</span>人评价</a>'
	item8 = re.findall(pattern8, page)
	print((item1, item2, item3, item4, item5, item6, item7, item8))


# I have some URLs stored in Redis; pop 30 of them to use for the requests
r = redis.StrictRedis(host='localhost', port=6379, db=0)
urls = [r.rpop('movie_lists').decode() for _ in range(30)]
	
tasks = [asyncio.ensure_future(get_page(url)) for url in urls]
for task in tasks:
	task.add_done_callback(callback)
loop = asyncio.get_event_loop()
start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
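
A small refinement to the callback: the eight findall() calls pass raw pattern strings, so each pattern goes through re's cache lookup on every page. Compiling the patterns once up front is cleaner and makes the fields self-describing; a sketch of the idea with a subset of the patterns (the field names are my own labels):

import re

# compile each pattern once; each tuple pairs a label with its regex
PATTERNS = [
	('title', re.compile('<span property="v:itemreviewed">(.*?)</span>')),
	('director', re.compile('rel="v:directedBy">(.*?)</a>')),
	('release_date', re.compile('<span property="v:initialReleaseDate" content="(.*?)">')),
	('rating', re.compile('<strong class="ll rating_num" property="v:average">(.*?)</strong>')),
]


def callback(future):
	page = future.result()
	print({name: pat.findall(page) for name, pat in PATTERNS})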


  • Then the same 30 requests with synchronous multithreading; average elapsed time: 1.126 seconds
import requests
import uagent  # helper module that supplies a User-Agent string: https://blog.csdn.net/eighttoes/article/details/82996377
import re
import redis
import threading
import time

def get_page(url):
	headers = {'User-Agent': uagent.get_ua()}
	resp = requests.get(url, headers=headers)
	page = resp.text
	return page

# A parsing function that just extracts a few fields for display; don't read too much into it
def parse(page):
	pattern1 = '<span property="v:itemreviewed">(.*?)</span>'
	item1 = re.findall(pattern1, page)
	pattern2 = 'rel="v:directedBy">(.*?)</a>'
	item2 = re.findall(pattern2, page)
	pattern3 = '<span property="v:initialReleaseDate" content="(.*?)">'
	item3 = re.findall(pattern3, page)
	pattern4 = '<span class="pl">制片国家/地区:</span>(.*?)<br/>'
	item4 = re.findall(pattern4, page)
	pattern5 = '<span class="pl">片长:</span> <span property="v:runtime" content=".*">(.*?)</span><br/>'
	item5 = re.findall(pattern5, page)
	pattern6 = '<strong class="ll rating_num" property="v:average">(.*?)</strong>'
	item6 = re.findall(pattern6, page)
	pattern7 = '<span property="v:genre">(.*?)</span>'
	item7 = re.findall(pattern7, page)
	pattern8 = '<a href="collections" class="rating_people"><span property="v:votes">(.*?)</span>人评价</a>'
	item8 = re.findall(pattern8, page)
	print((item1, item2, item3, item4, item5, item6, item7, item8))

def main(url):
	page = get_page(url)
	parse(page)

# I have some URLs stored in Redis; pop 30 of them to use for the requests
r = redis.StrictRedis(host='localhost', port=6379, db=0)
urls = [r.rpop('movie_lists').decode() for _ in range(30)]

ts = [threading.Thread(target=main, args=(url,)) for url in urls]
start = time.time()
for t in ts:
	t.start()
for t in ts:
	t.join()
print(time.time() - start)
	

Again, synchronous multithreading and asynchronous coroutines come out about even.

  • Now back to the qq.com example, this time with the coroutine concurrency raised to 500; average elapsed time: 13.5918 seconds
import asyncio
import aiohttp
import re
import time


async def get_page(url):
	async with aiohttp.ClientSession() as session:
		async with session.get(url) as resp:
			page = await resp.text(encoding='GB18030')
			return page


def callback(future):
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, future.result())
	print(item)


url = 'https://www.qq.com'
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(get_page(url)) for _ in range(500)]
for task in tasks:
	task.add_done_callback(callback)
start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
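
Part of the slowdown here is plausibly the 500 separate sessions, each opening its own connection. With a single shared session, aiohttp's TCPConnector caps how many connections are open at once (the default limit is 100), which smooths out bursts like this; a sketch (the limit value is arbitrary, and I haven't timed this variant):

import asyncio
import aiohttp
import re
import time


async def get_page(session, url):
	async with session.get(url) as resp:
		return await resp.text(encoding='GB18030')


async def main(url):
	# at most 100 sockets open at a time; other tasks wait for a free slot
	conn = aiohttp.TCPConnector(limit=100)
	async with aiohttp.ClientSession(connector=conn) as session:
		pages = await asyncio.gather(*(get_page(session, url) for _ in range(500)))
		for page in pages:
			print(re.findall('<title>(.*?)</title>', page))


loop = asyncio.get_event_loop()
start = time.time()
loop.run_until_complete(main('https://www.qq.com'))
print(time.time() - start)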

  • Then synchronous multithreading, this time with 500 threads; average elapsed time: 10.98 seconds

import requests
import re
import threading
import time


def get_page(url):
	resp = requests.get(url)
	return resp.text

def parse(page):
	pattern = '<title>(.*?)</title>'
	item = re.findall(pattern, page)
	print(item)

def main(url):
	page = get_page(url)
	parse(page)


url = 'https://www.qq.com'
ts = [threading.Thread(target=main, args=(url,)) for _ in range(500)]
start = time.time()
for t in ts:
	t.start()
for t in ts:
	t.join()
print(time.time() - start)

It seems synchronous multithreading and asynchronous coroutines really are comparable in efficiency. When I later pushed the coroutine concurrency up to 1000, the program stopped running at all, failing with ValueError: too many file descriptors in select(); the threaded version, by contrast, still ran with 1000 threads.
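
That ValueError comes from the event loop's select() call, which can only watch a limited number of file descriptors (512 on Windows), so 1000 simultaneous sockets overflows it. Capping how many requests are in flight avoids the error without abandoning coroutines; a minimal sketch using asyncio.Semaphore (the cap of 500 is arbitrary):

import asyncio
import aiohttp

sem = asyncio.Semaphore(500)  # at most 500 requests in flight at once


async def get_page(session, url):
	# hold the semaphore for the lifetime of the socket
	async with sem:
		async with session.get(url) as resp:
			return await resp.text(encoding='GB18030')


async def main(url, n=1000):
	async with aiohttp.ClientSession() as session:
		return await asyncio.gather(*(get_page(session, url) for _ in range(n)))


loop = asyncio.get_event_loop()
pages = loop.run_until_complete(main('https://www.qq.com'))
print(len(pages))

(A shared session with a TCPConnector limit, as in the earlier sketch, would also keep the socket count under the threshold.)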


Since the efficiency is about the same, which style you use comes down to personal preference.
Note that everything discussed here concerns network-IO-bound workloads; I can't speak for other scenarios.

Reposted from blog.csdn.net/eighttoes/article/details/86582479