Asynchronous crawling with thread pools and coroutines
Compare the running time of synchronous versus asynchronous crawling, using time.sleep() to simulate the delay of a request.
1. Synchronous requests
```python
from time import sleep
import time

def request(url):
    print('requesting:', url)
    sleep(2)  # simulate a 2-second request delay
    print('download succeeded:', url)

start = time.time()
urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']
for url in urls:
    request(url)
print(time.time() - start)
```
2. Using a thread pool
```python
from time import sleep
import time
from multiprocessing.dummy import Pool

def request(url):
    print('requesting:', url)
    sleep(2)  # simulate a 2-second request delay
    print('download succeeded:', url)

start = time.time()
urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']
pool = Pool(3)
pool.map(request, urls)
print(time.time() - start)
```
- Example: using a thread pool to crawl video data from Pear Video (梨视频)
- Download the first five videos in the "social" category
```python
from multiprocessing.dummy import Pool
import requests
from lxml import etree
import re
import random

url = 'https://www.pearvideo.com/category_1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
response = requests.get(url=url, headers=headers).text
tree = etree.HTML(response)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')

# collect the real video URLs from each detail page
video_list = []
for li in li_list:
    video = li.xpath('./div/a/@href')[0]
    detail_url = 'https://www.pearvideo.com/' + video
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    ret = re.findall(',srcUrl="(.*?)",', detail_page_text, re.S)[0]
    print(ret)
    video_list.append(ret)

def get_video_data(url):
    return requests.get(url=url, headers=headers).content

def save_video(data):
    name = str(random.randint(0, 9999)) + '.mp4'
    with open(name, 'wb') as f:
        f.write(data)
    print(name, 'downloaded successfully')

pool = Pool(4)
# list of the binary data of all the videos
all_video_data_list = pool.map(get_video_data, video_list)
pool.map(save_video, all_video_data_list)
```
Coroutine basics
Asynchronous code is more efficient: coroutines cost less to switch than threads, and threads cost less than processes. For crawlers, a single-threaded asynchronous + coroutine approach is therefore recommended; it can greatly improve efficiency.
- event_loop: the event loop, essentially an infinite loop. We can register certain functions on it, and when the required conditions are met those functions are executed. An ordinary program runs its statements in order from start to finish, a fixed number of times. In an asynchronous program, some parts inevitably take a long time to run; we want those parts to yield control and run in the background so other parts can run first. When a background task finishes, it must promptly notify the main program so it can take the next step. Since how long a task will take is uncertain, the main program must continuously monitor task state and, on receiving a completion message, move on. The event loop is exactly this continuous monitoring loop.
- coroutine: a coroutine object. We can register a coroutine object on the event loop, where it will be called by the loop. A method defined with the async keyword is not executed when called; instead the call returns a coroutine object.
- task: a further wrapper around a coroutine object, which also carries the state of the task.
- future: represents a task that will be executed or has not yet been executed; in practice there is no essential difference between a future and a task.
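The relationship between these terms can be seen in a few lines. A minimal sketch (the function names are illustrative, not from the original code):

```python
import asyncio

async def greet():
    # calling greet() does not execute the body; it returns a coroutine object
    return 'hello'

async def main():
    coro = greet()                      # coroutine object
    task = asyncio.ensure_future(coro)  # task: wraps the coroutine and tracks its state
    # Task is a subclass of Future, which is why the two barely differ
    assert isinstance(task, asyncio.Future)
    return await task                   # drive the task to completion

print(asyncio.run(main()))  # hello
```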
In addition, we need to know the async/await keywords, introduced in Python 3.5 specifically for defining coroutines. async defines a coroutine; await is used to suspend execution at a blocking operation. See the official documentation.
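A small illustration of both keywords (asyncio.sleep stands in here for a real I/O wait):

```python
import asyncio

async def fetch():
    # await suspends this coroutine here; the event loop is free
    # to run other coroutines until the sleep finishes
    await asyncio.sleep(0.1)
    return 'done'

coro = fetch()
print(type(coro).__name__)  # coroutine -- the body has not run yet
print(asyncio.run(coro))    # done
```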
To test the code in the following sections, we can set up our own server to guarantee a fixed blocking time, built simply with Flask:
```python
from flask import Flask
import time

app = Flask(__name__)

@app.route('/test1')
def index_bobo():
    time.sleep(2)
    return 'Hello bobo'

@app.route('/test2')
def index_jay():
    time.sleep(2)
    return 'Hello jay'

@app.route('/test3')
def index_tom():
    time.sleep(2)
    return 'Hello tom'

if __name__ == '__main__':
    app.run(threaded=True)
```
1. Basic use of coroutines: the asyncio module
```python
import asyncio

async def request(url):
    print('requesting:', url)
    print('download succeeded:', url)

c = request('www.baidu.com')

# instantiate an event loop object
loop = asyncio.get_event_loop()

# create a task object, which wraps the coroutine object
# task = loop.create_task(c)
# another way to instantiate a task object
task = asyncio.ensure_future(c)
print(task)

# register the coroutine object with the event loop and start the loop
loop.run_until_complete(task)
print(task)
```
2. Binding a callback function to the task object
```python
import asyncio

async def request(url):
    print('requesting:', url)
    print('download succeeded:', url)
    return url

# the callback function must take one parameter: the task.
# task.result(): the return value of the coroutine function wrapped by the task object
def callback(task):
    print('this is callback!')
    print(task.result())

c = request('www.baidu.com')

# bind a callback to the task object
task = asyncio.ensure_future(c)
task.add_done_callback(callback)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)
```
3. Multi-task asynchronous coroutines
Note: time.sleep() cannot be used here to simulate blocking, because code that does not support async must not appear inside an asynchronous coroutine; it would block the whole event loop. Use asyncio.sleep() instead, and remember to add the await keyword in front of the coroutine call, otherwise an error occurs.
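To see the difference concretely, here is a sketch (with a shortened 0.2 s delay instead of 2 s) timing three tasks that use time.sleep() against three that use await asyncio.sleep():

```python
import asyncio
import time

async def blocking(url):
    time.sleep(0.2)           # blocks the entire event loop

async def non_blocking(url):
    await asyncio.sleep(0.2)  # suspends only this coroutine

async def run_all(coro_func):
    await asyncio.gather(*(coro_func(u) for u in ['a', 'b', 'c']))

start = time.time()
asyncio.run(run_all(blocking))
print('time.sleep:', round(time.time() - start, 1))      # ~0.6 s: the tasks run serially

start = time.time()
asyncio.run(run_all(non_blocking))
print('asyncio.sleep:', round(time.time() - start, 1))   # ~0.2 s: the waits overlap
```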
```python
from time import sleep
import asyncio
import time

urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']
start = time.time()

async def request(url):
    print('requesting:', url)
    # in multi-task asynchronous coroutines, code that does not
    # support async must not appear:
    # sleep(2)
    await asyncio.sleep(2)
    print('download succeeded:', url)

loop = asyncio.get_event_loop()
# task list: holds multiple task objects
tasks = []
for url in urls:
    c = request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
```
4. Applying multi-task asynchronous coroutines in a crawler
If we simply plug the requests module into a multi-task coroutine crawler, we find no speedup at all, because the requests module does not support asynchronous operation. We should therefore use aiohttp to make the requests instead.
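As an aside, not from the original tutorial: if aiohttp is unavailable, a blocking library like requests can still be driven from coroutines by pushing each call onto a thread pool with run_in_executor. A sketch, with time.sleep standing in for the blocking requests.get call:

```python
import asyncio
import time

def blocking_get(url):
    # stand-in for requests.get(url).text, which blocks
    time.sleep(0.2)
    return 'page for ' + url

async def fetch(url):
    loop = asyncio.get_running_loop()
    # run the blocking call in the default thread-pool executor
    return await loop.run_in_executor(None, blocking_get, url)

async def main():
    urls = ['a', 'b', 'c']
    return await asyncio.gather(*(fetch(u) for u in urls))

start = time.time()
results = asyncio.run(main())
print(results)
print(round(time.time() - start, 1))  # ~0.2 s: the three calls overlap
```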
```python
import asyncio
import time
import aiohttp

# single thread + multi-task asynchronous coroutines
urls = [
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/tom'
]

# proxy usage:
# async with await s.get(url, proxy="http://ip:port") as response:
async def get_pageText(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            page_text = await response.text()
            # the data-parsing work is handled in the callback
            return page_text

# callback function that performs the data parsing
def parse(task):
    # 1. fetch the response data
    page_text = task.result()
    print(page_text + ', about to parse the data!!!')
    # parsing code goes here

start = time.time()
tasks = []
for url in urls:
    c = get_pageText(url)
    task = asyncio.ensure_future(c)
    # bind the data-parsing callback to the task object
    task.add_done_callback(parse)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
```