Efficient crawler solutions - thread pool + single-threaded asynchronous coroutines

Using a thread pool for asynchronous crawling

  Compare the time taken by synchronous crawling and by asynchronous crawling, using time.sleep() to simulate the latency of a request.

1. Synchronous requests

from time import sleep
import time

def request(url):
    print('Requesting:', url)
    sleep(2)
    print('Download succeeded:', url)

start = time.time()
urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']

for url in urls:
    request(url)
print(time.time() - start)  # roughly 6 seconds: the three 2-second waits run one after another

2. Using a thread pool

from time import sleep
import time
from multiprocessing.dummy import Pool

def request(url):
    print('Requesting:', url)
    sleep(2)
    print('Download succeeded:', url)

start = time.time()
urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']

# a pool of 3 worker threads runs the three requests concurrently
pool = Pool(3)
pool.map(request, urls)

print(time.time() - start)  # roughly 2 seconds: the waits overlap
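
For comparison, the standard library's concurrent.futures exposes the same map-over-a-pool pattern (multiprocessing.dummy.Pool is itself a thread pool behind the multiprocessing API); a minimal equivalent sketch:

from concurrent.futures import ThreadPoolExecutor
import time
from time import sleep

def request(url):
    print('Requesting:', url)
    sleep(2)
    print('Download succeeded:', url)

urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']

start = time.time()
# the with-block waits for all submitted calls to finish before exiting
with ThreadPoolExecutor(max_workers=3) as pool:
    pool.map(request, urls)
print(time.time() - start)  # roughly 2 seconds, same as with multiprocessing.dummy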


  • Example: using a thread pool to crawl video data from Pear Video (pearvideo.com)
    • Crawl the first five videos in the social category
from multiprocessing.dummy import Pool
import requests
from lxml import etree
import re
import random

url = 'https://www.pearvideo.com/category_1'

headers = {
    "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

response = requests.get(url=url,headers=headers).text
tree = etree.HTML(response)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
video_list = []
for li in li_list:
    video = li.xpath('./div/a/@href')[0]
    detail_url = 'https://www.pearvideo.com/' + video
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    ret = re.findall(',srcUrl="(.*?)",',detail_page_text,re.S)[0]
    print(ret)
    video_list.append(ret)
   
def get_video_data(url):
    return requests.get(url=url, headers=headers).content


def save_video(data):
    # name the file with a random number (collisions are possible; fine for a demo)
    name = str(random.randint(0, 9999)) + '.mp4'
    with open(name, 'wb') as f:
        f.write(data)
    print(name, 'downloaded successfully')


pool = Pool(4)
# a list holding the binary data of each video
all_video_data_list = pool.map(get_video_data, video_list)
pool.map(save_video, all_video_data_list)
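
Note that the detail pages above are still fetched one at a time inside the for loop; only the downloading and saving go through the pool. A possible variant, sketched with an illustrative fetch_video_url helper and assuming the page structure stays the same, pushes the detail-page requests through the pool as well:

def fetch_video_url(detail_url):
    # fetch one detail page and extract the real video URL from the inline JS
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    return re.findall(',srcUrl="(.*?)",', detail_page_text, re.S)[0]

detail_urls = ['https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
               for li in li_list]

pool = Pool(4)
video_list = pool.map(fetch_video_url, detail_urls)         # detail pages in parallel
all_video_data_list = pool.map(get_video_data, video_list)  # downloads in parallel
pool.map(save_video, all_video_data_list)
pool.close()
pool.join()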


Coroutine basics

  Asynchronous execution is more efficient: switching between coroutines costs less than switching between threads, which in turn costs less than switching between processes. For crawlers, the recommended approach is therefore single-threaded asynchronous + coroutines, which can improve efficiency considerably.

  • event_loop: the event loop, essentially an endless loop. We can register special functions on it, and when certain conditions are met those functions are executed. An ordinary program runs from start to finish in a set order, with the number of iterations fixed in advance. When writing an asynchronous program, the parts that take a long time should give up control and run in the background, so that other parts of the program can run first; when a background task completes, it has to notify the main program so the next step can proceed. Since how long that takes is uncertain, the main program must keep monitoring task state and, as soon as it receives a completion notification, move on to the next step. The event loop is exactly this continuous monitoring loop.
  • coroutine: a coroutine object. We can register a coroutine object on the event loop, and the loop will call it. A method defined with the async keyword is not executed immediately when called; instead, the call returns a coroutine object.
  • Task: a further wrapper around a coroutine object that also records the state of the task.
  • Future: represents a task that will or will not be executed in the future; in essence it is no different from Task (see the sketch below).

   In addition, we need to know the async/await keywords, introduced in Python 3.5 specifically for defining coroutines: async defines a coroutine, and await suspends execution while waiting on a blocking call. See the official documentation for details.
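
To see these pieces concretely, here is a minimal sketch (the names are illustrative): calling an async function returns a coroutine object rather than running it, and a Task really is a Future underneath:

import asyncio

async def hello():
    return 'hi'

c = hello()                              # calling an async def returns a coroutine object
print(asyncio.iscoroutine(c))            # True: nothing has executed yet

loop = asyncio.get_event_loop()
task = loop.create_task(c)               # wrap the coroutine object in a Task
print(isinstance(task, asyncio.Future))  # True: Task is a subclass of Future

loop.run_until_complete(task)
print(task.result())                     # hi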

To test the code in the following sections, we can set up our own server to guarantee a fixed blocking time, using a simple Flask app (each endpoint blocks for 2 seconds):

from flask import Flask
import time

app = Flask(__name__)


@app.route('/test1')
def index_bobo():
    time.sleep(2)
    return 'Hello bobo'

@app.route('/test2')
def index_jay():
    time.sleep(2)
    return 'Hello jay'

@app.route('/test3')
def index_tom():
    time.sleep(2)
    return 'Hello tom'

if __name__ == '__main__':
    app.run(threaded=True)

 1. Basic use of coroutines: the asyncio module

import asyncio

async def request(url):
    print('Requesting:', url)
    print('Download succeeded:', url)

c = request('www.baidu.com')

# instantiate an event loop object
loop = asyncio.get_event_loop()
# create a task object, which wraps the coroutine object
# task = loop.create_task(c)
# another way to instantiate a task object
task = asyncio.ensure_future(c)

print(task)  # still pending at this point

# register the task on the event loop and start the loop
loop.run_until_complete(task)

print(task)  # now finished, with the coroutine's return value attached

2. Binding a callback function to the task object

import asyncio

async def request(url):
    print('Requesting:', url)
    print('Download succeeded:', url)
    return url

# the callback function must take one parameter: the task
# task.result(): the return value of the coroutine function wrapped in the task object
def callback(task):
    print('this is callback!')
    print(task.result())

c = request('www.baidu.com')

# bind a callback to the task object
task = asyncio.ensure_future(c)
task.add_done_callback(callback)

loop = asyncio.get_event_loop()
loop.run_until_complete(task)

3. Multi-task asynchronous coroutines

  Note: simulating the blocking wait with time.sleep() breaks this pattern, because code that does not support asynchronous execution must not appear inside a coroutine (it would block the whole event loop). Also, asyncio.sleep() returns a coroutine, so the await keyword must be placed before the call, otherwise it never runs and an error is raised.

import asyncio
import time

urls = ['www.baidu.com', 'www.sogou.com', 'www.goubanjia.com']
start = time.time()

async def request(url):
    print('Requesting:', url)
    # in multi-task asynchronous coroutines, code that does not support async must not appear
    # time.sleep(2)
    await asyncio.sleep(2)
    print('Download succeeded:', url)

loop = asyncio.get_event_loop()
# task list: holds the task objects
tasks = []
for url in urls:
    c = request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)  # roughly 2 seconds instead of 6: the three waits overlap

4. Applying multi-task coroutines to crawlers

   If we simply plug the requests module into a multi-task coroutine crawler, we find no speedup at all, because the requests module does not support asynchronous operation. So we should use aiohttp to make the requests instead.

import asyncio
import time
import aiohttp

# single thread + multi-task asynchronous coroutines
urls = [
    'http://127.0.0.1:5000/test1',
    'http://127.0.0.1:5000/test2',
    'http://127.0.0.1:5000/test3'
]

# with a proxy:
# async with await s.get(url, proxy="http://ip:port") as response:
async def get_pageText(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            page_text = await response.text()
            # the response data is handed to a callback for parsing
            return page_text

# callback function that wraps the data-parsing step
def parse(task):
    # 1. fetch the response data
    page_text = task.result()
    print(page_text + ', about to parse the data!!!')
    # the parsing logic would go here

start = time.time()
tasks = []
for url in urls:
    c = get_pageText(url)
    task = asyncio.ensure_future(c)
    # bind the data-parsing callback to the task object
    task.add_done_callback(parse)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)
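
The examples above follow the older asyncio style built on get_event_loop() and ensure_future(). On Python 3.7+ the same crawler can be written with asyncio.run() and asyncio.gather(); gather returns the results in request order, so the callback can become ordinary sequential code. A sketch against the same Flask server:

import asyncio
import time
import aiohttp

urls = [
    'http://127.0.0.1:5000/test1',
    'http://127.0.0.1:5000/test2',
    'http://127.0.0.1:5000/test3'
]

async def get_pageText(url):
    async with aiohttp.ClientSession() as s:
        async with s.get(url) as response:
            return await response.text()

async def main():
    # gather schedules all coroutines concurrently and returns their results in order
    results = await asyncio.gather(*(get_pageText(url) for url in urls))
    for page_text in results:
        print(page_text + ', about to parse the data!!!')

start = time.time()
asyncio.run(main())
print(time.time() - start)  # still roughly 2 seconds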
