Ajax Crawling in Practice: Toutiao Street-Shot Gallery Images

Reprinted from the Jingmi blog >> [Python 3 Web Crawler Development in Action], Section 6.4: Analyzing Ajax to Crawl Toutiao Street-Shot Gallery Images

A few key points from the article above, recorded here:

  • Technique notes

  Open the page, right-click and choose Inspect, then switch to the Network tab. Clear all captured requests and refresh the page; the first request is the page skeleton and contains the basic HTML, but the data may be loaded later by an Ajax request and rendered with JavaScript. To tell which case you are in, copy a piece of data you can see on the page and search for it in that first request's Response/Preview tab. If it is there, you can request the page directly and extract the information you want. If it is not, the data is loaded by a separate request and rendered afterwards, so you need to find that request: filter the panel by XHR, refresh again, and only the Ajax requests will be shown. Check each one for the data you need; once you find it, analyze how the request is built (URL and parameters), and the data can then be fetched directly as JSON (a minimal request sketch follows below).

  For the street-shot crawl itself, even after analyzing the Ajax request, the request headers carry some encrypted/signed parameters, so you may have to read the site's JavaScript source to reproduce them. For that reason, the code below is not something I wrote entirely on my own.
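  The code in the next section calls get_page(offset) and get_images(json), which this excerpt does not define. As a rough sketch only (not the article's original code), here is what the Ajax request could look like, assuming the search endpoint, parameter names, and keyword used in the book's example; the endpoint may have changed since, and this sketch does not reproduce the signed headers mentioned above:

    from urllib.parse import urlencode

    import requests


    def get_page(offset):
        # assumed endpoint and parameters (taken from the book's example); they may no longer work as-is
        params = {
            'offset': offset,      # page offset, a multiple of 20
            'format': 'json',
            'keyword': '街拍',     # "street shot"
            'autoload': 'true',
            'count': '20',
            'cur_tab': '1',
        }
        url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
        try:
            response = requests.get(url)
            if response.status_code == 200:
                # the Ajax response body is JSON, so it can be parsed directly
                return response.json()
        except requests.ConnectionError:
            return None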

  • Code notes
    • Saving images: building the folder and choosing a file name (a sketch of the generator that feeds this function follows the code)
      import os
      # hashlib's md5 gives each image a unique name, so a duplicate image is saved only once
      from hashlib import md5

      import requests


      def save_image(item):
          # create a folder, named after the article title, to store the images
          if not os.path.exists(item.get('title')):
              os.mkdir(item.get('title'))
          try:
              response = requests.get(item.get('image'))
              if response.status_code == 200:
                  # build the storage path for the image
                  file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
                  if not os.path.exists(file_path):
                      # the response content is the binary image data, so write it in 'wb' mode
                      with open(file_path, 'wb') as f:
                          f.write(response.content)
                  else:
                      print('Already Downloaded', file_path)
          except requests.ConnectionError:
              print('Failed to Save Image')
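      save_image expects each item to be a dict with 'title' and 'image' keys. As a hedged sketch (not the article's code), a generator like the one below could produce those items from the Ajax JSON, assuming the response layout used in the book's example; the field names 'data', 'title', 'image_list' and 'url' are assumptions and may differ on the current site:

      def get_images(json):
          # walk the 'data' array of the Ajax response and yield one dict per image
          if json and json.get('data'):
              for entry in json.get('data'):
                  title = entry.get('title')
                  for image in entry.get('image_list') or []:
                      yield {
                          'image': image.get('url'),
                          'title': title,
                      }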
    • Scheduling the crawl with a multiprocessing pool
      from multiprocessing.pool import Pool


      def main(offset):
          json = get_page(offset)
          for item in get_images(json):
              print(item)
              save_image(item)


      # fixed values are kept in uppercase variables so the offset range is easy to change
      GROUP_START = 1
      GROUP_END = 20

      if __name__ == '__main__':
          # create the process pool
          pool = Pool()
          groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])

          # map is the pool's key method: it hands each offset in the list to a worker process
          pool.map(main, groups)
          pool.close()
          pool.join()
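      A note on the multiprocessing code above: worker processes may import this module again (on platforms that use the spawn start method, such as Windows), so the pool creation and the pool.map call are kept under the if __name__ == '__main__' guard, while main, get_page, get_images and save_image stay at module level where the workers can find them. pool.map blocks until every offset has been handled; close() then stops the pool from accepting new work, and join() waits for the workers to exit.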

Origin: www.cnblogs.com/waws1314/p/12502404.html