Scrapy-02: Item Pipeline, Shell, Selector



 

  • Item Pipeline:
    • Scrapy provides the Item object for holding scraped data. It is used much like a dictionary, but unlike a plain dictionary an Item adds a protection mechanism that catches typos and assignments to undefined fields (see the short sketch after the items.py listing below).
    • An Item class must inherit from scrapy.Item, and its fields are declared with scrapy.Field(). (We are scraping the novel Daomu Biji ("Tomb Notes") and only need two fields: the chapter title and its content.)
    • To define the Item, edit items.py:
    •  # -*- coding: utf-8 -*-

       # Define here the models for your scraped items
       #
       # See documentation in:
       # https://doc.scrapy.org/en/latest/topics/items.html

       import scrapy


       class BooksItem(scrapy.Item):
           # define the fields for your item here like:
           # name = scrapy.Field()
           title = scrapy.Field()
           content = scrapy.Field()
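    • A quick illustration of that protection mechanism: a minimal sketch, assuming the project is named books (as in the ITEM_PIPELINES setting further down), run from a Python shell inside the project:
       from books.items import BooksItem

       item = BooksItem()
       item['title'] = 'Chapter 1'   # fine: "title" is a declared field
       item['titel'] = 'oops'        # typo: raises KeyError, unlike a plain dict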
    • Parsing the response and using the Item in the spider:
       # -*- coding: utf-8 -*-
       import scrapy
       from ..items import BooksItem

       class DmbjSpider(scrapy.Spider):
           name = 'dmbj'
           allowed_domains = ['www.cread.com']
           start_urls = ['http://www.cread.com/chapter/811400395/69162457.html/']

           def parse(self, response):
               item = BooksItem()
               item['title'] = response.xpath('//h1/text()').extract_first()
               item['content'] = response.xpath('//div[@class="chapter_con"]/text()').extract_first()
               yield item
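    • One caveat about the spider above: extract_first() keeps only the first matching text node. If the chapter body is split across several text nodes inside the div (an assumption about the page, not verified here), the pieces can be joined instead; a sketch of an alternative parse:
       def parse(self, response):
           item = BooksItem()
           item['title'] = response.xpath('//h1/text()').extract_first()
           # join all text nodes of the chapter body instead of keeping only the first
           item['content'] = '\n'.join(
               response.xpath('//div[@class="chapter_con"]/text()').extract()
           )
           yield item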
    • The corresponding pipeline, in pipelines.py:
       # -*- coding: utf-8 -*-

       # Define your item pipelines here
       #
       # Don't forget to add your pipeline to the ITEM_PIPELINES setting
       # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


       class BooksPipeline(object):
           def process_item(self, item, spider):
               with open('files/{}.txt'.format(item['title']), 'w+') as f:
                   f.write(item['content'])
               return item

           def open_spider(self, spider):
               # called when the spider starts
               pass

           def close_spider(self, spider):
               # called when the spider closes
               pass
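    • Note that open('files/...') fails if the files directory does not exist. open_spider is a natural place to create it; a minimal sketch (the directory check is an addition, not part of the original code):
       import os

       class BooksPipeline(object):
           def open_spider(self, spider):
               # create the output directory once, when the spider starts
               if not os.path.exists('files'):
                   os.makedirs('files')

           def process_item(self, item, spider):
               with open('files/{}.txt'.format(item['title']), 'w+') as f:
                   f.write(item['content'])
               return item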

       In the parse method, import the Item class you need, instantiate it, and then assign to the instance exactly as you would to a dictionary; every key you assign must match a field name defined on the Item class.

    • Then yield the item so it is handed on to the pipelines.
    • Three methods are defined in pipelines.py:
      • process_item:
        • processes each item returned by parse, then returns it (see the sketch after this list)
      • open_spider:
        • called automatically when the spider starts
      • close_spider:
        • called automatically when the spider closes
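    • Besides returning the item, process_item may raise DropItem to discard it; later pipelines then never see it. A sketch with a hypothetical filter (SkipEmptyPipeline is not part of this project):
       from scrapy.exceptions import DropItem

       class SkipEmptyPipeline(object):
           def process_item(self, item, spider):
               # discard chapters whose body came back empty
               if not item.get('content'):
                   raise DropItem('empty chapter: {}'.format(item.get('title')))
               return item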
    • To enable a pipeline defined in pipelines.py, register it in the ITEM_PIPELINES dictionary in settings.py:
    •  ITEM_PIPELINES = {
           'books.pipelines.BooksPipeline': 300,
       }
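    • The number (300 here) sets the pipeline's order: an integer, customarily between 0 and 1000, and pipelines with lower values run earlier. For example, with the hypothetical SkipEmptyPipeline from the sketch above enabled as well:
       ITEM_PIPELINES = {
           'books.pipelines.SkipEmptyPipeline': 200,  # runs first (lower value)
           'books.pipelines.BooksPipeline': 300,      # runs second
       }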
  • shell
    • scrapy shell is Scrapy's interactive debugging tool. If IPython is installed in the current environment it is used by default; the shell can also be chosen explicitly in scrapy.cfg under [settings]: shell = ipython
    • Using scrapy shell:
      • In a terminal, run: scrapy shell [url]. The url is the page to fetch; it is optional and may also be a local file path.
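      • A sketch of a session, reusing the URL and XPath from the spider above (output omitted):
         $ scrapy shell 'http://www.cread.com/chapter/811400395/69162457.html/'
         >>> response.url                                   # the page that was fetched
         >>> response.xpath('//h1/text()').extract_first()  # same XPath as in the spider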
    • fetch:
      • fetch(url) takes a URL, builds a new Request object from it, downloads it, and replaces the shell's response with the new response.
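      • For example, inside an open shell session (a sketch reusing the URL from the spider above):
         >>> fetch('http://www.cread.com/chapter/811400395/69162457.html/')
         >>> response.url   # response now refers to the newly fetched page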


Origin: www.cnblogs.com/ivy-blogs/p/10909219.html