Basic framework for building a Scrapy crawler in Python

First, create a project

# Run in the command line
scrapy startproject xxx    # create a project named xxx
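For reference, startproject generates roughly the layout below (xxx is the project name from the command above; middlewares.py appears in newer Scrapy versions):

    xxx/
        scrapy.cfg            # deployment configuration
        xxx/
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider/downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # spider files go here
                __init__.py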

Second, write the item file

# Define one field for each piece of data to be crawled
name = scrapy.Field()    # example
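A minimal sketch of an items.py, assuming an item class named TencentItem (the class name and field names here are illustrative, not taken from a real project):

    import scrapy

    class TencentItem(scrapy.Item):
        # one Field per piece of data to crawl
        positionname = scrapy.Field()    # illustrative field
        positionlink = scrapy.Field()    # illustrative field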

Third, write the crawler file in the spiders directory

① Write the crawler file by hand: create a new .py file and name the spider yourself
② Create the crawler file with the command: scrapy genspider yyy "xxx.com"
        The spider name (yyy) must not be the same as the project name; "xxx.com" is the domain the crawl is limited to (a generated skeleton is sketched below)

Fourth, write the spider file

start_urls    # the URLs the spider requests first when it starts, within the crawl domain

# Build the item object
item = TencentItem()    # TencentItem is imported from the items file
item['xxx'] = each.xpath("./td[1]/a/text()").extract()[0]
# xpath() returns a list of selectors; extract() converts them to strings, and [0] takes the first string out of the list
yield item    # send the item to the pipeline

# More than one page is crawled, so a new request with a callback is needed:
# the request is resent to the scheduler, queued, and handed to the downloader;
# after each page is processed, the request for the next page is sent.
yield scrapy.Request(url, callback=self.parse)
# parse is the callback name; when the request comes back it triggers the callback, and yield sends the request to the scheduler.
# Written after the extraction loop, it runs once each page is finished.
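Putting the pieces above together, a hedged sketch of a complete spider for the Tencent job-listing example these notes appear to describe; the URLs, XPath expressions, field name, and stop condition are assumptions, not verified selectors:

    import scrapy
    from TencentSpider.items import TencentItem   # assumed project/module name

    class TencentSpider(scrapy.Spider):
        name = "tencent"
        allowed_domains = ["tencent.com"]
        start_urls = ["https://hr.tencent.com/position.php?start=0"]  # illustrative URL
        offset = 0

        def parse(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                item = TencentItem()
                # xpath() returns a selector list, extract() converts it to strings,
                # [0] takes the first string out of the list
                item["positionname"] = each.xpath("./td[1]/a/text()").extract()[0]
                yield item    # send the item to the pipeline

            # more than one page: send the next request back to the scheduler,
            # with parse itself as the callback
            if self.offset < 100:          # illustrative stop condition
                self.offset += 10
                next_url = "https://hr.tencent.com/position.php?start=%d" % self.offset
                yield scrapy.Request(next_url, callback=self.parse)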

Fifth, write the pipeline file

    First, define the initialization method
        def __init__(self):
            self.filename = open("xxx", "w")

    def process_item(self, item, spider):
        # dict(item) converts the item into a Python dict,
        # json.dumps() converts that dict into a JSON string
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)    # if writing fails, try text.encode("utf-8")
        return item

    Close the file
        def close_spider(self, spider):
            self.filename.close()
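Collected into one class, a hedged sketch of the whole pipeline (the class name and output filename are illustrative; in Python 3, opening the file in text mode with an explicit encoding avoids the .encode() workaround):

    import json

    class TencentPipeline(object):    # class name is illustrative
        def __init__(self):
            self.filename = open("tencent.json", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # convert the item to a dict, then to a JSON string, one item per line
            text = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(text)
            return item

        def close_spider(self, spider):
            self.filename.close()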

Sixth, configure the settings file
    Find the ITEM_PIPELINES setting and register the pipeline class in it
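For example, in settings.py (the dotted path assumes the project and pipeline class names used above):

    ITEM_PIPELINES = {
        "TencentSpider.pipelines.TencentPipeline": 300,   # lower number = runs earlier (0-1000)
    }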

Seventh, set the request headers
    Find the DEFAULT_REQUEST_HEADERS setting in the settings file and fill in the headers
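A hedged example of that setting in settings.py; the User-Agent string is only a sample:

    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",   # sample UA
    }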

Eighth, run the program

scrapy crawl <spider name>    # the spider's name attribute, not the .py file name

Rewriting the spider with CrawlSpider

First, create the crawler

scrapy genspider -t crawl tencent tencent.com
    # Import the CrawlSpider class and the Rule class
    # from scrapy.spiders import CrawlSpider, Rule
    # LinkExtractor matches incoming links against rules and extracts the links that match
    # from scrapy.linkextractors import LinkExtractor
    # from TencentSpider.items import TencentItem
    class TencentSpider(CrawlSpider):    # inherits from CrawlSpider
        name = "xxx"              # spider name
        allowed_domains = []      # domains the crawl is restricted to
        start_urls = []
        # Regular-expression rule: links in the response that match it are extracted
        pagelink = LinkExtractor(allow=(r"start=\d+",))
        # Rules for requesting the matched links in batches
        rules = [
            # pagelink = the extracted links, callback = method to call on each response,
            # follow = True means keep following matching links
            Rule(pagelink, callback="parseTencent", follow=True)
        ]
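A hedged sketch of the full CrawlSpider version, including the parseTencent callback referenced in the rule; the URLs, XPath expressions, and field name are assumptions:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from TencentSpider.items import TencentItem    # assumed project name

    class TencentSpider(CrawlSpider):
        name = "tencent"
        allowed_domains = ["tencent.com"]
        start_urls = ["https://hr.tencent.com/position.php?start=0"]   # illustrative

        # extract every link whose URL matches start=\d+
        pagelink = LinkExtractor(allow=(r"start=\d+",))

        rules = [
            # follow=True keeps following matching links on the fetched pages
            Rule(pagelink, callback="parseTencent", follow=True),
        ]

        def parseTencent(self, response):
            # note: a CrawlSpider callback must not be named parse
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                item = TencentItem()
                item["positionname"] = each.xpath("./td[1]/a/text()").extract()[0]
                yield item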
    
Douyu example

    Convert the JSON response into Python objects; the "data" field is a list
    data = json.loads(response.text)["data"]
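A hedged sketch of how that might look inside a spider's parse method; the spider name, the API endpoint, and the response structure are assumptions based only on the note above and may be outdated:

    import json
    import scrapy

    class DouyuSpider(scrapy.Spider):    # illustrative spider for the Douyu note
        name = "douyu"
        # assumed JSON API endpoint whose response carries a "data" list
        start_urls = ["http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0"]

        def parse(self, response):
            # json.loads() converts the JSON text into Python objects;
            # the "data" field is a list of records
            data = json.loads(response.text)["data"]
            for record in data:
                yield record    # or fill an Item from each record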
    

 
