First, create a project
# Enter on the command line:
scrapy startproject xxx  # create a project named xxx
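For reference, a sketch of the layout this generates (Scrapy's default template; tencent is an assumed project name):
scrapy startproject tencent
tencent/
    scrapy.cfg           # deployment configuration
    tencent/
        __init__.py
        items.py         # item definitions (step two)
        pipelines.py     # item pipelines (step five)
        settings.py      # project settings (steps six and seven)
        spiders/         # spider files live here (step three)
            __init__.py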
Second, write the item file
# Define a Field for each piece of data to be crawled
name = scrapy.Field()  # example
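A minimal sketch of the items file; the class name TencentItem comes from the later examples, and 'xxx' is this document's placeholder field name:
import scrapy

class TencentItem(scrapy.Item):
    # one Field per piece of data to crawl
    xxx = scrapy.Field()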
Third, create the spider file in the spiders directory
① Manually: create a new .py file and write the spider by hand
② By command: scrapy genspider yyy "xxx.com"  # the spider name cannot be the same as the project name; the second argument is the domain to crawl (a generated skeleton is shown below)
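For reference, the command generates a skeleton roughly like this (from Scrapy's default template):
import scrapy

class YyySpider(scrapy.Spider):
    name = "yyy"                      # spider name used by scrapy crawl
    allowed_domains = ["xxx.com"]     # restrict crawling to this domain
    start_urls = ["http://xxx.com/"]  # first URLs requested

    def parse(self, response):
        pass  # fill in the parsing logic here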
Fourth, write the spider file
start_urls  # the URLs the spider requests when it first runs

# Initialize the item (model) object
item = TencentItem()  # TencentItem is imported from the items file
item['xxx'] = each.xpath("./td[1]/a/text()").extract()[0]
# xpath returns a list of selectors; extract() converts them to strings, and [0] takes the first string in the list
yield item  # hand the item to the pipeline

# Since more than one page is crawled, a new request with a callback is needed: the request is sent back to the scheduler, queued, and passed to the downloader; after each page is processed, the request for the next page is sent
yield scrapy.Request(url, callback=self.parse)
# parse is the callback function name; when the response for the request arrives, the callback is triggered, and yield sends the request to the scheduler
# yield emits each result as the loop runs, instead of returning everything once at the end
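Putting the pieces together, a minimal sketch of the whole spider; the start URL, the row xpath, and the next-page xpath are assumptions for illustration:
import scrapy
from TencentSpider.items import TencentItem

class TencentSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    start_urls = ["https://hr.tencent.com/position.php"]  # assumed URL

    def parse(self, response):
        # assumed xpath for the listing rows on the page
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            item['xxx'] = each.xpath("./td[1]/a/text()").extract()[0]
            yield item  # hand each item to the pipeline
        # assumed xpath for the next-page link; stop when there is none
        next_page = response.xpath("//a[@id='next']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)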
Fifth, write the pipeline file
Define the initialization method first, then process_item, and close the file in close_spider:

import json

class TencentPipeline(object):  # class name assumed; it must match ITEM_PIPELINES in settings
    def __init__(self):
        self.filename = open("xxx", "w")

    def process_item(self, item, spider):
        # dict(item) converts the item into a Python dictionary,
        # and json.dumps converts the dictionary into JSON format
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)  # if writing fails, add .encode("utf-8")
        return item

    def close_spider(self, spider):
        self.filename.close()  # close the file when the spider finishes
Sixth, configure the settings file
Find the ITEM_PIPELINES setting and register the pipeline file in it (see the sketch below)
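A sketch of the setting, assuming the project is named TencentSpider and the pipeline class is TencentPipeline:
ITEM_PIPELINES = {
    "TencentSpider.pipelines.TencentPipeline": 300,  # lower number = runs earlier in the pipeline order
}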
Seventh, set the request headers
Find the DEFAULT_REQUEST_HEADERS setting in the settings file (see the sketch below)
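A sketch; the header values are assumptions:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # assumed UA string
}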
Eighth, run the program
scrapy crawl <spider name>  # the name attribute defined in the spider, not the file name
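For example, with the spider above (the -o flag, built into Scrapy, exports the scraped items to a file):
scrapy crawl tencent
scrapy crawl tencent -o tencent.json  # also write the items to a JSON file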
Rewriting the spider with CrawlSpider
First, create the spider
scrapy genspider -t crawl tencent tencent.com

# Import the CrawlSpider class and the Rule class
from scrapy.spiders import CrawlSpider, Rule
# LinkExtractor extracts the links in a response that match a rule
from scrapy.linkextractors import LinkExtractor
from TencentSpider.items import TencentItem

class TencentSpider(CrawlSpider):  # inherits from CrawlSpider
    name = "xxx"  # spider name
    allowed_domains = []  # restrict crawling to these domains
    start_urls = []
    # regular-expression rule: extract the links in the response that match it
    pagelink = LinkExtractor(allow=(r"start=\d+",))
    # rules for requesting the matched links in batches
    rules = [
        # pagelink = the extracted links; callback = the method called on each response;
        # follow = True means keep following links found in those pages
        Rule(pagelink, callback="parseTencent", follow=True),
    ]

Douyu: convert the JSON response into Python format; the data field is a list
data = json.loads(response.text)['data']
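A minimal sketch of the parseTencent callback referenced in the rules; the xpath and the field name are assumptions carried over from the earlier example:
def parseTencent(self, response):
    # note: a CrawlSpider callback must not be named parse,
    # because CrawlSpider uses parse internally
    for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
        item = TencentItem()
        item['xxx'] = each.xpath("./td[1]/a/text()").extract()[0]
        yield item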