Python notes (Scrapy crawler: middleware and custom commands)

I. Middleware

  1. Downloader middleware

    Writing the middleware (create the file in the same directory as settings.py):

    from scrapy.http import HtmlResponse
    from scrapy.http import Request

    class Md1(object):
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your middleware.
            s = cls()
            return s

        def process_request(self, request, spider):
            # Called for each request before it is sent to the downloader.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            print('md1.process_request', request)

            # 1. Return a Response directly (the downloader is skipped)
            # import requests
            # result = requests.get(request.url)
            # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)

            # 2. Return a new Request (it is scheduled instead of the original)
            # return Request('https://dig.chouti.com/r/tec/hot/1')

            # 3. Raise an exception
            # from scrapy.exceptions import IgnoreRequest
            # raise IgnoreRequest

            # 4. Modify the request in place (*)
            # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

            pass

        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            print('md1.process_response', request, response)
            return response

        def process_exception(self, request, exception, spider):
            # Called when the downloader or a process_request() method
            # (from another downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    

    Configuration (in settings.py):

    DOWNLOADER_MIDDLEWARES = {
        # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
        # 'xdb.proxy.XdbProxyMiddleware': 751,
        'xdb.md.Md1': 666,
        'xdb.md.Md2': 667,
    }
    

    Typical uses:

    - adding a user-agent header
    - adding a proxy (a sketch follows below)
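
    Below is a minimal sketch of the proxy case, reusing the XdbProxyMiddleware name from the commented-out entry in the configuration above; the proxy addresses and the user-agent string are placeholders, not values from the original project:

    # xdb/proxy.py
    import random


    class XdbProxyMiddleware(object):
        # Hypothetical proxy pool; replace with real proxy addresses.
        PROXIES = [
            'http://127.0.0.1:8888',
            'http://127.0.0.1:8889',
        ]

        def process_request(self, request, spider):
            # Scrapy's downloader honours request.meta['proxy'].
            request.meta['proxy'] = random.choice(self.PROXIES)
            # Replace the user-agent header on every outgoing request.
            request.headers['User-Agent'] = 'Mozilla/5.0 (example UA)'
            # Returning None lets the request continue down the chain.
            return None

    Enable it in DOWNLOADER_MIDDLEWARES just like Md1 above, e.g. 'xdb.proxy.XdbProxyMiddleware': 751.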
    
  2. Spider middleware

    Writing the middleware (create the file in the same directory as settings.py):

    class Sd1(object):
        # Not all methods need to be defined. If a method is not defined,
        # Scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your middleware.
            s = cls()
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider, i.e. after the downloader
            # middlewares have finished and the engine hands the result
            # to the spider middleware.

            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the spider's callback,
            # i.e. after the callback has yielded its Request/Item objects.

            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.

            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        # Executed only once, when the spider starts.
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.

            # Must return only requests (not items).
            for r in start_requests:
                yield r
    

    Configuration (in settings.py):

    SPIDER_MIDDLEWARES = {
        # 'xdb.middlewares.XdbSpiderMiddleware': 543,
        'xdb.sd.Sd1': 666,
        'xdb.sd.Sd2': 667,
    }
    

    Typical uses:

    - limiting crawl depth (a sketch follows below)
    - adjusting request priority
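
    A minimal sketch of the depth case as a spider middleware; the class name and max_depth value are illustrative, and note that Scrapy already ships this behaviour via its built-in DepthMiddleware and the DEPTH_LIMIT / DEPTH_PRIORITY settings:

    from scrapy.http import Request


    class DepthLimitMiddleware(object):
        # Hypothetical middleware: drop requests deeper than max_depth.
        max_depth = 3

        def process_spider_output(self, response, result, spider):
            # Depth of the response being processed (0 if never recorded).
            depth = response.meta.get('depth', 0)
            for item in result:
                if isinstance(item, Request):
                    # Record how deep the new request would be.
                    item.meta['depth'] = depth + 1
                    if item.meta['depth'] > self.max_depth:
                        # Too deep: drop the request instead of yielding it.
                        continue
                yield item

    Register it in SPIDER_MIDDLEWARES the same way as Sd1 above. Request priority can be adjusted at the same hook by changing the priority attribute of outgoing requests before yielding them.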
    

II. Custom commands

  1. Running a single spider:

    Create a .py file in the same directory as scrapy.cfg:

    from scrapy.cmdline import execute

    if __name__ == '__main__':
        execute(["scrapy", "crawl", "chouti", "--nolog"])
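
    With this file in place, the chouti spider can be started by running the file directly (for example python run.py, if you named the file run.py; the filename is arbitrary) or from an IDE debugger, instead of typing the scrapy crawl command by hand.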
    
  2. Running all spiders:

    1) Create a directory (any name, e.g. commands) at the same level as spiders.
    2) Inside it, create a crawlall.py file (the filename becomes the name of the custom command):

    from scrapy.commands import ScrapyCommand
    from scrapy.utils.project import get_project_settings


    class Command(ScrapyCommand):

        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            # Collect the names of every spider registered in the project
            # and schedule a crawl for each before starting the reactor.
            spider_list = self.crawler_process.spiders.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()
    

    3) In settings.py, add the setting COMMANDS_MODULE = '<project name>.<directory name>'.
    4) Run the command from the project directory: scrapy crawlall
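
    For the names used in this article (project xdb, command directory commands), the setting would look like this:

    # settings.py
    COMMANDS_MODULE = 'xdb.commands'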

Reposted from blog.csdn.net/qq_41433183/article/details/89931627