A Detailed Guide to the Python Scrapy Crawler Framework

Scrapy

Scrapy mind map download link

Installing the Scrapy Framework

lxml

  • https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml

pyopenssl

  • https://pypi.python.org/pypi/pyOpenSSL#downloads

twisted

  • https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

pywin32

  • https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/

scrapy

  • pip install scrapy

The Scrapy Command Line in Detail

scrapy -h

  • scrapy -h

    • Global commands:

      • startproject

        • scrapy startproject myproject [project_dir]

          • Create a Scrapy project; myproject is the project name, created under the project_dir directory (see the layout sketch after this command list)

          • cd project_dir

            • Enter the new project directory

            • tree

              • Show the directory tree
        • scrapy startproject zhihu

      • genspider

        • scrapy genspider -l

          • List the available spider templates

          • scrapy genspider -t crawl zhihu zhihu.com

            • Generate a spider named zhihu for zhihu.com using the crawl template
        • scrapy genspider mydomain mydomain.com

          • Create a new spider
      • settings

        • scrapy settings [options]
      • runspider

        • scrapy runspider <spider_file.py>
      • shell

        • scrapy shell [url]

          • Interactive shell for trying selectors and inspecting responses
          • scrapy shell http://www.baidu.com
      • fetch

        • scrapy fetch

          • Download a page and print its content

          • scrapy fetch http://www.baidu.com

            • scrapy fetch --nolog http://www.baidu.com

              • Suppress the log output
            • scrapy fetch --nolog --headers http://www.baidu.com

              • Suppress the log and print the response headers instead of the body
            • scrapy fetch --nolog --headers --no-redirect http://www.baidu.com

              • Suppress the log, print the headers, and do not follow redirects
      • view

        • scrapy view

          • Download a page and open it in the browser as Scrapy "sees" it

          • scrapy view http://www.baidu.com

            • Useful for debugging and testing
      • version

        • scrapy version [-v]

          • Print the Scrapy version; with -v, also the versions of its dependency libraries
    • Project-only commands:

      • crawl

        • scrapy crawl

          • Run a spider from the project
          • scrapy crawl zhihu
      • check

        • scrapy check [-l]

          • Run contract checks to find errors in the spider code
          • scrapy check zhihu
      • list

        • scrapy list

          • List the names of all spiders in the project
      • edit

        • scrapy edit

          • Edit a spider from the command line
          • scrapy edit zhihu
      • parse

        • scrapy parse [options]
      • bench

        • scrapy bench

          • Run a quick benchmark to gauge crawl performance in the current environment
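
As referenced in the startproject entry above, here is a sketch of the directory layout that scrapy startproject myproject typically generates (file names as in recent Scrapy releases; your version may differ slightly):

    myproject/
        scrapy.cfg            # deploy configuration file
        myproject/            # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # downloader / spider middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # directory where spiders live
                __init__.py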

Scrapy Selector Usage

response.selector.css

  • response.selector.css('title::text').extract_first()

    • Returns the text of the title element

response.selector.xpath

  • response.selector.xpath('//title/text()')

    • Returns a list (a SelectorList of Selector objects)
  • response.selector.xpath('//title/text()').extract_first()

    • Returns the text of the title element

response.xpath('//title/text()')

  • Returns a SelectorList of Selector objects

    • response.xpath('//a/text()')

      • Get the text inside a tags
    • response.xpath('//a/@href')

      • Get the href attribute of a tags
    • response.xpath('//title/text()').extract_first()

      • Returns the text of the title element
    • response.xpath('//div[@id="images"]')

      • Get the div

      • response.xpath('//div[@id="images"]').css('img')

        • Chain a CSS selector to get the img tags inside the div (Selector objects)

        • response.xpath('//div[@id="images"]').css('img::attr(src)')

          • The double colon with attr(name) selects that attribute; this gets the src attribute of the images (Selector objects)

          • response.xpath('//div[@id="images"]').css('img::attr(src)').extract()

            • Get the src values of the img tags inside the div (extract_first returns the first match, extract returns all of them)

            • response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first(default='')

              • If nothing matches, extract_first returns the default value instead of raising an error
    • response.xpath('//a/@href')

      • Returns the href values of the a links (Selector objects)

      • response.xpath('//a/@href').extract()

        • Returns the href values as strings
    • response.xpath('//a/text()')

      • Add /text() to get the text; returns the text of the a tags (Selector objects)

      • response.xpath('//a/text()').extract()

        • Returns the list of a text values
    • Advanced matching

      • response.xpath('//a[contains(@href, "image")]/@href')

        • a links whose href attribute contains "image"

        • contains(attribute, value)

        • [contains(@href, "image")]

        • (Selector objects)

        • response.xpath('//a[contains(@href, "image")]/@href').extract()

          • Extract the attribute values
      • response.xpath('//a[contains(@href, "image")]/img/@src').extract()

        • Get the src of the img inside a links whose href contains "image"

response.css('title::text')

  • Returns a SelectorList of Selector objects

    • response.css('a::text')

      • Get the text inside a tags
    • response.css('a::attr(href)')

      • Get the href attribute of a tags
    • response.css('title::text').extract_first()

      • Returns the text of the title element
    • response.css('a::attr(href)')

      • Returns the href values of the a links (Selector objects)

      • response.css('a::attr(href)').extract()

        • Returns the href values as strings
    • response.css('a::text')

      • Add ::text to get the text; returns the text of the a tags (Selector objects)

      • response.css('a::text').extract()

        • Returns the list of a text values
    • Advanced matching

      • response.css('a[href*=image]::attr(href)')

        • Matches a links whose href contains "image" (Selector objects)

        • response.css('a[href*=image]::attr(href)').extract()

          • Extract the links
      • response.css('a[href*=image] img::attr(src)').extract()

        • Get the src of the img inside a links whose href contains "image"

Regular expressions

  • response.css('a::text').re('Name:(.*)')

    • Extract what follows "Name:" inside each a tag; returns multiple values

    • My image 1

    • My image 2

    • My image 3

    • response.css('a::text').re_first('Name:(.*)')

      • Extract only the first value that follows "Name:" inside an a tag

      • My image 1

      • response.css('a::text').re_first('Name:(.*)').strip()

        • Extract the first value after "Name:" and strip the surrounding whitespace
        • My image 1
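
To tie the css, xpath, and re usage above together, here is a small standalone sketch; the HTML snippet is illustrative (modeled on the "My image" examples above), and a response object inside a spider supports the same calls:

    from scrapy import Selector

    # Illustrative markup, modeled on the "My image" examples above.
    html = '''
    <div id="images">
        <a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"/></a>
        <a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"/></a>
    </div>
    '''
    sel = Selector(text=html)

    sel.css('a::attr(href)').extract()                  # ['image1.html', 'image2.html']
    sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
    #                                                   ['image1_thumb.jpg', 'image2_thumb.jpg']
    sel.css('a::text').re_first('Name:(.*)').strip()    # 'My image 1'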

Spiders

name

allowed_domains

start_urls

custom_settings

  • Per-spider settings, e.g. setting the User-Agent (not set by default in settings.py)

crawler

settings

logger

from_crawler(crawler, *args, **kwargs)

start_requests()

parse(response)

log(message[, level, component])

closed(reason)
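
A minimal sketch that ties these attributes and methods together (the spider name and URLs are illustrative; start_requests and closed are shown only to indicate where they hook in):

    import scrapy

    class ZhihuSpider(scrapy.Spider):
        name = 'zhihu'
        allowed_domains = ['zhihu.com']
        start_urls = ['https://www.zhihu.com']
        # Per-spider settings that override the project settings.py,
        # e.g. the User-Agent mentioned above.
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0',
        }

        def start_requests(self):
            # The default implementation already builds Requests from start_urls;
            # override it only when the initial requests need customizing.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # self.logger is the built-in logger attribute listed above.
            self.logger.info('Crawled %s (status %s)', response.url, response.status)
            yield {'title': response.css('title::text').extract_first()}

        def closed(self, reason):
            self.logger.info('Spider closed: %s', reason)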

Downloader Middleware

process_request(request, spider)

  • import logging

    class ProxyMiddleware(object):
        logger = logging.getLogger(__name__)

        def process_request(self, request, spider):
            self.logger.debug('Using Proxy')
            request.meta['proxy'] = 'http://127.0.0.1:8080'
            return None

  • DOWNLOADER_MIDDLEWARES = {
        'httpbintest.middlewares.HttpbintestDownloaderMiddleware': 543,
    }

process_response(request, response, spider)

  • def process_response(self, request, response, spider):
        response.status = 201
        return response

process_exception(request, exception, spider)
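
The original notes give no example for this hook; a minimal sketch of what it might look like (the class name is hypothetical), which simply logs the failure and lets Scrapy's default exception handling continue:

    import logging

    class ExceptionLogMiddleware(object):
        # Hypothetical middleware: called when the download handler or an
        # earlier process_request raises an exception (e.g. a timeout).
        logger = logging.getLogger(__name__)

        def process_exception(self, request, exception, spider):
            self.logger.debug('Exception %r while fetching %s', exception, request.url)
            # Returning None lets other middlewares and the default handler keep
            # processing the exception; returning a Request would reschedule it,
            # and returning a Response would short-circuit further handling.
            return None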

Item Pipeline

open_spider(self, spider)

close_spider(self, spider)

from_crawler(cls, crawler)

process_item(self, item, spider)

  • Write items to a JSON file

    • import json

      class JsonWriterPipeline(object):

          def open_spider(self, spider):
              self.file = open('items.jl', 'w')

          def close_spider(self, spider):
              self.file.close()

          def process_item(self, item, spider):
              line = json.dumps(dict(item)) + "\n"
              self.file.write(line)
              return item

Activating item pipeline components

  • ITEM_PIPELINES = {
        'myproject.pipelines.PricePipeline': 300,
        'myproject.pipelines.JsonWriterPipeline': 800,
    }

Validating scraped data (checking that items contain certain fields)

  • Price validation

    • from scrapy.exceptions import DropItem

      class PricePipeline(object):

          vat_factor = 1.15

          def process_item(self, item, spider):
              if item.get('price'):
                  if item.get('price_excludes_vat'):
                      item['price'] = item['price'] * self.vat_factor
                  return item
              else:
                  raise DropItem("Missing price in %s" % item)

Cleaning HTML data

  • from scrapy.exceptions import DropItem

    class TextPipeline(object):
        def __init__(self):
            self.limit = 50

        def process_item(self, item, spider):
            if item['text']:
                if len(item['text']) > self.limit:
                    item['text'] = item['text'][0:self.limit].rstrip() + '...'
                return item
            else:
                raise DropItem("Missing Text")

Checking for duplicates (and dropping them)

  • Duplicate filter

    • from scrapy.exceptions import DropItem

      class DuplicatesPipeline(object):

          def __init__(self):
              self.ids_seen = set()

          def process_item(self, item, spider):
              if item['id'] in self.ids_seen:
                  raise DropItem("Duplicate item found: %s" % item)
              else:
                  self.ids_seen.add(item['id'])
                  return item

Taking a screenshot of each item

  • import hashlib
    from urllib.parse import quote

    import scrapy

    class ScreenshotPipeline(object):
        """Pipeline that uses Splash to render a screenshot of
        every Scrapy item."""

        SPLASH_URL = "http://localhost:8050/render.png?url={}"

        def process_item(self, item, spider):
            encoded_item_url = quote(item["url"])
            screenshot_url = self.SPLASH_URL.format(encoded_item_url)
            request = scrapy.Request(screenshot_url)
            dfd = spider.crawler.engine.download(request, spider)
            dfd.addBoth(self.return_item, item)
            return dfd

        def return_item(self, response, item):
            if response.status != 200:
                # Error happened, return item.
                return item

            # Save screenshot to file, filename will be hash of url.
            url = item["url"]
            url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
            filename = "{}.png".format(url_hash)
            with open(filename, "wb") as f:
                f.write(response.body)

            # Store filename in item.
            item["screenshot_filename"] = filename
            return item

Storing items in a database

  • Write items to MongoDB

    • import pymongo

      class MongoPipeline(object):

          collection_name = 'scrapy_items'

          def __init__(self, mongo_uri, mongo_db):
              self.mongo_uri = mongo_uri
              self.mongo_db = mongo_db

          @classmethod
          def from_crawler(cls, crawler):
              return cls(
                  mongo_uri=crawler.settings.get('MONGO_URI'),
                  mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
              )

          def open_spider(self, spider):
              self.client = pymongo.MongoClient(self.mongo_uri)
              self.db = self.client[self.mongo_db]

          def close_spider(self, spider):
              self.client.close()

          def process_item(self, item, spider):
              self.db[self.collection_name].insert_one(dict(item))
              return item
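
The MONGO_URI and MONGO_DATABASE values above are read from the project settings via crawler.settings.get; a sketch of the corresponding settings.py entries (the values and project name are illustrative), together with the pipeline registration:

    # settings.py (illustrative values)
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'items'

    ITEM_PIPELINES = {
        'myproject.pipelines.MongoPipeline': 300,
    }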

Reposted from blog.csdn.net/weixin_43555997/article/details/104176281