Scrapy Study Notes 2

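Following on from the previous post, where we defined a custom spider: Scrapy's engine calls the spider's start_requests method, which returns Request objects (not a response), and the scheduler queues them. Each downloaded page then comes back as a Response object. The parse() method we defined afterwards overrides the one inherited from the parent class. It is the default callback, so Scrapy calls it automatically and you never need to call it by hand.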

Today's topics:

1. You can run scrapy straight from the shell to fetch a URL

    scrapy shell "http://quotes.toscrape.com/page/1/"

Inside this shell you can use CSS selectors to locate the elements you want (the object to use at this point is response)

>>> response.css('title')

To pull the data out of title, add a call to extract(), e.g. response.css('title').extract()

Regular expression matching is also supported through re(), e.g. response.css('title::text').re('')
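To get a feel for what re() hands back, here is a small standard-library sketch. It is not Scrapy's implementation: re_like is a hypothetical helper that only mimics the "flat list of strings" shape of the result, and the title text "Quotes to Scrape" is assumed to be what title::text returns for this site.

```python
import re


def re_like(pattern, text):
    """Hypothetical helper: return all matches as a flat list of strings."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            # Several capture groups: flatten them into the result list
            flat.extend(match)
        else:
            flat.append(match)
    return flat


# Assumed text of the <title> element on the example site
title_text = "Quotes to Scrape"

whole = re_like(r"Quotes.*", title_text)
first_word = re_like(r"Q\w+", title_text)
groups = re_like(r"(\w+) to (\w+)", title_text)
```

With no capture groups you get the whole match back; with groups you get the captured pieces.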

2. Storing the data

 The simplest way is to store it as JSON: scrapy crawl *** -o ***.json. At the end this produces a JSON file holding the scraped data.

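 Here is the catch: the next time you run this command, the new data gets appended to the existing JSON file, and the result is no longer valid JSON. So how do you keep appending to the same file?

Use JSON Lines. Each item is written as its own line, so new records can simply be appended.

scrapy crawl *** -o ***.jl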
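Why appending is safe with JSON Lines can be shown with nothing but the standard library. This is a rough sketch, not what Scrapy's exporter does internally; append_items, read_items, the file name quotes.jl and the sample records are all made up for illustration.

```python
import json
import os
import tempfile


def append_items(path, items):
    """Append each item as one JSON document per line."""
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(path):
    """Parse the file line by line; every line is valid JSON on its own."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "quotes.jl")

append_items(path, [{"author": "A", "text": "first run"}])   # first "crawl"
append_items(path, [{"author": "B", "text": "second run"}])  # second "crawl"

items = read_items(path)
```

Doing the same with a plain .json file would leave two top-level arrays in one file, which a JSON parser rejects.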

If you want to perform more complex things with the scraped items, you can write an Item Pipeline.
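As a rough idea of what such a pipeline looks like: it is a plain class with a process_item(self, item, spider) method that returns the item. The class below is a made-up example (StripQuotesPipeline is not part of Scrapy) that trims whitespace and the curly quotation marks around the text field, assuming items are dicts like the ones yielded in these notes.

```python
class StripQuotesPipeline:
    """Hypothetical pipeline: tidy up the 'text' field of each item."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text is not None:
            # Remove surrounding whitespace, then the curly quote characters
            item["text"] = text.strip().strip("\u201c\u201d")
        return item


# Outside of a crawl the class can be exercised by hand; spider is unused here
pipeline = StripQuotesPipeline()
cleaned = pipeline.process_item(
    {"text": " \u201cHello\u201d ", "author": "X"}, spider=None
)
```

In a real project the class would live in the project's pipelines module and be switched on through the ITEM_PIPELINES setting; the exact module path depends on your project name.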

3. Extracting links (href)

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

You can use

response.css('li.next a::attr(href)').extract_first(), which gives you '/page/2/'

Example code

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Follow the "Next" link, if the page has one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

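In short: while parsing the first page, the spider extracts the link to the next page and then issues another request for it through Request.

One thing to note: the extracted URL is relative, so response.urljoin() is used here to join it onto the URL of the current page and get an absolute URL.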
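The joining itself behaves like urljoin from Python's standard library, with the URL of the current page as the base. A minimal sketch, using the page URL from the example above; the "../2/" and example.com inputs are extra cases added only for illustration:

```python
from urllib.parse import urljoin

# URL of the page being parsed (the base) and the href extracted from it
page_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"

# A root-relative href replaces the whole path of the base URL
absolute = urljoin(page_url, next_href)

# A path-relative href is resolved against the base URL's directory
from_parent = urljoin(page_url, "../2/")

# An href that is already absolute is returned unchanged
already_absolute = urljoin(page_url, "http://example.com/other/")
```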

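The documentation describes Scrapy's link-following mechanism like this:

What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.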
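The idea can be mimicked in a few lines of plain Python. This is only a toy model, not how Scrapy is implemented: FAKE_SITE stands in for the network, and this Request class and crawl function are made up for the sketch. A callback yields either items or new requests, and the loop queues each request together with the callback that should handle its response.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (quotes on the page, next URL or None)
FAKE_SITE = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


class Request:
    """Toy request: a URL plus the callback to run once it has been fetched."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url, page):
    quotes, next_url = page
    for text in quotes:
        yield {"text": text, "url": url}
    if next_url is not None:
        # Yielding a Request asks the loop to schedule it for later
        yield Request(next_url, callback=parse)


def crawl(start_url):
    queue = deque([Request(start_url, callback=parse)])
    scraped = []
    while queue:
        request = queue.popleft()
        page = FAKE_SITE[request.url]  # pretend download
        for result in request.callback(request.url, page):
            if isinstance(result, Request):
                queue.append(result)   # schedule the follow-up request
            else:
                scraped.append(result) # collect the scraped item
    return scraped


scraped = crawl("/page/1/")
```

The loop stops on its own once a page no longer yields a next request, which is the same reason the real spider stops at the last page.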

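OK, that is all it takes to loop through the next-page links until the last page. This only works when every page has the same layout, though. If the layouts differ, you need to define your own extraction rules for each one.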

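Finally, here is an example where the layouts differ. We start by fetching the quotes listing page.

Clicking the (about) link after an author's name opens that author's detail page.

From that page we want to extract the author information (name, birth date and bio). The two pages are laid out differently, so you have to define a separate parse method for the author page. (PS: all of my examples come from the official docs; if my explanation is unclear, you can read the documentation directly.)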

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Interestingly, when requesting pages Scrapy will not request the same page twice, which effectively avoids duplicated data. This can be configured by the setting DUPEFILTER_CLASS.
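The effect is easy to picture with a toy filter. The sketch below is an assumption-laden simplification: Scrapy's default filter works on request fingerprints rather than raw URL strings, and SeenUrlFilter plus the sample links are invented here. It shows why the same author page, linked from several quotes, is only fetched once.

```python
class SeenUrlFilter:
    """Hypothetical duplicate filter that remembers every URL it has seen."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        """Return True if the URL was seen before, otherwise record it."""
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


# Made-up author links as they might be collected from one listing page;
# the same author appears under more than one quote
author_links = [
    "/author/Albert-Einstein/",
    "/author/J-K-Rowling/",
    "/author/Albert-Einstein/",
]

dupefilter = SeenUrlFilter()
to_fetch = [url for url in author_links if not dupefilter.request_seen(url)]
```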
