Scrapy reference docs (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
Scrapy reference docs (English): https://doc.scrapy.org/en/latest/intro/tutorial.html
1. Install Scrapy
Once your Python virtual environment is fully configured, activate it and install the packages you need inside it.
pip install twisted may fail; in that case, download Twisted through another channel and install it into your virtual environment.
First install pypiwin32: on Windows, Scrapy will raise an exception at runtime if pypiwin32 is missing.
The following packages need to be installed individually with pip install:
pip install constantly
pip install Automat
pip install hyperlink
pip install incremental
pip install zope.interface
Once all of the above are installed, install Scrapy itself: pip install scrapy
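After installation, a quick sanity check (not part of the original notes) is to print the installed version:
scrapy version
If that prints a version number without errors, the installation succeeded.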
2. Create a Scrapy project
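A minimal sketch of the commands involved (the project name novelspider is an assumption, chosen to match the NovelspiderItem class in the example below; the spider name and domain match section 5):
scrapy startproject novelspider
cd novelspider
scrapy genspider novel readnovel.com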
3. Configure the virtual environment's interpreter
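This step depends on your IDE; in PyCharm, for example, point the project interpreter at the python executable inside your virtual environment. If you work from the command line instead, activate the environment first (Windows example; the environment name scrapy_env is just an illustration):
scrapy_env\Scripts\activate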
4. Scrapy project structure
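For reference, a freshly generated project (here assuming the name novelspider from step 2) looks roughly like this:
novelspider/
    scrapy.cfg            # deployment configuration file
    novelspider/          # the project's Python module
        __init__.py
        items.py          # Item definitions
        middlewares.py    # downloader/spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for your spiders
            __init__.py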
5. A simple example
# -*- coding: utf-8 -*-
# novel.py
import scrapy

from ..items import NovelspiderItem


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['readnovel.com']
    # Double-check start_urls; the generated default sometimes starts with http instead of https.
    start_urls = ['https://www.readnovel.com/rank/hotsales/']
    number = 2

    def parse(self, response):
        self.number += 1
        # Parse the response object.
        all_divs = response.xpath('//div[@class="book-mid-info"]')
        for div in all_divs:
            # extract_first(default): try to extract the first matching element;
            # fall back to the default value on failure.
            href = div.xpath('.//h4/a/@href').extract_first(default='')
            detail_url = 'https://www.readnovel.com' + href
            title = div.xpath('.//h4/a/text()').extract_first(default='')
            author = div.xpath('.//p[@class="author"]/a[contains(@class, "name")]/text()').extract_first('')
            # The meta parameter passes data along to the callback parse_detail_page.
            # Each detail-page Request is yielded into the scheduler's queue, waiting to be executed.
            # dont_filter=False: duplicate requests are filtered out (this is the default).
            yield scrapy.Request(url=detail_url, callback=self.parse_detail_page,
                                 meta={'title': title, 'author': author}, dont_filter=False)
            # Once all fields for this book are parsed, yield the Item (one Item per book).
            novel = NovelspiderItem()
            novel["title"] = title
            novel["author"] = author
            # print(title, author)
            yield novel
        # Build the link to the next page, construct a Request for it,
        # and yield that Request into the scheduler's queue.
        # for x in range(2, 20):
        #     print('start crawling page:', x)
        if self.number <= 3:
            print('crawling page:', self.number)
            next_href = 'https://www.readnovel.com/rank/hotsales?pageNum={}'.format(self.number)
            yield scrapy.Request(url=next_href, callback=self.parse)

    def parse_detail_page(self, response):
        # response.meta gives access to the key-value pairs passed via the meta argument.
        print('---', response.url, response.meta['title'], response.meta['author'])
# -*- coding: utf-8 -*-
# items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
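Items yielded by the spider are handed to the item pipelines. The example above does not define one, but as a rough sketch, a minimal pipelines.py that writes every novel to a JSON-lines file might look like this (the class name follows Scrapy's generated default for a project called novelspider; this is not part of the original example):

# pipelines.py (illustrative sketch)
import json


class NovelspiderPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts.
        self.file = open('novels.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()

To activate it, register it in settings.py: ITEM_PIPELINES = {'novelspider.pipelines.NovelspiderPipeline': 300}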
Scrapy does not support breakpoint debugging out of the box, but you can work around that with a small debug.py script:
# debug.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'novel'])
# Right-click this file and choose Debug
Alternatively, run the spider from cmd with: scrapy crawl <spider name> (here: scrapy crawl novel)
If, after starting the spider, the crawl is blocked with messages mentioning robots.txt,
go to settings.py and set ROBOTSTXT_OBEY to False.
# By default, the Scrapy framework obeys the robots.txt protocol, which declares which URLs of a site may be requested and which may not.
# The default is True; set it to False to ignore the protocol.
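The relevant line in settings.py:
# Scrapy obeys robots.txt rules by default; set this to False to ignore them.
ROBOTSTXT_OBEY = False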