Scrapy reference docs (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
Scrapy reference docs (English): https://doc.scrapy.org/en/latest/intro/tutorial.html
1. Install Scrapy
Once your Python virtual environment is fully configured, activate it and install the packages you need inside it.
pip install twisted may fail; in that case, download Twisted through another channel and install it into your virtual environment.
First install pypiwin32: on Windows, Scrapy will raise an exception at runtime if pypiwin32 is missing.
The following packages need to be installed individually with pip install:
pip install constantly
pip install Automat
pip install hyperlink
pip install incremental
pip install zope.interface
Once all of the above are installed, install Scrapy itself: pip install scrapy
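After installation, a quick sanity check (not part of the original notes) is to print the installed version:
scrapy version
If that prints a version number without errors, the installation succeeded.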
2. Create a Scrapy project
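A minimal sketch of the commands involved (the project name novelspider is an assumption, chosen to match the NovelspiderItem class in the example below; the spider name and domain match section 5):
scrapy startproject novelspider
cd novelspider
scrapy genspider novel readnovel.com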
3. Configure the virtual environment's interpreter
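This step depends on your IDE; in PyCharm, for example, point the project interpreter at the python executable inside your virtual environment. If you work from the command line instead, activate the environment first (Windows example; the environment name scrapy_env is just an illustration):
scrapy_env\Scripts\activate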
4. Scrapy project structure
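For reference, a freshly generated project (here assuming the name novelspider from step 2) looks roughly like this:
novelspider/
    scrapy.cfg            # deployment configuration file
    novelspider/          # the project's Python module
        __init__.py
        items.py          # Item definitions
        middlewares.py    # downloader/spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for your spiders
            __init__.py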
5. A simple example
# -*- coding: utf-8 -*-
# novel.py
import scrapy

from ..items import NovelspiderItem


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['readnovel.com']
    # Double-check start_urls; the generated default sometimes starts with http instead of https.
    start_urls = ['https://www.readnovel.com/rank/hotsales/']
    number = 2

    def parse(self, response):
        self.number += 1
        # Parse the response object.
        all_divs = response.xpath('//div[@class="book-mid-info"]')
        for div in all_divs:
            # extract_first(default): try to extract the first matching element;
            # fall back to the default value on failure.
            href = div.xpath('.//h4/a/@href').extract_first(default='')
            detail_url = 'https://www.readnovel.com' + href
            title = div.xpath('.//h4/a/text()').extract_first(default='')
            author = div.xpath('.//p[@class="author"]/a[contains(@class, "name")]/text()').extract_first('')
            # The meta parameter passes data along to the callback parse_detail_page.
            # Each detail-page Request is yielded into the scheduler's queue, waiting to be executed.
            # dont_filter=False: duplicate requests are filtered out (this is the default).
            yield scrapy.Request(url=detail_url, callback=self.parse_detail_page,
                                 meta={'title': title, 'author': author}, dont_filter=False)
            # Once all fields for this book are parsed, yield the Item (one Item per book).
            novel = NovelspiderItem()
            novel["title"] = title
            novel["author"] = author
            # print(title, author)
            yield novel
        # Build the link to the next page, construct a Request for it,
        # and yield that Request into the scheduler's queue.
        # for x in range(2, 20):
        #     print('start crawling page:', x)
        if self.number <= 3:
            print('crawling page:', self.number)
            next_href = 'https://www.readnovel.com/rank/hotsales?pageNum={}'.format(self.number)
            yield scrapy.Request(url=next_href, callback=self.parse)

    def parse_detail_page(self, response):
        # response.meta gives access to the key-value pairs passed via the meta argument.
        print('---', response.url, response.meta['title'], response.meta['author'])
# -*- coding: utf-8 -*-
# items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
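Items yielded by the spider are handed to the item pipelines. The example above does not define one, but as a rough sketch, a minimal pipelines.py that writes every novel to a JSON-lines file might look like this (the class name follows Scrapy's generated default for a project called novelspider; this is not part of the original example):

# pipelines.py (illustrative sketch)
import json


class NovelspiderPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts.
        self.file = open('novels.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()

To activate it, register it in settings.py: ITEM_PIPELINES = {'novelspider.pipelines.NovelspiderPipeline': 300}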
Scrapy does not support breakpoint debugging out of the box, but you can work around that with a small debug.py script:
# debug.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'novel'])
# Right-click this file and choose Debug
Alternatively, run the spider from cmd with: scrapy crawl <spider name> (here: scrapy crawl novel)
If, after starting the spider, the crawl is blocked with messages mentioning robots.txt,
go to settings.py and set ROBOTSTXT_OBEY to False.
# By default, the Scrapy framework obeys the robots.txt protocol, which declares which URLs of a site may be requested and which may not.
# The default is True; set it to False to ignore the protocol.
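The relevant line in settings.py:
# Scrapy obeys robots.txt rules by default; set this to False to ignore them.
ROBOTSTXT_OBEY = False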