1. Create a new Scrapy project:
scrapy startproject tutorial
The generated project directory looks like this:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
2. Writing a custom spider
A custom spider must subclass scrapy.Spider and live in the spiders/ directory:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
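The filename logic in parse can be checked in plain Python: splitting the page URL on "/" leaves the page number in the second-to-last slot.

```python
# derive the output filename the same way parse() does
url = 'http://quotes.toscrape.com/page/1/'
# url.split("/") -> ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
page = url.split("/")[-2]
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```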
The spider class has a few special attributes and methods:
name: a unique identifier for the spider within the project
start_requests(): must return an iterable of Requests (a list, or a generator as above) that the spider starts crawling from
parse(): the callback invoked with the Response downloaded for each request; it typically extracts data into dicts, finds new URLs, and yields further Requests for them
3. Running the spider
Go to the project's top-level directory (the one created at the start of this post) and run:
scrapy crawl quotes
This command runs the spider whose name is quotes.
In the example above, start_requests yields scrapy.Request objects; each time a response is received, Scrapy instantiates a Response object and calls parse to handle it.
Instead of defining start_requests, you can list the URLs to crawl directly as a class attribute named start_urls:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
4. Extracting data
The Scrapy shell lets you experiment with selectors without running the custom spider; start it with:
scrapy shell "http://quotes.toscrape.com/page/1/"
You should see output like this:
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>>
Examples from the documentation:
# select <title> elements; returns a list of selectors
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
# extract the text of <title> elements; returns a list
>>> response.css('title::text').extract()
['Quotes to Scrape']
# same, but return only the first match
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
# select <div> elements whose class is quote
>>> response.css("div.quote")
# select <a> children of <li class="next"> and extract the value of their href attribute
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
# use regular expressions
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
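The .re() calls behave much like re.findall applied to the selected text; a plain-Python check of the same patterns (the title string is copied from the output above):

```python
import re

title = 'Quotes to Scrape'
print(re.findall(r'Quotes.*', title))  # ['Quotes to Scrape']
print(re.findall(r'Q\w+', title))      # ['Quotes']
# with multiple groups, re.findall returns tuples, which Scrapy's
# .re() flattens into a single list
print(re.findall(r'(\w+) to (\w+)', title))  # [('Quotes', 'Scrape')]
```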
# open the HTML page in a browser
>>> view(response)
# use XPath
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
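To get a feel for what `//title/text()` selects without Scrapy, the standard library's xml.etree.ElementTree supports a small XPath subset; the toy document below stands in for the real page:

```python
import xml.etree.ElementTree as ET

# a minimal stand-in for the real page; ElementTree's './/title'
# matches the same element that '//title' does in Scrapy's shell
doc = ET.fromstring('<html><head><title>Quotes to Scrape</title></head></html>')
title = doc.find('.//title')
print(title.text)  # Quotes to Scrape
```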
XPath is the foundation of Scrapy's selectors; CSS selectors are translated to XPath under the hood.
The examples above extract data in the shell; the same extraction can be done inside the parse method, using the yield keyword:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Running this spider produces output like the following:
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
The example above only yields extracted items; parse can also yield the next URLs to crawl, again using the yield keyword:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Because next_page may be a relative path, urljoin is used to guarantee an absolute URL; the callback argument of scrapy.Request names the method to call once this URL has been fetched.
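response.urljoin behaves like the standard library's urllib.parse.urljoin, resolving the link against the page's own URL:

```python
from urllib.parse import urljoin

# resolve the relative link found on page 1 against the page URL,
# just as response.urljoin does
next_page = urljoin('http://quotes.toscrape.com/page/1/', '/page/2/')
print(next_page)  # http://quotes.toscrape.com/page/2/
```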
The code above has to convert relative URLs to absolute ones itself; response.follow handles that automatically:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
If several links need to be followed, the URL-extraction code above can be changed to:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
response.follow uses an <a> element's href attribute automatically, so the code can be simplified even further:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
response.follow accepts a single URL or selector at a time, not a batch of them.
Scrapy deduplicates requests automatically, so there is no need to worry about crawling the same URL twice. You can pass arguments to your spider by appending the -a option to the command:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become attributes of the spider instance.
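A rough sketch of how such an argument could be used. The helper function below is illustrative, not Scrapy API; the official tutorial does the equivalent inside start_requests, reading the attribute with getattr(self, 'tag', None) and appending it to the start URL:

```python
def build_start_url(tag=None):
    # mirrors the tutorial pattern: `-a tag=humor` becomes self.tag,
    # and start_requests builds the start URL from it
    url = 'http://quotes.toscrape.com/'
    if tag is not None:
        url = url + 'tag/' + tag
    return url

print(build_start_url('humor'))  # http://quotes.toscrape.com/tag/humor
print(build_start_url())         # http://quotes.toscrape.com/
```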
5. Storing the data
The simplest way is to use the following command:
scrapy crawl quotes -o quotes.json
The scraped data is serialized to JSON. For historical reasons, Scrapy appends to an existing file instead of overwriting it, so running this command twice without deleting the file first leaves you with a broken JSON file.
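Why the file breaks: two runs concatenate two JSON arrays, and the result is no longer a single valid JSON document:

```python
import json

# simulate two `-o quotes.json` runs appending to the same file
run1 = '[{"text": "quote one"}]'
run2 = '[{"text": "quote two"}]'
appended = run1 + run2

try:
    json.loads(appended)
    valid = True
except json.JSONDecodeError:
    valid = False  # "Extra data" error: the file is broken JSON
print(valid)  # False
```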
For small projects, the storage command above (with a different filename or format if you like) is usually enough; for anything more complex, consider writing an Item Pipeline in pipelines.py.
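As a minimal sketch of the pipeline idea (the class name and field are mine, not Scrapy API): a pipeline is a class with a process_item(item, spider) method; in a real project it would live in pipelines.py, raise scrapy.exceptions.DropItem to discard an item, and be enabled via ITEM_PIPELINES in settings.py.

```python
class DropEmptyTextPipeline:
    # illustrative only: a real pipeline raises
    # scrapy.exceptions.DropItem instead of returning None
    def process_item(self, item, spider):
        if not item.get('text'):
            return None  # stand-in for DropItem
        return item

pipeline = DropEmptyTextPipeline()
print(pipeline.process_item({'text': 'hello'}, spider=None))  # kept as-is
print(pipeline.process_item({'text': ''}, spider=None))       # dropped (None)
```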