Installing the Scrapy framework

- Install Scrapy: `pip install scrapy`
- On Windows you also need to install `pypiwin32`, otherwise running a Scrapy project later will raise an error. Install it with `pip install pypiwin32`.
- On Ubuntu you also need a few third-party libraries: `sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev`
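As a quick sanity check that the install worked, the installed version can be printed from Python:

```python
import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version if the install succeeded
```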
Creating a spider

- Create a project: `scrapy startproject <project_name>`
- Create a spider: cd into the project directory and run `scrapy genspider [spider_name] [spider_domain]`. Note! The spider name must not be the same as the project name. A sample session is shown below.
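For concreteness, a session that would create the `myspider` project and the `test` spider used in the rest of these notes might look like this:

```
scrapy startproject myspider
cd myspider
scrapy genspider test www.baidu.com
```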
Project directory structure

- items.py: holds the models for the data the spider scrapes, i.e. it defines the fields to be collected
- middlewares.py: where the various middlewares live
- pipelines.py: stores the item model data, e.g. saving it to disk or writing it to a database
- spiders: the directory that holds the spider files
- settings.py: the spider's configuration (request headers, how often to send requests, IP proxy pool, and so on)
- scrapy.cfg: the project's configuration file
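For reference, the layout generated by `scrapy startproject myspider` looks roughly like this, with the files listed above inside the inner `myspider` package:

```
myspider/
├── scrapy.cfg
└── myspider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```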
Running the spider

- `scrapy crawl [spider_name]`
- Going back to the terminal for every run is inconvenient while debugging, so create a file named `run_spider.py` in the project directory and put this in it:

```python
from scrapy import cmdline

# cmdline.execute('scrapy crawl [spider_name]'.split())  # the command is passed as a split-up list
# equivalent to:
cmdline.execute(['scrapy', 'crawl', '[spider_name]'])
```
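With this file in place, the spider can be launched from the IDE by running `python run_spider.py` instead of retyping the crawl command each time.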
The spider file under spiders

- After creating a spider with `scrapy genspider <spider_name> <spider_domain>`, the generated spider file looks like this:

```python
# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'  # spider name; run the spider with [scrapy crawl test]
    allowed_domains = ['www.baidu.com']  # the domains the spider is allowed to crawl
    start_urls = ['http://www.baidu.com']  # the URLs the spider starts crawling from

    def parse(self, response):
        print(response)       # the received response object
        print(response.body)  # raw bytes, the counterpart of response.content in requests
        print(response.text)  # decoded text, the counterpart of response.text in requests
```
- A TestSpider class inheriting from scrapy.Spider has been defined for us automatically
- name: the name the spider runs under, as in `scrapy crawl name`
- allowed_domains: the scope the spider is allowed to crawl. Scrapy runs very fast, and to keep it from straying onto other sites, this restricts the spider to crawling data only under these domains
- start_urls: a list of the URLs the spider crawls first
- parse method: when the spider has fetched data from the web, the returned response is passed to this method for parsing by default; the `response` parameter is the received response
  - response.text returns text data, already decoded
  - response.body returns the page's raw bytes, which we have to decode ourselves
- Parsing methods response can call
  - xpath: response can call xpath directly to parse data, returning selector objects (see the sketch after this list):

```python
print(response.xpath('//title/text()'))
# returns a SelectorList of xpath Selector objects:
# [<Selector xpath='//title/text()' data='百度一下,你就知道'>]
# to pull data out of a selector, use extract() for all matched text (equivalent to getall())
# or extract_first() for the first match (equivalent to get())
```

- The parse method name must not be changed: it is the default callback for the start_urls requests, the returned responses are parsed by it, and Scrapy raises an error if it is missing
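A minimal sketch of reading values out of those selectors, assuming we are inside `parse` with the response above:

```python
title = response.xpath('//title/text()').extract_first()  # first match as a string, same as .get()
titles = response.xpath('//title/text()').extract()       # all matches as a list, same as .getall()
```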
The items.py file

- Defines the fields we want to scrape. It can be thought of as a dict, and it is used the same way a dict is:

```python
import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    baidu_title = scrapy.Field()  # a field to be scraped
```
- Using items: in the spider file, `from ..items import MyspiderItem`

```python
# -*- coding: utf-8 -*-
import scrapy
from ..items import MyspiderItem


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com']

    def parse(self, response):
        items = MyspiderItem()  # instantiate the item; it is used exactly like a dict
        items['baidu_title'] = response.xpath('//title/text()').extract_first()
```
-
注意,当我们在items文件中定义了需要爬取的字段后,在实例化使用items时,如果使用的字段
在items中没有定义,就会报错
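For example (`not_a_field` is a made-up name used here just to trigger the error):

```python
items = MyspiderItem()
items['baidu_title'] = 'ok'    # fine: baidu_title is defined on MyspiderItem
items['not_a_field'] = 'boom'  # KeyError: MyspiderItem does not support field: not_a_field
```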
The pipelines.py file

- The pipeline file processes the data returned by the spider, saving it locally, to a database, etc.

```python
class MyspiderPipeline(object):
    def open_spider(self, spider):  # runs when the spider opens; spider is the spider class you wrote
        pass

    def process_item(self, item, spider):  # processes each item
        return item  # this return is required

    def close_spider(self, spider):  # runs when the spider closes
        pass
```
- How do items get from the spider into the pipelines? Just `yield items` in the spider file:

```python
# -*- coding: utf-8 -*-
import scrapy
from myspider.items import MyspiderItem


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com']

    def parse(self, response):
        items = MyspiderItem()
        items['baidu_title'] = response.xpath('//title/text()').extract_first()
        yield items
```
- Saving the returned items to a local file baidu_test.txt:

```python
class MyspiderPipeline(object):
    def open_spider(self, spider):  # runs when the spider opens
        self.f = open('baidu_test.txt', 'w', encoding="utf-8")  # open a file when the spider starts

    def process_item(self, item, spider):  # called each time an item is passed in
        self.f.write(str(item))  # the item must be converted to str before writing it to the file
        return item

    def close_spider(self, spider):  # runs when the spider closes
        self.f.close()  # close the file when the spider shuts down
```
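A sketch of a common variation on the same idea (the `items.jl` filename is made up): since an item behaves like a dict, each one can be written as a JSON line, which is easier to load back later than `str(item)`:

```python
import json


class JsonLinesPipeline(object):
    def open_spider(self, spider):
        self.f = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) converts the scrapy Item into a plain dict for json.dumps
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
```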
- Note! To use pipelines, you must enable ITEM_PIPELINES in the settings, i.e. remove the comment marks in front of it:

```python
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
```
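The number (300 here) is the pipeline's order: values conventionally range from 0 to 1000, and items pass through pipelines with smaller numbers first.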
The settings.py file

```python
# Scrapy settings for myspider project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'  # the project name

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myspider (+http://www.yourdomain.com)'  # the default User-Agent

# Obey robots.txt rules
ROBOTSTXT_OBEY = True  # whether to obey the robots.txt protocol: True obeys, False does not; obeyed by default

# Configure maximum concurrent requests performed by Scrapy (default: 16)  # maximum concurrency
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3  # download delay, in seconds
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)  # cookies are sent with requests by default
#COOKIES_ENABLED = False  # uncomment to disable cookies

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {  # the default request headers
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderSpiderMiddleware': 543,
#}  # spider middlewares: the key is the path, the value is the priority; the smaller the number, the higher the priority

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {  # downloader middlewares
#    'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {  # the item PIPELINES
#    'myspider.pipelines.MyspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

- The project's settings file; the commonly used settings can be defined or overridden here
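For illustration (the values here are hypothetical, not recommendations), a small project like this one typically overrides just a handful of these settings:

```python
ROBOTSTXT_OBEY = False  # many practice sites disallow bots in robots.txt
DOWNLOAD_DELAY = 1      # be polite: wait 1 second between requests
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # illustrative UA string
}
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
```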
Implementing pagination

- Match the URL of the next page, then use `scrapy.Request(url, callback, headers, cookies)`
  - url: the URL to visit
  - callback: the callback function, i.e. which function will be called to handle the response
  - the remaining parameters such as headers and cookies are the request header information you want to attach
- Example (this `parse` method sits inside a spider class; the `headers` variable is assumed to be defined elsewhere in that file):

```python
import scrapy

def parse(self, response):
    '''
    Extract the detail-page URLs and the next-page URL
    :param response:
    :return:
    '''
    # all the detail-page URLs; xpath returns a SelectorList
    detail_urls = response.xpath('//ul[@class="seeWell cf"]/li/a/@href').extract()
    for detail_url in detail_urls:
        yield scrapy.Request(detail_url, callback=self.parse_detail, headers=headers)

    # the next page's URL
    next_url = response.xpath('//a[@class="next"]/@href').extract_first()
    yield scrapy.Request(next_url, callback=self.parse)
```
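Note that the snippet above does not guard the last page: there `extract_first()` returns None, so in practice you would wrap the final `yield` in `if next_url:` to avoid requesting a missing URL.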