Learning Scrapy by Downloading Images

Scrapy is a web-crawling framework for Python.
Its main components are spiders, items, middlewares, and pipelines.
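1. spider
A spider defines a crawler: which pages to fetch, which links to follow, and what content to extract from each page.
name sets the spider's name.
allowed_domains restricts the crawl to the listed domains. When OffsiteMiddleware is enabled (it is by default), requests to domains outside this list are filtered out and never fetched.
start_urls is the list of URLs the crawl starts from; it may contain several URLs.
parse is the callback that parses the response returned for each URL in start_urls.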
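
To make the allowed_domains rule concrete, here is a minimal sketch of the kind of host matching involved, using only the standard library. is_allowed is a hypothetical helper written for this illustration, not part of Scrapy's API; it assumes a host passes when it equals a listed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is a listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["www.mm131.com"]
same_host = is_allowed("http://www.mm131.com/xinggan/", allowed)
# img1.mm131.com is a sibling of www.mm131.com, not a subdomain of it
sibling_host = is_allowed("http://img1.mm131.com/pic/1.jpg", allowed)
print(same_host, sibling_host)
```

Under this rule, listing 'mm131.com' instead of 'www.mm131.com' would also let through hosts such as img1.mm131.com.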
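yield scrapy.Request(url, callback=self.content) sends a new request for a URL extracted from the page and hands the response to the given callback.
response.follow(next_url, callback=self.parse) does the same, with two conveniences.
response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow only returns a Request instance; you still have to yield that request.
You can also pass a selector to response.follow instead of a string; the selector should point at the necessary attribute (for an <a> element, the href is used automatically):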

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

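response.follow(response.css('li.next a')) does not work, because response.css returns a list-like object holding a selector for every match, not a single selector. A for loop like the one above, or response.follow(response.css('li.next a')[0]), works.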
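
As a rough illustration of what "supports relative URLs" means, the sketch below resolves a few links against the page they were found on with the standard library's urljoin. This is approximately the resolution response.follow performs for you, ignoring any <base> tag in the page. The path list_6_2.html is a made-up example, not taken from the site.

```python
from urllib.parse import urljoin

page_url = "http://www.mm131.com/xinggan/"

# A link relative to the current directory (hypothetical file name)
relative = urljoin(page_url, "list_6_2.html")
# A link relative to the site root
rooted = urljoin(page_url, "/qingchun/")
# An absolute link is returned unchanged
absolute = urljoin(page_url, "http://www.mm131.com/chemo/")

print(relative)
print(rooted)
print(absolute)
```

With scrapy.Request you would have to do this joining yourself before building the request, because it expects an absolute URL.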

Summary: response.follow()
            ① supports relative URLs
            ② accepts a single selector in place of a URL string

2. CSS parsing

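p:nth-last-child(2)
Matches every p element that is the second child of its parent, counting from the last child.
p:last-child
Matches every p element that is the last child of its parent.
:not(x)
A functional pseudo-class that takes a simple selector x as its argument and matches elements that x does not describe. x must not contain another negation selector.
:not(foo) matches any element that is not foo, including html and body.
The selector applies to a single element only; it cannot be used to exclude everything under an ancestor. For example, body :not(table) a still matches <a> elements inside a table, because the <tr> in between is matched by the :not(table) part.
response.css("a::text").extract() returns the text of every matching a element as a list.
response.css("a::attr(href)").extract_first() returns the href attribute (the URL) of the first matching a element.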

Spider source code

import scrapy
from Aoisolas.items import AoisolasItem

class MeinvSpider(scrapy.Spider):
    name = 'AoiSola'
    allowed_domains = ['www.mm131.com']
    start_urls = [
        'http://www.mm131.com/xinggan/', 
        "http://www.mm131.com/qingchun/",
        "http://www.mm131.com/chemo/",
        "http://www.mm131.com/qipao/",
        "http://www.mm131.com/mingxing/"
        ]

    def parse(self, response):
        # Every <dd> in the list except the pagination one (class "page") is a gallery
        galleries = response.css(".list-left dd:not(.page)")
        for gallery in galleries:
            gallery_url = gallery.css("a::attr(href)").extract_first()
            if gallery_url is not None:
                # Request the gallery page; content() extracts the image URLs
                yield scrapy.Request(gallery_url, callback=self.content)
        # Follow the pagination link once per listing page, not once per gallery
        next_url = response.css(".page-en:nth-last-child(2)::attr(href)").extract_first()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)
    def content(self,response):
        item = AoisolasItem()
        item['name'] = response.css(".content h5::text").extract_first()
        item['ImgUrl'] = response.css(".content-pic img::attr(src)").extract()
        item['referer'] = response.url  # sent later as the Referer header to get past hotlink protection
        yield item 
        next_url = response.css(".page-ch:last-child::attr(href)").extract_first()
        if next_url is not None:
            yield response.follow(next_url,callback=self.content)

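3. items.py
Defines the fields to scrape. You declare them in a class; instantiating that class gives you an item object that the spider fills in.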

import scrapy

class AoisolasItem(scrapy.Item):
    name = scrapy.Field()
    ImgUrl = scrapy.Field()
    referer = scrapy.Field()
    image_paths = scrapy.Field()

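4. pipelines.py
Pipelines process the item objects that the spider yields; this is where you pull data such as URLs and text out of the item.
A pipeline is a class you define with a process_item method.
A pipeline can also inherit from ImagesPipeline, which downloads images.
The item carries only the image URLs; the images themselves have not been downloaded yet. To download them, each URL has to be requested again, which is what the get_media_requests(self, item, info) method is for. Hotlink protection usually checks the Referer field in the request headers: if the request does not come from the same site, it is redirected somewhere else. To get past it, add a Referer header to the request. To pass extra data along with the request, use meta.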

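# Method of the ImagesPipeline subclass; needs "from scrapy import Request"
def get_media_requests(self, item, info):
    for image_url in item['ImgUrl']:
        # meta carries the gallery name to file_path; the Referer header gets past hotlink protection
        yield Request(image_url, meta={'item': item['name']}, headers={'referer': item['referer']})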

Scrapy names each downloaded file after a hash of its URL, so to give files readable names you need to override the file_path method.

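# Needs "import re" at the top of pipelines.py
def file_path(self, request, response=None, info=None):
    name = request.meta['item']
    name = re.sub(r'[？\\*|“<>:/()0123456789]', '', name)  # strip special characters and digits
    image_guid = request.url.split("/")[-1]  # keep the file name from the image URL
    filename = u'full/{0}/{1}'.format(name, image_guid)
    return filename  # path relative to IMAGES_STORE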
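
To see what path this produces, the same naming logic can be pulled out into a plain function and run on its own. build_image_path is a hypothetical helper written for this illustration, and the gallery name and URL are made-up inputs.

```python
import re


def build_image_path(gallery_name, image_url):
    """Hypothetical stand-alone version of the naming logic in file_path."""
    # Remove characters that are awkward in directory names, plus digits
    safe_name = re.sub(r'[？\\*|“<>:/()0123456789]', '', gallery_name)
    # The last URL segment is the original file name
    image_guid = image_url.split("/")[-1]
    return 'full/{0}/{1}'.format(safe_name, image_guid)


path = build_image_path("Beach:Day(3)", "http://img.example.com/pic/100/7.jpg")
print(path)  # full/BeachDay/7.jpg
```

Because digits are stripped from the name, pages titled "Beach:Day(2)" and "Beach:Day(3)" end up in the same directory, which groups all images of one gallery together.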

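The last step is saving the result by overriding item_completed. results is the list of outcomes for the requests issued by get_media_requests; each entry is an (ok, info) pair, and ok is True when the download succeeded.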

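# Needs "from scrapy.exceptions import DropItem" at the top of pipelines.py
def item_completed(self, results, item, info):
    image_path = [x['path'] for ok, x in results if ok]
    if not image_path:
        raise DropItem('Item contains no images')  # discard items with no downloaded image
    item['image_paths'] = image_path
    return item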
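
The filtering in item_completed is easier to follow with a concrete results list. The sketch below uses invented sample data: successful entries carry a dict with the stored path, and the failed entry uses a plain Exception as a stand-in for the failure object Scrapy would pass. collect_paths is a hypothetical helper for this illustration.

```python
def collect_paths(results):
    """Hypothetical helper: keep the stored path of every successful download."""
    return [info['path'] for ok, info in results if ok]


# Invented sample data shaped like the results argument
results = [
    (True, {'url': 'http://img.example.com/1.jpg', 'path': 'full/demo/1.jpg', 'checksum': 'abc123'}),
    (False, Exception('download failed')),  # stand-in for a failed download
    (True, {'url': 'http://img.example.com/2.jpg', 'path': 'full/demo/2.jpg', 'checksum': 'def456'}),
]

image_paths = collect_paths(results)
print(image_paths)
```

If every download had failed, the list would be empty, which is the case where the pipeline raises DropItem.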

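5. settings.py
The project's configuration file, where crawl parameters are set. IMAGES_STORE sets the directory where images are saved, and ITEM_PIPELINES enables the pipelines.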

IMAGES_STORE ="/mnt/d/meizi2"
ITEM_PIPELINES = {
   'Aoisolas.pipelines.AoisolasPipeline': 300,
}

6. Running the spider

scrapy crawl AoiSola
