Today we'll learn how to use Scrapy to scrape images.
1. Create the project
scrapy startproject xiaohuascrapy
Create a spider file and name it xiaohua.py.
2. Define the Item
import scrapy
from scrapy.item import Item, Field


class XiaohuascrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    file_urls = scrapy.Field()  # URLs for FilesPipeline to download
    files = scrapy.Field()      # download results, filled in by FilesPipeline
3. Write the spider file
# -*- coding: utf-8 -*-
import scrapy
from xiaohuascrapy.items import XiaohuascrapyItem

words = '张馨予'  # search keyword


class XiaohuaSpider(scrapy.Spider):
    name = "xiaohua"
    allowed_domains = ["baidu.com"]
    custom_settings = {
        # Override the storage path
        'FILES_STORE': '/图片/baidu/%s' % words,
    }
    pn = 0  # result offset used for pagination

    def __init__(self, keywords='', *args, **kwargs):
        # Note: keywords is accepted but unused; the search term comes from words.
        super(XiaohuaSpider, self).__init__(*args, **kwargs)
        self.url = 'http://image.baidu.com/search/flip?tn=baiduimage&word=' + words
        self.start_urls = [self.url]

    def parse(self, response):
        urls = response.selector.re(r'''"objURL":"(http://[^"]+?)"''')
        if not urls:
            return  # no image URLs on this page: stop paginating
        item = XiaohuascrapyItem()
        item['file_urls'] = urls
        yield item
        # Advance the offset by 20 and request the next page of results
        self.pn += 20
        yield scrapy.Request('%s%s%d' % (self.url, '&pn=', self.pn), self.parse)
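The two ideas the spider relies on, pulling every "objURL" value out of the page text with a regular expression and building the next page's URL from the pn offset, can be tried without Scrapy. The following is a minimal sketch using only the standard library. SAMPLE_PAGE is a made-up fragment, not real Baidu output, and the helper names extract_image_urls and page_url are hypothetical. The call to quote() is an added assumption; the spider above concatenates the keyword without encoding it.

```python
import re
from urllib.parse import quote

# Hypothetical fragment standing in for the JSON embedded in a results page.
SAMPLE_PAGE = '''
{"thumbURL":"http://example.com/thumb/1.jpg","objURL":"http://example.com/full/1.jpg"},
{"thumbURL":"http://example.com/thumb/2.jpg","objURL":"http://example.com/full/2.png"}
'''

# Same pattern as in the spider; note that it only matches http:// URLs.
OBJ_URL = re.compile(r'"objURL":"(http://[^"]+?)"')


def extract_image_urls(page_text):
    """Return every objURL value found in the page text, in order."""
    return OBJ_URL.findall(page_text)


def page_url(keyword, offset):
    """Build the search URL for one result page; offset is the pn value."""
    base = 'http://image.baidu.com/search/flip?tn=baiduimage&word='
    # Assumption: percent-encode the keyword so non-ASCII terms are URL-safe.
    return '%s%s&pn=%d' % (base, quote(keyword), offset)


urls = extract_image_urls(SAMPLE_PAGE)
print(urls)
print(page_url('test', 20))
```

An empty list from extract_image_urls is a natural signal that there are no further results, which is a reasonable point at which to stop requesting more pages.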
4. Configure the settings file
BOT_NAME = 'xiaohuascrapy'

SPIDER_MODULES = ['xiaohuascrapy.spiders']
NEWSPIDER_MODULE = 'xiaohuascrapy.spiders'

# Note: USER_AGENTS is not a built-in Scrapy setting (the built-in one is
# USER_AGENT); this list only takes effect if a downloader middleware reads it.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
]

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'xiaohuascrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

COOKIES_ENABLED = False

# Enable Scrapy's built-in files pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 100,
}

LOG_LEVEL = 'DEBUG'
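Scrapy itself reads USER_AGENT, not USER_AGENTS, so the list in the settings only does something if a downloader middleware picks it up. The post does not show such a middleware, so the following is a sketch of what one might look like, not the author's code. The class name RandomUserAgentMiddleware is hypothetical; from_crawler and process_request are the standard downloader middleware hooks.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; read the custom list.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Only override the header when the list is non-empty.
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let the request continue through the middleware chain
```

To use something like this, the class would have to be registered in DOWNLOADER_MIDDLEWARES in the settings file, under whatever module path it is saved at (for example a middlewares.py inside the project, which is an assumption here).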
That's basically it. Now run the project:
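scrapy crawl xiaohua
This creates a 图片 ("images") folder in the root directory of the disk, following the FILES_STORE path set in the spider. Open it to see the downloaded images.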
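The downloaded files do not keep their original names. As far as I understand the default FilesPipeline, each file is saved in a full subfolder of FILES_STORE under the SHA-1 hash of its URL, with the URL's extension. The sketch below approximates that naming so a file on disk can be matched back to its URL. The helper stored_path is hypothetical, and the exact behavior (for instance how unknown extensions are handled) may differ between Scrapy versions.

```python
import hashlib
import os


def stored_path(files_store, url):
    """Approximate where FilesPipeline saves the file downloaded from url."""
    # Assumption: the file name is the SHA-1 hex digest of the URL.
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(url)[1]  # keep the URL's extension, e.g. ".jpg"
    return '%s/full/%s%s' % (files_store.rstrip('/'), digest, ext)


print(stored_path('/图片/baidu/张馨予', 'http://example.com/full/1.jpg'))
```

The same relative path should also appear in the path key of each entry the pipeline writes into the item's files field.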