Python crawler: the Scrapy framework

Install: pip install Scrapy
Frequently used subcommands:
startproject: create a new project
genspider: generate a new spider from a template
crawl: run a spider
shell: start the interactive scraping console
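For reference, the basic invocation of each subcommand looks like this (the angle-bracket names are placeholders):

scrapy startproject <project_name>
scrapy genspider <spider_name> <domain>
scrapy crawl <spider_name>
scrapy shell <url>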
Create a new project and enter the project directory:
scrapy startproject CrawlerTest    (CrawlerTest is the project name)
cd CrawlerTest
This generates the following files:
items.py: defines the model of the fields to be scraped
settings.py: defines settings such as the user agent, download delay, and so on
spiders/: this directory holds the actual spider code
scrapy.cfg (the project configuration) and pipelines.py (post-processing of scraped items) do not need to be modified here.
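For orientation, the generated project layout looks roughly like this (middlewares.py may be absent in older Scrapy versions):

CrawlerTest/
    scrapy.cfg
    CrawlerTest/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py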
1. Modify items.py as follows:

import scrapy


class CrawlertestItem(scrapy.Item):
    # define the fields for your item here like:
    # field to scrape: name
    name = scrapy.Field()
    # field to scrape: population
    population = scrapy.Field()
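For illustration, this is how such an item is filled inside a spider callback (a minimal sketch; the values shown are placeholders):

from CrawlerTest.items import CrawlertestItem

item = CrawlertestItem()
item['name'] = 'Afghanistan'          # placeholder value
item['population'] = '29,121,286'     # placeholder value
# fields behave like dict keys; assigning an undeclared field raises KeyError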

Creating the spider
The genspider command generates an initial template from a spider name, a domain, and an optional template argument.
The command is as follows:
scrapy genspider country CrawlerTest.webscraping.com --template=crawl
country: the spider name
The following code is generated automatically in CrawlerTest/spiders/country.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CountrySpider(CrawlSpider):
    # spider name
    name = 'country'
    # list of domains the spider is allowed to crawl
    allowed_domains = ['CrawlerTest.webscraping.com']
    # list of start URLs
    start_urls = ['http://CrawlerTest.webscraping.com/']
    # crawling rules (the allow patterns are regular expressions)
    rules = (
        Rule(LinkExtractor(allow='/index/'), follow=True),
        Rule(LinkExtractor(allow='/view/'), callback='parse_item')
    )

    # extract data from the response
    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

Optimizing the settings
By default, Scrapy allows up to 8 concurrent downloads per domain with no delay between them. When the download rate stays above roughly one request per second, the crawler risks being temporarily banned, so add a request limit and a download delay (with a random offset) in settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 5
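The random offset comes from RANDOMIZE_DOWNLOAD_DELAY, which is enabled by default and multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5; it can also be set explicitly:

RANDOMIZE_DOWNLOAD_DELAY = True   # default; here it spreads requests over roughly 2.5-7.5 seconds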
Starting the crawler
scrapy crawl country -s LOG_LEVEL=DEBUG
This shows the crawl proceeding correctly, but a lot of time is wasted fetching the login and register form links on every page. These can be excluded with the deny argument of LinkExtractor; modify country.py again as follows:
    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item')
    )
Scraping with the shell command
scrapy shell <url>
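The shell is useful for testing selectors before putting them into the spider. A sketch of a session, where the URL is a placeholder for one of the site's /view/ pages and the selectors match those used in parse_item below:

scrapy shell http://CrawlerTest.webscraping.com/view/Afghanistan-1
>>> response.css('tr#places_country_row td.w2p_fw::text').extract()
>>> response.css('tr#places_population_row td.w2p_fw::text').extract()
>>> exit()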
Complete spider code: country.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CrawlerTest.items import CrawlertestItem


class CountrySpider(CrawlSpider):
    name = 'country'
    allowed_domains = ['CrawlerTest.webscraping.com']
    start_urls = ['http://CrawlerTest.webscraping.com/']

    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item')
    )

    def parse_item(self, response):
        item = CrawlertestItem()
        # country name sits in the table row with id "places_country_row"
        name_css = 'tr#places_country_row td.w2p_fw::text'
        item['name'] = response.css(name_css).extract()
        # population sits in the table row with id "places_population_row"
        pop_css = 'tr#places_population_row td.w2p_fw::text'
        item['population'] = response.css(pop_css).extract()
        return item

Saving to a CSV file:
scrapy crawl country --output=countries.csv -s LOG_LEVEL=INFO
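The feed exporter infers the format from the file extension, so other formats work the same way, for example JSON:

scrapy crawl country --output=countries.json -s LOG_LEVEL=INFO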
Interrupting and resuming the crawler
scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=crawls/country
The crawl state is saved in the directory crawls/country.
Running the same command afterwards resumes the crawl (mainly useful when crawling large sites).


Reposted from blog.csdn.net/CowBoySoBusy/article/details/80530599