Python Crawler (6): Using Scrapy to Crawl Qunar Scenic Area Information

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reproducing it, please include the original source link and this statement.
Original link: https://blog.csdn.net/ityard/article/details/102646738

Scrapy is an application framework written in Python for crawling websites and extracting structured data. It is widely used for purposes such as data mining, monitoring, and automated testing. It can be installed with the terminal command pip install Scrapy.

One of Scrapy's more attractive points is that we can modify it according to our own needs. It provides several types of spider base classes, such as BaseSpider and sitemap spiders, and newer versions add support for crawling web 2.0 sites.

1 Scrapy Introduction

1.1 Components

  • Scrapy Engine (engine): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: accepts the Requests sent over by the engine, arranges and enqueues them in a certain order, and hands them back to the engine when the engine asks for them.

  • Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Scrapy Engine, which passes them on to the Spider for processing.

  • Spider (crawler): processes all Responses, parses them and extracts data to fill the fields required by the Item, and submits the URLs that need to be followed up to the engine, which sends them to the Scheduler again.

  • Item Pipeline: handles the Items obtained from the Spider and performs post-processing such as detailed analysis, filtering, and storage.

  • Downloader Middlewares: a component that can be extended to customize the download behaviour, for example setting proxies or custom request headers.

  • Spider Middlewares: a component that can customize and extend the communication between the engine and the Spider, for example customizing Requests or filtering Responses.

In general, the Spider and the Item Pipeline are the parts we need to implement ourselves, while the Downloader Middlewares and Spider Middlewares can be customized on demand.

1.2 Workflow

1) The Spider hands the URLs that need to be requested to the Scrapy Engine;

2) The Scrapy Engine forwards the requested URLs to the Scheduler;

3) The Scheduler sorts and enqueues the requests and returns them to the Scrapy Engine;

4) The Scrapy Engine passes the requests through the Downloader Middlewares to the Downloader;

5) The Downloader sends the requests to the internet and, after obtaining the responses, passes them back through the Middlewares to the Scrapy Engine;

6) The Scrapy Engine hands the responses to the Spider, which processes them, parsing and extracting data from them;

7) The Spider passes the parsed data through the Scrapy Engine to the Item Pipeline, which post-processes the data;

8) Newly extracted URLs are passed through the Scrapy Engine to the Scheduler for the next cycle, which continues until there are no URLs left to request.

1.3 Scrapy deduplication mechanism

Scrapy provides deduplication of requests. The deduplication class RFPDupeFilter is in dupefilters.py, whose path is: Python installation directory\Lib\site-packages\scrapy. It has a request_seen method whose source code is as follows:

def request_seen(self, request):
    # compute the fingerprint of the request
    fp = self.request_fingerprint(request)
    # check whether the fingerprint already exists
    if fp in self.fingerprints:
        # it exists: the request is a duplicate
        return True
    # it does not exist: add it to the fingerprint set
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

It is called by the Scheduler when a request is accepted, and it in turn calls the request_fingerprint method (which generates a fingerprint for the request); the source code is as follows:

def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                 for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

In the above code we can see:

fp = hashlib.sha1()
...
cache[include_headers] = fp.hexdigest()

This generates a unique, fixed-length hash value for each URL that is passed in. Now look at the __init__ method; its source code is as follows:

def __init__(self, path=None, debug=False):
	self.file = None
	self.fingerprints = set()
	self.logdupes = True
	self.debug = debug
	self.logger = logging.getLogger(__name__)
	if path:
		self.file = open(os.path.join(path, 'requests.seen'), 'a+')
		self.file.seek(0)
		self.fingerprints.update(x.rstrip() for x in self.file)

We can see the line self.fingerprints = set(); it performs the deduplication by using the nature of a set (a set does not allow duplicate values).
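To make the fingerprint idea concrete, here is a small sketch (it assumes an older Scrapy release where request_fingerprint can still be imported from scrapy.utils.request; recent versions replace it with a fingerprinter class). Because canonicalize_url normalizes the URL, two requests whose query parameters differ only in order get the same fingerprint:

from scrapy import Request
from scrapy.utils.request import request_fingerprint  # removed in recent Scrapy releases

# two URLs that differ only in the order of their query parameters
r1 = Request('https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=')
r2 = Request('https://piao.qunar.com/ticket/list.htm?region=&keyword=%E5%8C%97%E4%BA%AC')

print(request_fingerprint(r1))                              # a 40-character SHA1 hex digest
print(request_fingerprint(r1) == request_fingerprint(r2))   # True: same canonical URL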

Deduplication is controlled by the dont_filter parameter of the request.

When dont_filter is False, deduplication is enabled; when it is True, the request is not deduplicated.
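As a quick sketch, here is a throwaway spider (written only to illustrate the parameter, reusing the listing URL from the example below) that yields the same URL several times to show the difference:

import scrapy

class DupeDemoSpider(scrapy.Spider):
    # throwaway spider, only to illustrate dont_filter
    name = 'dupe_demo'
    start_urls = ['https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC']

    def parse(self, response):
        # 1st request for this URL: its fingerprint is recorded and it is scheduled
        yield scrapy.Request(response.url, callback=self.parse_again)
        # 2nd identical request: the fingerprint already exists, so it is silently dropped
        yield scrapy.Request(response.url, callback=self.parse_again)
        # 3rd identical request: dont_filter=True skips request_seen, so it is downloaded again
        yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.logger.info('fetched %s again', response.url)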

2 Quick Start

Building a Scrapy crawler takes the following four steps:

  • Create a project: create a new crawler project
  • Define the goal: specify the target data you want to crawl (write items.py)
  • Build the spider: write the spider that starts crawling pages (write xxspider.py)
  • Store the content: design the pipeline that stores the crawled content (write pipelines.py)

Here we take crawling the information of Beijing scenic areas from Qunar (piao.qunar.com) as an example.

2.1 Creating the Project

In the directory where we want to create the project, use the terminal command scrapy startproject <project name> to create it. The project I created contains the following:

  • spiders: the directory that holds the spider files
  • items.py: defines the data types
  • middlewares.py: holds the middlewares
  • pipelines.py: holds the operations on the data
  • settings.py: the configuration file
  • scrapy.cfg: the overall control file
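For reference, the layout that scrapy startproject generates (assuming the project is named ticketSpider, which matches the import used later) looks roughly like this:

ticketSpider/
    scrapy.cfg            # the overall control file
    ticketSpider/
        __init__.py
        items.py          # data type definitions
        middlewares.py    # middlewares
        pipelines.py      # operations on the data
        settings.py       # configuration file
        spiders/
            __init__.py   # spider files go in this directory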

2.2 Defining the Item

An Item is the container that holds the crawled data, and it is used much like a dict. The information we plan to extract includes: area, sight (scenic spot), level (rating), and price. We define them in items.py; the source code is as follows:

import scrapy

class TicketspiderItem(scrapy.Item):
    area = scrapy.Field()
    sight = scrapy.Field()
    level = scrapy.Field()
    price = scrapy.Field()
    pass
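As mentioned above, an Item is used much like a dict. A minimal sketch (pure illustration, not part of the project code; the sample values are made up):

from ticketSpider.items import TicketspiderItem

item = TicketspiderItem()
item['area'] = 'example area'      # fields are assigned by key, just like a dict
item['price'] = '60'
print(item['area'])                # and read back by key
print(dict(item))                  # convert to a plain dict, e.g. before writing to CSV
# item['foo'] = 1 would raise KeyError, because only declared Fields are accepted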

2.3 Implementing the Spider

In the spiders directory, use the terminal command scrapy genspider <file name> <URL to crawl> to create the spider file, then modify it and write the actual crawling logic; the source code is as follows:

import scrapy
from ticketSpider.items import TicketspiderItem

class QunarSpider(scrapy.Spider):
    name = 'qunar'
    allowed_domains = ['piao.qunar.com']
    start_urls = ['https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest']

    def parse(self, response):
        sight_items = response.css('#search-list .sight_item')
        for sight_item in sight_items:
            item = TicketspiderItem()
            item['area'] = sight_item.css('::attr(data-districts)').extract_first()
            item['sight'] = sight_item.css('::attr(data-sight-name)').extract_first()
            item['level'] = sight_item.css('.level::text').extract_first()
            item['price'] = sight_item.css('.sight_item_price em::text').extract_first()
            yield item
        # pagination: follow the next-page link
        next_url = response.css('.next::attr(href)').extract_first()
        if next_url:
            next_url = "https://piao.qunar.com" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

A brief explanation:

  • name: the name of the spider
  • allowed_domains: the domains that are allowed to be crawled
  • start_urls: the initial URLs requested when the crawl starts (more than one can be defined)
  • parse method: the method that parses the web page
  • response parameter: the content returned after requesting the page
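Before writing parse, the CSS selectors can be tried out interactively with Scrapy's built-in shell (a quick sketch; the page structure of piao.qunar.com may have changed since this article was written, and the site may require a browser-like User-Agent):

scrapy shell "https://piao.qunar.com/ticket/list.htm?keyword=%E5%8C%97%E4%BA%AC&region=&from=mpl_search_suggest"
>>> sights = response.css('#search-list .sight_item')        # the same selector used in parse
>>> sights.css('::attr(data-sight-name)').extract_first()    # name of the first scenic spot
>>> response.css('.next::attr(href)').extract_first()        # link to the next page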

yield

In the code above there is a yield. Briefly: yield is a keyword whose role is similar to return, the difference being that yield returns a generator (in Python, a mechanism that computes values as you iterate is called a generator). Its benefit is lower resource usage: a list keeps all of its data in memory, whereas a generator is more like a recipe for producing the values rather than the values themselves, so it occupies very little memory.
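A tiny, standalone illustration of the difference between return and yield (nothing to do with the crawler itself):

def squares_list(n):
    return [i * i for i in range(n)]   # the whole list is built in memory at once

def squares_gen(n):
    for i in range(n):
        yield i * i                    # values are produced one at a time, on demand

gen = squares_gen(1000000)
print(next(gen))   # 0
print(next(gen))   # 1 -- only the values we ask for are ever computed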

Disguising the Crawler

A crawler usually needs some disguise. For a brief introduction to disguising crawlers, see [Python Crawler (1): Crawler Disguise]; here we handle it with the simplest possible method.

  • Install it with the terminal command pip install scrapy-fake-useragent
  • Add the following code to the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    # disable the default user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # enable the random user-agent middleware
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
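If you would rather not install an extra package, a simpler (though less varied) alternative is to set a fixed, browser-like User-Agent in settings.py; the UA string below is only an example:

# settings.py: send the same browser-like User-Agent with every request
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/78.0.3904.108 Safari/537.36')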

2.4 Saving the Data

We will save the data to a local CSV file. For the details of working with CSV, see: CSV File Reading and Writing. Let's look at the concrete implementation.

First, write the implementation in pipelines.py; the source code is as follows:

import csv

class TicketspiderPipeline(object):
    def __init__(self):
        self.f = open('ticker.csv', 'w', encoding='utf-8', newline='')
        self.fieldnames = ['area', 'sight', 'level', 'price']
        self.writer = csv.DictWriter(self.f, fieldnames=self.fieldnames)
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on the pipeline when the spider finishes
        self.f.close()

Then, enable the pipeline by uncommenting the following code in the settings.py file:

ITEM_PIPELINES = {
    'ticketSpider.pipelines.TicketspiderPipeline': 300,
}

That is all that is needed.

2.5 Running

We create a run file in the same directory as settings.py (name it whatever you like) and put the following code in it:

from scrapy.cmdline import execute
execute('scrapy crawl qunar'.split())  # 'qunar' is the spider's name attribute

The spider name here is the value of the name attribute we set in the spider file earlier ('qunar' in this example); finally, just run this file in PyCharm.
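Alternatively (this is not the approach used above, just the standard Scrapy command line), the spider can be run from a terminal inside the project directory, and the built-in feed exporter can write a CSV without going through the pipeline; sights.csv is just an example file name:

scrapy crawl qunar                 # items are handled by TicketspiderPipeline
scrapy crawl qunar -o sights.csv   # or let Scrapy's feed exporter write the CSV directly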


