Scrapy study notes (1)

Environment: Windows 7 x64, Python 3.7.1, PyCharm

First, install Scrapy

1.1 Linux: pip install scrapy

1.2 Windows:

  • pip install wheel
  • Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (pick the build that matches your Python version; mine is Python 3.7, so I download the cp37 file)
  • pip install path\Twisted-19.2.1-cp37-cp37m-win_amd64.whl
  • pip install pywin32
  • pip install scrapy (a quick check that the install worked follows below)
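To confirm the install worked, a minimal check of my own (run from the Python console of the same interpreter; not part of the original steps):

import scrapy

# Printing the version confirms that Scrapy is importable from this interpreter
print(scrapy.__version__)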

Second, create a Scrapy project

1. Create a new project in PyCharm (a plain Python project will do). The project name I use here is demo. You now have an empty project.

2. Open the Terminal panel at the bottom of PyCharm, as shown below:

In the terminal, type scrapy startproject demo to create the Scrapy project. After it succeeds, the following directory structure appears:
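The generated layout looks roughly like this (the exact file list varies slightly across Scrapy versions; middlewares.py, for example, may or may not be present):

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py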

The role of each file is as follows:

  • scrapy.cfg: the project's configuration file.
  • demo/: the project's Python module; you will add your code in here.
  • demo/items.py: the project's item definitions, used to declare the structure of the data you extract, similar to ORM models (see the sketch after this list).
  • demo/pipelines.py: the project's pipelines, used to persist the extracted data (to files, a database, and so on).
  • demo/settings.py: the project's settings file.
  • demo/spiders/: the directory that holds spider code; we write our crawlers here.
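As an illustration of what items.py holds, a hypothetical item for the books scraped later in this post could look like this (BookItem and its fields are example names of mine, not generated code):

import scrapy

class BookItem(scrapy.Item):
    # Each Field declares one piece of structured data the spider will fill in
    title = scrapy.Field()
    price = scrapy.Field()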

3. Create a spider file

3.1 In the terminal, type: cd demo (I type demo here because that is the project name I used).

3.2 In the terminal, type: scrapy genspider books books.toscrape.com (the format is scrapy genspider <spider name> <start URL of the site to crawl>).

 

4. Open the books.py file; the generated file looks like this:
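The template that scrapy genspider produces is roughly the following (details vary slightly across Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass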

5. Scrape the book information from http://books.toscrape.com/.

5.1 Analyze the http://books.toscrape.com/ page.

From the figure above we can see that all the books sit in li tags under div/ol. Here we only print the book titles, so we can extract the data as shown below.

5.2 The relevant part of books.py is as follows:

def parse(self, response):
        '''
        Parse the response and extract the data.
        :param response: the Response object for the downloaded page
        :return:
        '''
        book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
        for book in book_list:
            # The alt attribute of the cover image holds the full book title
            print(book.xpath('./article/div[1]/a/img/@alt').extract())
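As a side note of my own (not from the original post), the same titles can be pulled with a shorter, less brittle selector than the absolute XPath above; the product_pod class name is an assumption about the page markup:

# Equivalent extraction with a CSS selector; the <a> title attribute also holds the full name
print(response.css('article.product_pod h3 a::attr(title)').extract())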

5.3 Configure the following in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'   # UA (User-Agent) header
ROBOTSTXT_OBEY = False   # if True, Scrapy obeys robots.txt and most of this data cannot be crawled, so set it to False
LOG_LEVEL = 'ERROR'  # log level

5.4 Run the crawl command in the terminal: scrapy crawl books

# The printed output looks like this
['A Light in the Attic']
['Tipping the Velvet']
['Soumission']
['Sharp Objects']
['Sapiens: A Brief History of Humankind']
['The Requiem Red']
['The Dirty Little Secrets of Getting Your Dream Job']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics']
['The Black Maria']
['Starving Hearts (Triangular Trade Trilogy, #1)']
["Shakespeare's Sonnets"]
['Set Me Free']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"]
['Rip it Up and Start Again']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991']
['Olio']
['Mesaerion: The Best Science Fiction Stories 1800-1849']
['Libertarianism for Beginners']
["It's Only the Himalayas"]

As you can see, only one page was crawled; next we crawl the titles of all the books.

6. Crawl the books on every page.

The final books.py looks like this:

# -*- coding: utf-8 -*-
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # unique identifier of the spider
    allowed_domains = ['books.toscrape.com']
    # Starting point(s) of the crawl; there can be more than one.
    start_urls = ['http://books.toscrape.com/']
    url = 'http://books.toscrape.com/catalogue/page-%d.html'   # URL template used to build the next page's URL
    page_num = 2
    def parse(self, response):
        '''
        Parse the response and extract the data.
        :param response: the Response object for the downloaded page
        :return:
        '''
        print(f'Current page counter: {self.page_num}')  # print the page counter
        book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
        for book in book_list:
            print(book.xpath('./article/div[1]/a/img/@alt').extract())
        if self.page_num <= 50:  # the site has 50 pages in total
            new_url = self.url % self.page_num  # build the next page's URL
            self.page_num += 1  # move the page counter forward
            yield scrapy.Request(url=new_url, callback=self.parse)  # send the request manually
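A design note of my own (not from the original post): instead of hard-coding the 50-page limit, the spider could follow the site's own "next" link, which stops by itself on the last page. The li[@class="next"] selector is an assumption about the page markup:

    def parse(self, response):
        for book in response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li'):
            print(book.xpath('./article/div[1]/a/img/@alt').extract())
        # Follow the "next" pagination link if this page has one
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_href:
            yield scrapy.Request(url=response.urljoin(next_href), callback=self.parse)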

Run the command in the terminal to fetch the book titles: scrapy crawl books

If everything goes well, the tail end of the printed output looks like this:

Summary for today:

  • Create a Scrapy project: scrapy startproject <project name>.
  • Create a spider: scrapy genspider books books.toscrape.com (the format is scrapy genspider <spider name> <start URL of the site to crawl>). The spider name is the unique identifier within the whole project; two spiders cannot share the same name.
  • Run the spider: scrapy crawl books (the format is scrapy crawl <spider name>).
  • The parse method: when a page finishes downloading, the Scrapy engine calls back the page-parsing function we specified (parse by default) to parse it. A page-parsing function usually has to accomplish two tasks:

    Extract data from the page (using XPath or CSS selectors).
    Extract links from the page and generate download requests for the linked pages.

  • A page-parsing function is usually implemented as a generator function: every piece of data extracted from the page and every download request for a linked page is handed to the Scrapy engine with a yield statement (see the sketch right after this list).
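For example, the loop in books.py could hand the titles to the engine as items (plain dicts here) instead of printing them, so that pipelines can pick them up:

for book in book_list:
    # Yield each title as an item (a plain dict) instead of printing it
    yield {'title': book.xpath('./article/div[1]/a/img/@alt').extract_first()}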

How the parse method works (summarized from material found online):

1. Because yield is used instead of return, the parse function is treated as a generator. Scrapy pulls the results produced in parse one at a time and checks what type each result is;
2. if the result is a Request it is added to the crawl queue, if it is an item type it is handed to a pipeline for processing (a minimal pipeline sketch follows this list), and any other type raises an error.
3. Scrapy does not send a Request the moment it takes it from the first part; it just puts the Request into the queue and keeps pulling from the generator;
4. once the Requests from the first part are exhausted, it fetches the items from the second part, and every item it gets is handed to the corresponding pipeline;
5. the parse() method is assigned to the Request as its callback, i.e. scrapy.Request(url, callback=self.parse) tells Scrapy to handle those requests with parse();
6. the Request objects pass through the scheduler and are executed to produce scrapy.http.Response objects, which are fed back into parse(), until the scheduler has no Requests left (a recursive idea);
7. once everything is exhausted, parse() finishes, and the engine then carries out the corresponding operations based on the queue and the pipelines;
8. before extracting the items of each page, the program first works through all the requests already in the request queue, and only then extracts the items.
9. The Scrapy engine and the scheduler take care of all of this.
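To make the item-to-pipeline hand-off concrete, here is a minimal sketch of my own (DemoPipeline matches the class name that startproject generates for a project called demo; the priority value 300 is arbitrary):

# demo/pipelines.py
class DemoPipeline(object):
    def process_item(self, item, spider):
        # Called once for every item a spider yields; store or transform the item here
        print(item)
        return item

# demo/settings.py -- a pipeline only runs if it is enabled here
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,   # lower numbers run earlier
}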

 

 

 
