scrapy notes (1): using scrapy.Spider to crawl text and save it

Preface

Today, while working on a crawler project, I ran into an XPath parsing problem. I struggled with it for more than ten minutes and still could not solve it. What makes me uneasy is that this knowledge point is not difficult, and I have gone over it repeatedly before. That kind of memory lapse forces me to re-examine the role of notes. Clearly, writing blog posts to record and digest what I study has become urgent, and keeping such notes is now indispensable.

scrapy installation

The installation process can be laborious, and there are plenty of CSDN tutorials; downloading the required packages step by step takes patience. When pip fails to download a package, the first thing to try is switching the mirror source, and only then consider installing from a .whl file.
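As a rough sketch of those two fallbacks (the mirror URL is one common choice and the wheel filename is only an example; neither comes from the original post):

pip install scrapy
# if the default index is slow or fails, point pip at a mirror
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
# last resort: install a manually downloaded wheel, e.g. for Twisted
pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl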

scrapy basic theoretical knowledge

The theory is written down in a soft-cover notebook, together with a physical book; here, practice comes first.

scrapy instance record

Download novel chapter names and corresponding links

1. Create the project, its spider, and the start.py file

Project directory
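A freshly generated project should look roughly like this (assuming the project is named xiaoshuo, which matches the pipeline path in settings.py, and that start.py is added by hand next to scrapy.cfg):

xiaoshuo/
    scrapy.cfg
    start.py
    xiaoshuo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            biquge.py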
start.py content

from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'biquge'])
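Running start.py from the IDE is effectively equivalent to running the following command in the directory that contains scrapy.cfg:

scrapy crawl biquge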

2. Sorting out the workflow

  • [1] settings.py: basic settings (request headers, robots protocol, item pipeline)
  • [2] biquge.py: the spider itself (request and parse the web page, build the item, and yield it)
  • [3] items.py: declare each scraped field as a scrapy.Field()
  • [4] pipelines.py: XiaoshuoPipeline(object) stores the items to a file through open_spider(self, spider), process_item(self, item, spider), and close_spider(self, spider)
  • [5] This example only crawls a single page

3. The contents of each file

settings.py

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
}
ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}

biquge.py

  • Only the chapter names and chapter links of the novel were crawled. When I tried to go one step further and fetch the chapter contents, the site returned 503 Service Unavailable, which is still unresolved (a throttling sketch follows the code below); the eventual goal is to use scrapy to crawl the novel text itself.
import scrapy
from ..items import XiaoshuoItem


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
    allowed_domains = ['paoshuzw.com']
    start_urls = ['http://www.paoshuzw.com/10/10489/']

    def parse(self, response):
        # chapter names
        name_list = response.xpath("//dd//text()").getall()
        # chapter links
        href_list = response.xpath("//dd//@href").getall()
        # pair each name with its link instead of yielding them in two separate loops
        for name, href in zip(name_list, href_list):
            print(name, href)
            item = XiaoshuoItem(name=name, href=href)
            yield item
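One untested idea for the 503 problem (treat it purely as a sketch, not something verified against this site): slow the crawl down and let Scrapy retry 503 responses. These are standard Scrapy settings and would go into settings.py:

DOWNLOAD_DELAY = 1                  # pause between requests
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to the server
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]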

items.py

import scrapy


class XiaoshuoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    href = scrapy.Field()
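Declared this way, an item behaves like a dict, which is what the pipeline below relies on when it calls dict(item). A quick illustration with made-up values (not from the original post):

item = XiaoshuoItem(name="第一章", href="/10/10489/1.html")
print(item["name"])   # fields are read like dict keys
print(dict(item))     # the whole item converts to a plain dict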

pipelines.py

import json

class XiaoshuoPipeline:
    def open_spider(self, spider):
        self.fp = open("小说.txt", "w", encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")  # ensure_ascii=False keeps the Chinese text readable
        print(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
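As a side note (not from the original post): for a simple dump like this, Scrapy's built-in feed export can produce a similar JSON-lines file without any custom pipeline, although the pipeline gives finer control over the format:

scrapy crawl biquge -o 小说.jl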

The final effect

Partial output of this example (小说.txt, one JSON line per chapter)

  • Suitable when several pages with the same structure need to be crawled; the snippet below goes at the end of parse()
        next_href = response.xpath("//a[@id='amore']/@href").get()
        print(next_href)
        # at this point we only have half of the URL (a relative path)
        if next_href:
            # check that a next-page link exists, otherwise the spider would fall into an infinite loop
            next_url = response.urljoin(next_href)  # urljoin adds the domain automatically
            request = scrapy.Request(next_url)  # build a Request object
            yield request  # yielding an item hands it to the pipeline; yielding a Request hands it to the scheduler, which sends the request again

Summary

The above is only a preliminary application of scrapy: it can crawl a site's text and store it in a specified file, and the crawl is extremely fast.

  • How can I follow a link on the page and crawl the content behind it? (a sketch of one possible approach follows this list)
  • Can the output file format be more flexible?
  • biquge.py is written very simply and only handles a single page. What does it take to crawl multiple pages? Repeatedly yielding a Request only re-crawls pages that follow the same rules; what if I want to crawl other pages with different rules?
  • The real power of scrapy is yet to come
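A sketch of the first question only, untested against this site (especially given the 503 issue above); the spider name, the parse_chapter callback, and the content XPath are all assumptions:

import scrapy


class BiqugeChapterSpider(scrapy.Spider):
    # hypothetical spider; the real one in this project is 'biquge'
    name = 'biquge_chapters'
    allowed_domains = ['paoshuzw.com']
    start_urls = ['http://www.paoshuzw.com/10/10489/']

    def parse(self, response):
        for a in response.xpath("//dd/a"):
            name = a.xpath("./text()").get()
            href = a.xpath("./@href").get()
            # response.follow resolves the relative link, and cb_kwargs
            # carries the chapter name over to the callback
            yield response.follow(href, callback=self.parse_chapter,
                                  cb_kwargs={"name": name})

    def parse_chapter(self, response, name):
        # the content XPath is a guess and depends on the actual page structure
        text = "".join(response.xpath("//div[@id='content']//text()").getall())
        # a plain dict also works as an item, so no extra Field is needed for this sketch
        yield {"name": name, "href": response.url, "content": text}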


Origin blog.csdn.net/qq_51598376/article/details/113760311