[Scrapy Five-Minute Website] [Tech Industry News] Scrapy in Action: Scraping 36Kr Site Data

Target website introduction

36氪 (36Kr) delivers the most cutting-edge information to its users through a comprehensive and exclusive lens, committed to letting some people see the future first. Its coverage spans newsletters, technology, finance, investment, real estate, automobiles, the Internet, the stock market, education, life, the workplace, and more...

Start Scrapy

Preparation for data collection

1. If you are new to the idea of crawling a website in five minutes, first read
[Scrapy Five-Minute Website] Basics of whole-site data capture

2. If you are new to data capture, task management, and sorting, first read
[Scrapy Five-Minute Website] Crawler target sorting and data preparation

3. If you are new to mass-producing spiders from the Scrapy template, read this first (required)
[Scrapy Five-Minute Website] General template for data-capture projects

Data collation results

1. Collected URLs for all channels

2. Screenshot of the results saved in Excel

Template application

The <project>.py file under the spiders directory

1. Create a spider project

scrapy genspider www_36kr_com " "

2. Organize the site-wide CSS styles

Let's first look at the CSS styles of the page; the styles are consistent across the whole site.

3. Modify the content of www_36kr_com.py

Only the parts that need changing are explained here; everything else follows the template and needs no modification.

  • Scope & custom description
    allowed_domains = []
    web_name = "36氪"
  • Add crawl data information
    start_menu = [
        # First page of the all-news channels
        [
            {"channel_name": "最新", "url": "https://36kr.com/information/web_news/latest"},
            {"channel_name": "推荐", "url": "https://36kr.com/information/web_recommend"},
            {"channel_name": "创投", "url": "https://36kr.com/information/contact"},
            {"channel_name": "Markets", "url": "https://36kr.com/information/ccs"},
            {"channel_name": "汽车", "url": "https://36kr.com/information/travel"},
            {"channel_name": "科技", "url": "https://36kr.com/information/technology"},
            {"channel_name": "企服", "url": "https://36kr.com/information/enterpriseservice"},
            {"channel_name": "生活", "url": "https://36kr.com/information/happy_life"},
            {"channel_name": "创新", "url": "https://36kr.com/information/innovate"},
            {"channel_name": "房产", "url": "https://36kr.com/information/real_estate"},
            {"channel_name": "职场", "url": "https://36kr.com/information/web_zhichang"},
            {"channel_name": "其他", "url": "https://36kr.com/information/other"},
        ]
        # Dynamically loaded part and the subsequent page numbers
        # The load URL is https://gateway.36kr.com/api/mis/nav/ifm/subNav/flow
        # The parameters look like the following; later page numbers must be assembled yourself
        # {"partner_id": "web", "timestamp": 1614135556442,
        #  "param": {"subnavType": 1, "subnavNick": "web_news", "pageSize": 30, "pageEvent": 1,
        #            "pageCallback": "eyJmaXJzdElkIjozMjIwNjQ1LCJsYXN0SWQiOjMyMjAyNjgsImZpcnN0Q3JlYXRlVGltZSI6MTYxNDEyMTg5ODE2NSwibGFzdENyZWF0ZVRpbWUiOjE2MTQwNzc5ODUwMDB9",
        #            "siteId": 1, "platformId": 2}}
    ]
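For the dynamically loaded pages, the parameters captured above can be assembled into the POST body programmatically. A minimal sketch, assuming the field values mirror the capture (the helper name `build_subnav_payload` is mine, not part of the template):

```python
import time


def build_subnav_payload(subnav_nick, page_callback="", page_size=30):
    """Build the request body for 36Kr's dynamic-load API
    (https://gateway.36kr.com/api/mis/nav/ifm/subNav/flow).

    On the first request page_callback is empty; each response returns
    a new pageCallback token to pass along for the next page.
    """
    return {
        "partner_id": "web",
        "timestamp": int(time.time() * 1000),  # millisecond timestamp, as captured
        "param": {
            "subnavType": 1,
            "subnavNick": subnav_nick,  # e.g. "web_news" for the latest-news channel
            "pageSize": page_size,
            "pageEvent": 1,
            "pageCallback": page_callback,
            "siteId": 1,
            "platformId": 2,
        },
    }
```

POST this dict as JSON, read the new `pageCallback` from the response, and feed it back in to walk through the pages.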
  • Style finishing

Add one parseN method for each group in the site-wide data list:

        parse_list = [
            self.parse1,  # 全资讯频道第一页部分
        ]
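Presumably the template's start_requests pairs each channel group in start_menu with the parseN at the same position in parse_list. A sketch of that dispatch under this assumption (`build_requests` is a hypothetical name, not part of the template):

```python
def build_requests(start_menu, parse_list):
    """Pair each channel group in start_menu with its parser at the
    same index in parse_list, yielding (url, channel_name, callback)
    tuples ready to be turned into scrapy.Request objects."""
    for group, callback in zip(start_menu, parse_list):
        for channel in group:
            yield channel["url"], channel["channel_name"], callback
```

In a real spider each tuple would become `scrapy.Request(url, callback=callback, cb_kwargs={"channel_name": name})`.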
  • Title & Link & Cover
        Item_title = response.xpath('//div[@class="kr-shadow-content"]/div[2]/p/a/text()').extract()  # list of article titles
        Item_url = response.xpath('//div[@class="kr-shadow-content"]/div[2]/p/a/@href').extract()  # list of article links
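The two extracted lists line up one-to-one, so a parse method can pair them with zip. A sketch (`pair_articles` is a hypothetical helper; 36Kr hrefs are often site-relative, hence the prefix):

```python
def pair_articles(titles, urls, base="https://36kr.com"):
    """Pair the extracted title and href lists into article dicts,
    prefixing the site root onto relative links."""
    items = []
    for title, url in zip(titles, urls):
        if url.startswith("/"):
            url = base + url
        items.append({"title": title.strip(), "url": url})
    return items
```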

The parse_detail.py file under the spiders directory

1. Fetch the detail-page content

Modify the CSS extraction style for the detail pages of the list data:

    # The detail page keeps its formatting, so grab the whole styled block
    item['content'] = ""
    if 'class="article-content"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="article-content"]').extract_first()
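None2Str is not defined in this excerpt; presumably it is a template helper that turns None into an empty string so the len() check never raises. A minimal sketch consistent with that use:

```python
def None2Str(value):
    """Return "" for None so checks like len(None2Str(x)) < 5 are
    safe even when the field was never filled in."""
    return "" if value is None else str(value)
```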

2. Special instructions

Some sites' front-end markup is maddeningly inconsistent: ten pages, nine different styles. Since we cannot open every page to inspect its detail-page CSS, here is a general workaround.

  • After the first crawl of the content, open the MongoDB shell and run the following query to find documents whose content still contains the page body. These are pages that were not matched by any specified style and were instead saved whole.
db.your_collection.find({content:/body/})


  • Open some of the matched links, add extraction styles for them, and re-crawl the detail pages in a loop until the Mongo query no longer returns any documents.
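When re-processing stored documents in Python, the same filter as the Mongo command can be mirrored with a regex. A sketch (`needs_recrawl` is a hypothetical name; it flags documents that were saved as whole pages):

```python
import re

# A stored document whose content still contains "body" was saved as the
# whole page rather than the targeted article block.
WHOLE_PAGE = re.compile(r"body")


def needs_recrawl(content):
    """Python mirror of db.your_collection.find({content:/body/})."""
    return bool(content) and WHOLE_PAGE.search(content) is not None
```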


Origin blog.csdn.net/qq_20288327/article/details/114013817