Crawler practice: scraping Jianshu users' activity feed (handling AJAX)

Preface:

This post deals with AJAX dynamic loading: crawling the activity feed of Jianshu users and storing the crawled data in a MongoDB database.

This article organizes the code, walks through the ideas, and verifies that the code works. --2020.1.21


Environment:
Python 3 (Anaconda3)
PyCharm
Chrome browser

Main modules (run the install command in parentheses from a cmd window):
requests (pip install requests)
lxml (pip install lxml)
pymongo (pip install pymongo)

1

First, a brief introduction to asynchronous loading (AJAX): it is a technique that allows parts of a web page to be updated without reloading the entire page.[1]

On the page this shows up as in the screenshots: clicking "Articles" and then "Activity" does not change the URL at all; this is so-called asynchronous loading (AJAX) at work.

2

So how do we handle this when crawling? Open Developer Tools with F12, switch to the Network tab and select the XHR filter, then click "Activity". The Developer Tools panel should now look like the screenshot: a file named timeline?_pjax=%23list-container has been generated. This is the first step: find the dynamically loaded file and obtain the real request URL (highlighted in the box).
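As a side note, the `_pjax=%23list-container` part of that URL is just a percent-encoded value. A quick check with Python's standard library (not part of the crawler itself) shows it is simply the CSS selector of the container being updated:

```python
from urllib.parse import unquote

# %23 is the percent-encoding of '#', so the _pjax parameter
# is just the CSS selector '#list-container'
print(unquote('%23list-container'))  # #list-container
```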

3

At this point we can try to simplify the URL by deleting its unnecessary parameters.

# Original URL
https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container

# Simplified
https://www.jianshu.com/users/c5a2ce84f60b/timeline

This way we can construct further URLs from the simplified one.
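The trimming above can also be done programmatically. Here is a small sketch using only the standard library (the complete code below just hardcodes the simplified URL instead):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string from a URL, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

url = 'https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container'
print(strip_query(url))
# https://www.jianshu.com/users/c5a2ce84f60b/timeline
```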

4

When we look for the other pages, we find there is none of the usual page navigation: Jianshu implements paging through asynchronous loading too, so scroll down the page and watch which files get loaded.
Staying on the page, you might take a naive guess at the paging parameter, e.g.: https://www.jianshu.com/users/c5a2ce84f60b/timeline?page=2, but it is not that simple.
Yes, this URL is accessible. BUT what it returns is exactly the same as the first page /(¨Ò o ¨Ò)/~~, which shows that the other parameter, max_id, is also essential. So next we consider how to obtain max_id.
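Assuming max_id is known (obtaining it is the next step), the paged URL can be assembled like this. This is a sketch using the standard library's urlencode; the complete code below builds the same URL with string formatting instead:

```python
from urllib.parse import urlencode

def build_page_url(user_id, max_id, page):
    """Build a timeline page URL carrying both the max_id and page parameters."""
    base = 'https://www.jianshu.com/users/%s/timeline' % user_id
    return base + '?' + urlencode({'max_id': max_id, 'page': page})

# The id value here is the sample feed id from this article
print(build_page_url('c5a2ce84f60b', 578127155, 2))
# https://www.jianshu.com/users/c5a2ce84f60b/timeline?max_id=578127155&page=2
```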

5

Here you need a sharp pair of eyes: looking through the various XHR files, we find that in each one the id attribute of the last li element is max_id + 1. OK, problem solved.
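Extracting that id can be sketched as follows. For illustration this uses the standard library's XML parser on a made-up two-item fragment (the feed ids here are invented; the complete code below uses lxml's XPath on the real page, then passes the extracted number into the max_id parameter):

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the shape of Jianshu's note list
snippet = ('<ul class="note-list">'
           '<li id="feed-578127160">first activity</li>'
           '<li id="feed-578127155">last activity</li>'
           '</ul>')

root = ET.fromstring(snippet)
ids = [li.get('id') for li in root.findall('li')]
feed_id = ids[-1]                 # 'feed-578127155'
max_id = feed_id.split('-')[1]    # keep only the numeric part
print(max_id)                     # 578127155
```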

6

Steps 2-6 are how to deal with Jianshu's simple AJAX; this method is called reverse engineering. Once this 'anti-crawler' measure has been dealt with, the next job is to collect the information and insert it into the MongoDB database.
The information we need to crawl is the activity type (see below; "like_note" means "liked an article") and the time the activity was published.

7

I won't go into the detailed page parsing here; see the complete code. The code is fairly simple and I have commented every part of it. If anything is still unclear, please leave a comment or send me a private message.

The complete code

# url = "https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container"
# Import libraries
import requests
from lxml import etree
import pymongo

# Connect to MongoDB
client = pymongo.MongoClient('localhost', 27017)

# Create the database and the collection
mydb = client['mydb']
timeline = mydb['timeline']

# Add request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}


def get_time_info(url, page):
    # Split the url to get the user id; the url looks like: https://www.jianshu.com/users/c5a2ce84f60b/timeline
    user_id = url.split('/')
    user_id = user_id[4]

    # URLs after the first page contain 'page='; only then advance the page counter
    if 'page=' in url:
        page = page + 1

    html = requests.get(url=url, headers=headers)
    selector = etree.HTML(html.text)
    print(url, html.status_code)

    # First split the page into individual li blocks to make parsing easier
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        # Time of the activity
        date = info.xpath('div/div/div/span/@data-datetime')[0]
        # Activity type
        note_type = info.xpath('div/div/div/span/@data-type')[0]

        # Insert the data as a JSON document (i.e. a dict)
        timeline.insert_one({'date': date, 'type': note_type})
        print({'date': date, 'type': note_type})

    # Get the li ids so we can construct the url of the next dynamically loaded page
    id_infos = selector.xpath('//ul[@class="note-list"]/li/@id')
    if len(infos) > 1:
        feed_id = id_infos[-1]
        # The raw feed id looks like: feed-578127155, so split it manually
        max_id = feed_id.split('-')[1]
        # Construct the url of the next page
        next_url = 'http://www.jianshu.com/users/%s/timeline?max_id=%s&page=%s' % (user_id, max_id, page)
        # Recurse to crawl the next page
        get_time_info(next_url, page)


if __name__ == '__main__':
    get_time_info('https://www.jianshu.com/users/c5a2ce84f60b/timeline', 1)

1. Ajax (Ajax development) ↩︎



Origin blog.csdn.net/weixin_44835732/article/details/104065246