Crawler URL concatenation problem

Disclaimer: This is the blogger's original article, released under the CC 4.0 BY-SA license. When reproducing it, please include a link to the original source and this statement.
Original link: https://blog.csdn.net/gklcsdn/article/details/102727143

While writing a crawler recently, I ran into a rather odd URL concatenation problem. A simplified version follows.

1. The known URL

https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0
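To see which part of this URL has to change, it can help to break the query string into its parameters. The following is a minimal sketch using the standard library's `urllib.parse`; it is not part of the original post, and the variable names are chosen here for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://movie.douban.com/j/search_subjects"
       "?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0")

# urlsplit separates scheme, host, path, and query string
parts = urlsplit(url)

# parse_qs decodes percent-escapes and maps each name to a list of values
params = parse_qs(parts.query)

for name, values in params.items():
    print(name, values)
```

Only `page_start` differs between the pages; the other parameters, including the percent-encoded `tag`, stay the same.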

2. The URLs to generate

https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=20

https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=40

......

3. Solution
# @Time : 2019/10/24 14:17
# @Author : GKL
# FileName : test.py
# Software : PyCharm

import re
import time


def spider(string, page):

    url = "https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0"

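    # Split the URL on '&' to get a temporary list of its parts
    tem_li = url.split('&')

    # Save the last item of tem_li (the page_start parameter); it is the template for the next string
    num = tem_li[-1]

    # Replace the last item of tem_li by index
    tem_li[-1] = string

    # Join the parts back into the new URL
    url = '&'.join(tem_li)

    print(url)

    # Compute the page offset for the next URL
    page += 20
    # Stop once the offset reaches 200 (>= also guards against offsets that skip past 200)
    if page >= 200:
        return

    # Build the new string by substituting the new offset into the saved parameter
    string = re.sub(r'\d+', str(page), num)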
    time.sleep(1)
    spider(string, page)


if __name__ == '__main__':
    spider('page_start=0', 0)
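The same URLs can also be produced without splitting on `&` or using recursion. The sketch below is an alternative, not the author's method: it rebuilds the query string with `urllib.parse` and a plain loop. The function name `page_urls` and its `step` and `stop` parameters are hypothetical names introduced here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE = ("https://movie.douban.com/j/search_subjects"
        "?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0")


def page_urls(url, step=20, stop=200):
    """Yield copies of url with page_start set to 0, step, 2*step, ... below stop."""
    parts = urlsplit(url)
    # parse_qsl decodes percent-escapes; dict keeps the parameter order
    params = dict(parse_qsl(parts.query))
    for start in range(0, stop, step):
        params["page_start"] = str(start)
        # urlencode percent-encodes the values again when rebuilding the query
        yield urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == '__main__':
    for u in page_urls(BASE):
        print(u)
```

Because the parameter is looked up by name, this version does not depend on `page_start` being the last item in the query string. A real crawler would still want a delay between requests, as the `time.sleep(1)` in the original does.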
