Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. If you reproduce it, please include a link to the original source and this statement.
Crawler URL concatenation problem
While writing a crawler recently, I ran into a rather odd URL concatenation problem. A simplified version is as follows.
1. Known starting URL
https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0
2. URLs to generate from it (only page_start changes, in steps of 20)
https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=20
https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=40
......
3. Solution
# @Time : 2019/10/24 14:17
# @Author : GKL
# FileName : test.py
# Software : PyCharm
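import re
import time


def spider(string, page):
    url = "https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0"
    # Split the URL on '&' to get a temporary list of its parts
    tem_li = url.split('&')
    # Keep the last item ('page_start=0'); it is the template for the next string
    num = tem_li[-1]
    # Replace the last item of tem_li with the current string
    tem_li[-1] = string
    # Join the parts back into the new URL
    url = '&'.join(tem_li)
    print(url)
    # Compute the page offset of the next URL
    page += 20
    # Stop once the offset reaches 200 (the last URL printed uses page_start=180)
    if page == 200:
        return
    # Build the next string by substituting the new offset for the digits
    string = re.sub(r'\d+', str(page), num)
    # Pause for one second between iterations
    time.sleep(1)
    spider(string, page)


if __name__ == '__main__':
    spider('page_start=0', 0)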
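
The solution above relies on page_start being the last parameter in the URL. As a possible alternative (not the approach used above), here is a minimal sketch that rebuilds the query string with the standard library's urllib.parse, so the parameter can sit anywhere in the query. The function name paged_urls and its step, stop, and param arguments are made up for this illustration; the defaults are chosen to reproduce the same ten URLs (page_start from 0 to 180).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(url, step=20, stop=200, param="page_start"):
    """Yield copies of url with param set to 0, step, 2*step, ... below stop.

    Hypothetical helper for illustration; if param is absent from the
    query string, the URL is yielded unchanged.
    """
    parts = urlsplit(url)
    # parse_qsl decodes percent-escapes and keeps the parameters in order
    query = parse_qsl(parts.query, keep_blank_values=True)
    for offset in range(0, stop, step):
        # Replace only the target parameter, wherever it appears
        new_query = [(k, str(offset) if k == param else v) for k, v in query]
        # urlencode re-applies percent-encoding (UTF-8) to the values
        yield urlunsplit(parts._replace(query=urlencode(new_query)))


base = ("https://movie.douban.com/j/search_subjects?type=tv"
        "&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start=0")
urls = list(paged_urls(base))
for u in urls[:3]:
    print(u)
```

Because this is a generator with a plain loop, there is no recursion depth to worry about, and a time.sleep(1) between requests can be added by whatever code consumes the URLs.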