Python Crawler in Practice: LOL Hero Skin Images (a multi-threaded crawler, with detailed comments, that grabs every skin)

Foreword

        Crawling LOL's skins is something I wanted to do back when I was first learning to write crawlers, and my course project did exactly that. My skills were limited at the time, though, and the data I crawled was a mess, so now that I have some time I'm revisiting the site to see how to crawl it properly.

Goals

1. Download the skin images (I save them in PNG format)

2. Get each skin's name, to use as the file name


Website analysis

        I read some other crawling tutorials online, but the data they crawled was incomplete and many skins were missed, so I decided to work it out myself.

1. Open the official LOL website, scroll to the bottom, and click on a hero

 I just picked the first hero, Annie.

 Open the hero's page, press F12, then F5 to refresh. Among the network requests the site returns, I found a JSON file that looked suspicious at first glance. Sure enough, after clicking Preview, all the data we need is sitting inside this one JSON file.

 

 

You can see it contains the hero skin names we need, and after visiting the links inside one by one, I found the image links we need as well.

1.1 Supplementary Notes

A quick aside: some tutorials online get the images by constructing the image links themselves.

 A link like that is directly accessible: the first 1 is the hero's id, and the trailing 0 selects the hero's first skin. Change those two parameters and you get different image data.

 There is a problem, though: this trick only works for very old skins. Presumably, at some point Tencent changed how the data is stored.

For example:

       Annie has 16 skins, but when we change the number to 16, the constructed link cannot be accessed.
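To make the probing concrete, here is a minimal sketch. It assumes the old-style constructed link follows a pattern like big<heroId><skinIndex padded to 3 digits>.jpg under game.gtimg.cn; that pattern is my reconstruction from memory rather than something shown above, so treat it as an assumption:

import requests

# Hypothetical pattern for the old-style constructed links (an assumption,
# reconstructed from memory): hero id followed by a 3-digit skin index
OLD_URL = 'https://game.gtimg.cn/images/lol/act/img/skin/big{hero}{skin:03d}.jpg'

def probe_old_links(hero_id, max_skin=16):
    """Print which skin indices are still reachable through the old links."""
    for skin in range(max_skin + 1):
        status = requests.get(OLD_URL.format(hero=hero_id, skin=skin)).status_code
        print(skin, status)  # newer skins tend to come back as 404

probe_old_links(1)  # hero id 1 is Annie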

 2. Data link

Take a look at the link:

https://game.gtimg.cn/images/lol/act/img/js/hero/1.js?ts=2800033

Everything after the ? is just query parameters and can be ignored. The interesting part is the 1 before .js.

 After changing the 1 to a 2, data clearly comes back, but when I try some other numbers I get a 404 (the number I tried was 130).

 But LOL clearly has 163 heroes, so why does 130 come up empty? (There were 163 heroes at the time I crawled.)

After some observation, I found that this number is actually the hero's id, that is, the official numbering.

 You can see that the ids of the newest heroes are much larger than those of earlier heroes; this gap did not exist in the early days.
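You can verify this quickly with a few requests. A small sketch, using only the ids discussed above:

import requests

# Probe a few hero detail endpoints: valid ids return data, gaps return 404
for hero_id in (1, 2, 130):
    url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/' + str(hero_id) + '.js'
    print(hero_id, requests.get(url).status_code)  # expect 200, 200, 404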

 3. Get the hero's id

You cannot get a hero's id from the hero detail page itself, so we need to go one level up from the detail page and look there.

 Click into the hero list and refresh; you will find a JSON file named hero_list, which contains the hero ids we need.
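Before writing the real crawler, a two-line peek confirms the structure (the 'hero' and 'heroId' keys are the ones the code below relies on):

import requests

url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
data = requests.get(url).json()
print(data['hero'][0])  # each entry in the 'hero' list carries a 'heroId' field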

4. The plan

1. First, visit the hero_list link to get the hero ids

url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'

2. Then request the JSON file corresponding to each id to get the skin images and their names

Writing the code

Required libraries:

import os
import requests
import time
from concurrent.futures import ThreadPoolExecutor

Set the crawler's request headers:

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.54'
}

1. As analyzed above, the first step is to get the hero ids. The request returns a JSON file, and the data we want sits in a list inside it, so rather than taking the raw text we call .json() directly, take the list that stores the data, then loop over it and collect the ids we need into a new list.

# Get the ids of all LOL heroes
def ID():
    # List to hold the hero ids
    id = []
    # The endpoint that stores the hero ids
    burl = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
    breq = requests.get(burl, headers=header).json()  # taking the response as JSON gets us to the data we want faster
    for i in breq['hero']:
        id.append(i['heroId'])  # add each id to the list
    return id
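A quick sanity check of the function (163 ids came back when I crawled):

ids = ID()
print(len(ids), ids[:5])  # 163 at the time of writing; the earliest heroes have small ids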

2. Using each hero's id, request the JSON file that stores the skin names and skin links, and extract the data we need. (If you choose to take the response as text instead, you have to decode the returned data yourself.)

# Get each hero's skin names, used to name the files
def name_spider(id, choose):
    names = []  # list of skin names
    surl = []  # list of skin image links
    uname = 'https://game.gtimg.cn/images/lol/act/img/js/hero/' + str(id) + '.js'
    # Request the link to get this hero's detail JSON
    ureq = requests.get(uname, headers=header).json()
    # Walk the JSON and pull out the data we want
    for k in ureq['skins']:
        if k['chromas'] == '0':  # chromas == '0' marks the skin's original splash art ( == '1' is a chroma recolor, which we don't need)
            names.append(k['name'])
            surl.append(k['mainImg'])
    # Crawling and saving the images lives in its own function (writing it all in one place got long and ugly, so I split it up)
    photo_spider(names, surl, choose)  # pass the skin names and image links along
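For reference, one entry of ureq['skins'] looks roughly like the dictionary below. The field values here are illustrative rather than copied from a live response; only the three fields the code reads are shown:

sample_skin = {
    'name': '黑暗之女 安妮',  # the skin's name, used as the file name
    'chromas': '0',          # '0' = splash art; '1' = a chroma recolor, skipped
    'mainImg': 'https://game.gtimg.cn/images/lol/act/img/skin/big1000.jpg',  # illustrative image link
}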

3. Visit each skin link we obtained and download the image. Since we are crawling images, we take the response as a binary stream.

# Download the actual images. choose == 0 sorts the images into a folder per hero;
# any other number saves them straight into the target folder.
def photo_spider(names, surl, choose=1, path=r'G:\工作\工作文件\代码学习\LOL皮肤'):
    # Decide how the images are stored
    if choose == 0:
        # Create a folder for this hero under the target path
        folder = os.path.join(path, str(names[0]))
        os.makedirs(folder, exist_ok=True)
    else:
        folder = path
    # Request each image link
    for i in range(len(surl)):
        photoreq = requests.get(surl[i], headers=header).content  # these are images, so take the binary content
        # Save to the folder, named after the skin, in png format. Some skin names
        # contain "/" (e.g. the K/DA series), which breaks the file path, so strip it.
        # Build the full path instead of os.chdir(): the working directory is shared
        # by all threads, so chdir in one thread would misplace another's files.
        with open(os.path.join(folder, names[i].replace('/', '') + '.png'), 'wb') as f:
            f.write(photoreq)

 Here I provide two ways of saving: one puts all the images into a single folder; the other creates a folder per hero and puts each hero's skins inside it.

4. Turn on multi-threaded crawling. To keep the threads from getting tangled up, we also bring in a thread pool.

A classic joke contrasts multithreading as you imagine it with multithreading in practice, and without a thread pool that is exactly what you get: since we are downloading images, if two threads grab the same one, the later download simply overwrites the earlier one. That wastes work and drags our efficiency down, and at that point what difference does enabling multi-threading even make?

I put multithreading in the main function:

if __name__ == '__main__':
    # Program start time
    time_start = time.time()
    choose = int(input("Enter the storage mode (0 sorts the skins into a folder per hero): "))
    # Call the function that crawls the hero ids
    id = ID()
    # Create a thread pool with 10 worker threads
    pool = ThreadPoolExecutor(max_workers=10)
    for hero_id in id:
        # Hand each time-consuming task to the pool's threads
        pool.submit(name_spider, hero_id, choose)  # the function to run, plus its arguments
    pool.shutdown()  # wait until every thread has finished before main exits
    # Program end time
    time_end = time.time()
    print('Program run time: ' + str(time_end - time_start))

 With 10 threads it took me 35 s to crawl every image; with a single thread it took 350 s, exactly 10 times as long.

Summary

 Overall, this crawl is not difficult, and grabbing everything does not take much time; that is with multi-threading on, and I only used 10 threads. But don't open too many threads, and don't put too much pressure on other people's servers: if you crash one, the police will come knocking.

Full code

# -*- coding: utf-8 -*-
# @Time : 2023/3/10 14:16
# @File : LOL_Photo.py
# @Software: PyCharm

import os
import requests
import time
from concurrent.futures import ThreadPoolExecutor


header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.54'
}


# Get the ids of all LOL heroes
def ID():
    # List to hold the hero ids
    id = []
    # The endpoint that stores the hero ids
    burl = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
    breq = requests.get(burl, headers=header).json()  # taking the response as JSON gets us to the data we want faster
    for i in breq['hero']:
        id.append(i['heroId'])  # add each id to the list
    return id


# Get each hero's skin names, used to name the files
def name_spider(id, choose):
    names = []  # list of skin names
    surl = []  # list of skin image links
    uname = 'https://game.gtimg.cn/images/lol/act/img/js/hero/' + str(id) + '.js'
    # Request the link to get this hero's detail JSON
    ureq = requests.get(uname, headers=header).json()
    # Walk the JSON and pull out the data we want
    for k in ureq['skins']:
        if k['chromas'] == '0':  # chromas == '0' marks the skin's original splash art ( == '1' is a chroma recolor, which we don't need)
            names.append(k['name'])
            surl.append(k['mainImg'])
    # Crawling and saving the images lives in its own function (writing it all in one place got long and ugly, so I split it up)
    photo_spider(names, surl, choose)  # pass the skin names and image links along


# Download the actual images. choose == 0 sorts the images into a folder per hero;
# any other number saves them straight into the target folder.
def photo_spider(names, surl, choose=1, path=r'G:\工作\工作文件\代码学习\LOL皮肤'):
    # Decide how the images are stored
    if choose == 0:
        # Create a folder for this hero under the target path
        folder = os.path.join(path, str(names[0]))
        os.makedirs(folder, exist_ok=True)
    else:
        folder = path
    # Request each image link
    for i in range(len(surl)):
        photoreq = requests.get(surl[i], headers=header).content  # these are images, so take the binary content
        # Save to the folder, named after the skin, in png format. Some skin names
        # contain "/" (e.g. the K/DA series), which breaks the file path, so strip it.
        # Build the full path instead of os.chdir(): the working directory is shared
        # by all threads, so chdir in one thread would misplace another's files.
        with open(os.path.join(folder, names[i].replace('/', '') + '.png'), 'wb') as f:
            f.write(photoreq)


if __name__ == '__main__':
    # Program start time
    time_start = time.time()
    choose = int(input("Enter the storage mode (0 sorts the skins into a folder per hero): "))
    # Call the function that crawls the hero ids
    id = ID()
    # Create a thread pool with 10 worker threads
    pool = ThreadPoolExecutor(max_workers=10)
    for hero_id in id:
        # Hand each time-consuming task to the pool's threads
        pool.submit(name_spider, hero_id, choose)  # the function to run, plus its arguments
    pool.shutdown()  # wait until every thread has finished before main exits
    # Program end time
    time_end = time.time()
    print('Program run time: ' + str(time_end - time_start))
