2023-11 | Short video batch download/crawl all videos of a user | Python

This walkthrough uses Ju Jingyi's profile page as the demo:

https://www.douyin.com/user/MS4wLjABAAAACV5Em110SiusElwKlIpUd-MRSi8rBYyg0NfpPrqZmykHY8wLPQ8O4pv3wPL6A-oz

[2023-11-4 23:02:52 Saturday] Note: this method may stop working if XX later changes its site.

Find the interface

In the browser's network panel, find the request whose path is https://www.douyin.com/aweme/v1/web/aweme/post/

Preview the response data: if the entries match the descriptions of the posted videos, this is the right interface. Note that a single response contains only 18 entries.

The remaining entries arrive in further responses as you scroll down the page.
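Since each response holds only 18 entries, one way to cover a whole profile is to save every response that appears while scrolling and merge them afterwards. The following is a minimal sketch under that assumption, not part of the original method. `merge_pages` and `merge_files` are hypothetical helper names, and `aweme_id` is assumed to be a unique per-video field in the response; entries without it are kept without de-duplication.

```python
import json


def merge_pages(pages):
    """Combine several saved responses into one dict with a single aweme_list."""
    merged = []
    seen = set()
    for page in pages:
        # `aweme_list` may be missing or null in an empty response
        for aweme in page.get('aweme_list') or []:
            # `aweme_id` is an assumed unique key, used only to drop duplicates
            key = aweme.get('aweme_id')
            if key is not None:
                if key in seen:
                    continue
                seen.add(key)
            merged.append(aweme)
    return {'aweme_list': merged}


def merge_files(json_files, out_file):
    """Read each saved response file and write the merged result to out_file."""
    pages = []
    for name in json_files:
        with open(name, 'r', encoding='utf-8') as f:
            pages.append(json.load(f))
    with open(out_file, 'w', encoding='utf-8') as f:
        json.dump(merge_pages(pages), f, ensure_ascii=False)
```

The merged file has the same top-level `aweme_list` key as a single response, so it can be passed to any code that expects one saved response.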

The interface takes 37 parameters in total. Changing any one of them results in no data being returned (the status code is still 200, but the body is empty). I have not found a way around this yet...
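Because the interface is this sensitive to its parameters, it helps to copy them from the browser mechanically rather than retyping them. The sketch below splits a request URL copied from the network panel into a base URL and an ordered list of (name, value) pairs using the standard library. `params_from_url` is a hypothetical helper name. It does not generate values such as `msToken` or `X-Bogus`; those still have to come from a real browser request.

```python
from urllib.parse import urlsplit, parse_qsl


def params_from_url(request_url):
    """Return (base_url, params), with params as an ordered list of (name, value) pairs."""
    parts = urlsplit(request_url)
    base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'
    # keep_blank_values=True preserves empty parameters such as whale_cut_token
    params = parse_qsl(parts.query, keep_blank_values=True)
    return base_url, params
```

For a full request copied from the browser, `len(params)` should come out as 37; a smaller number suggests the URL was truncated when it was copied.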

Save the returned data to a JSON file

Download the videos

import requests
import json
import os

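# TODO: more thorough error handling
def download_video(url, path):
    print('\nDownloading video...', os.path.basename(path))
    r = requests.get(url, stream=True, timeout=30)
    # Raise on 4xx/5xx so the caller can fall through to the next address
    r.raise_for_status()
    # content-length may be absent; fall back to 0 and skip the progress bar
    total_length = int(r.headers.get('content-length', 0))
    print('Video size:', total_length)
    with open(path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
            # Print the progress bar
            if total_length:
                done = f.tell() / total_length
                print('\r[Progress]: %s%.2f%%' % ('>' * int(done * 50), done * 100), end='')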


index = 0

# json_file: path to the JSON file saved from the interface response
# save_file_dir: directory in which to save the videos (with a trailing slash)
def save_video_batch(json_file, save_file_dir):
    global index
    if not os.path.exists(save_file_dir):
        os.makedirs(save_file_dir)
    # Read the JSON file
    with open(json_file, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    aweme_list = json_data['aweme_list']
    for aweme in aweme_list:
        video_url_list = aweme['video']['play_addr']['url_list']
        video_name = aweme['desc']
        # Each video has three addresses; stop at the first one that succeeds
        index += 1
        for video_url in video_url_list:
            # print(video_url)
            try:
                download_video(video_url, f'{save_file_dir}{index}-{video_name}.mp4')
                break
            except Exception as e:
                print('Download failed:', e)

save_video_batch('../params/鞠婧祎主页.json', '../data/鞠婧祎主页/')
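One thing to watch in the script above: the file name is built from the video description (`desc`), which can contain slashes, line breaks, or other characters that a file system rejects. The helper below is a minimal sketch of one way to clean it up. `safe_filename` and its `max_len` default are my own illustrative choices, not part of the original code.

```python
import re


def safe_filename(desc, max_len=80):
    """Turn a video description into a string that is safe to use as a file name."""
    # Replace path separators, characters Windows forbids, and control characters
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', desc)
    # Trim to max_len, then drop leading/trailing spaces and dots
    name = name[:max_len].strip(' .')
    # Fall back to a placeholder when nothing usable is left
    return name or 'untitled'
```

In `save_video_batch`, this would mean writing `video_name = safe_filename(aweme['desc'])`. The running `index` prefix already keeps names unique even when two cleaned descriptions collide.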

Download results

image-20231104232212786

I have to say, "a certain Yin" is well defended; it is really hard to crawl...

I tried fetching the HTML page directly and parsing it, but the HTML I got back was not the page shown in a real browser (and not the CAPTCHA page either; I checked).

Requesting the interface directly runs into the same problem. The request works in an API debugging tool but not from code: it returns status 200 with no data. The code is below; I don't know what is missing.

(I have blanked out some values that I consider sensitive; replace them with your own.)

import requests

headers = {
    'authority': 'www.douyin.com',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'no-cache',
    'cookie': 'cookie',   # replace with your own cookie
    'pragma': 'no-cache',
    'referer': 'https://www.douyin.com/user/MS4wLjABAAAA0W6MrnV7YIYmneCLCypeKVoZj4VDk9amQorNZ8aIVfs',
    'sec-ch-ua': '"Chromium";v="118", "Microsoft Edge";v="118", "Not=A?Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.76',
}

params = (
    ('device_platform', 'webapp'),
    ('aid', '6383'),
    ('channel', 'channel_pc_web'),
    ('sec_user_id', 'MS4wLjABAAAA0W6MrnV7YIYmneCLCypeKVoZj4VDk9amQorNZ8aIVfs'),
    ('max_cursor', '1696500302000'),
    ('locate_query', 'false'),
    ('show_live_replay_strategy', '1'),
    ('need_time_list', '0'),
    ('time_list_query', '0'),
    ('whale_cut_token', ''),
    ('cut_version', '1'),
    ('count', '18'),
    ('publish_video_strategy_type', '2'),
    ('pc_client_type', '1'),
    ('version_code', '170400'),
    ('version_name', '17.4.0'),
    ('cookie_enabled', 'true'),
    ('screen_width', '1707'),
    ('screen_height', '1067'),
    ('browser_language', 'zh-CN'),
    ('browser_platform', 'Win32'),
    ('browser_name', 'Edge'),
    ('browser_version', '118.0.2088.76'),
    ('browser_online', 'true'),
    ('engine_name', 'Blink'),
    ('engine_version', '118.0.0.0'),
    ('os_name', 'Windows'),
    ('os_version', '10'),
    ('cpu_core_num', '16'),
    ('device_memory', '8'),
    ('platform', 'PC'),
    ('downlink', '10'),
    ('effective_type', '4g'),
    ('round_trip_time', '50'),
    ('webid', '7297499797400897065'),
    ('msToken', 'xxx'),  # replace with your own token
    ('X-Bogus', 'xxx'),  # replace with your own value
)

response = requests.get('https://www.douyin.com/aweme/v1/web/aweme/post/', headers=headers, params=params)
# The response status is 200,
print(response.status_code)
# but the body contains no data
print(response.text)
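"Status 200 but no data" can mean several different things: a completely empty body, an HTML page instead of JSON, or valid JSON whose `aweme_list` is empty. Telling these apart narrows down what the request is missing. The helper below is a small diagnostic sketch. `diagnose` and its category strings are hypothetical, and it explains nothing about why the server answers that way.

```python
import json


def diagnose(status_code, text):
    """Return a short description of what an interface response actually contains."""
    if status_code != 200:
        return f'HTTP error {status_code}'
    if not text.strip():
        return 'empty body'
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return 'body is not JSON'
    if not isinstance(data, dict):
        return 'JSON is not an object'
    aweme_list = data.get('aweme_list')
    if not aweme_list:
        return 'JSON without videos'
    return f'{len(aweme_list)} videos'
```

With the request above, it would be called as `print(diagnose(response.status_code, response.text))`.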

The current method is still very troublesome and needs to be improved.

Ideally, I would only need to enter the URL of a profile page, such as https://www.douyin.com/user/MS4wLjABAAAACV5Em110SiusElwKlIpUd-MRSi8rBYyg0NfpPrqZmykHY8wLPQ8O4pv3wPL6A-oz, and all the videos on that page would be downloaded automatically.
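In the request code above, the `sec_user_id` parameter is the same string as the last path segment of the profile URL in `referer`. A first step toward the "URL in, videos out" workflow is therefore to pull that segment out of the URL. This is only a sketch of that one step, with `sec_user_id_from_url` as a hypothetical helper name; it does not solve the signed-parameter problem described earlier.

```python
from urllib.parse import urlsplit


def sec_user_id_from_url(profile_url):
    """Extract the ID segment from a https://www.douyin.com/user/<id> URL."""
    # urlsplit separates the path from any query string or fragment
    segments = [s for s in urlsplit(profile_url).path.split('/') if s]
    if len(segments) == 2 and segments[0] == 'user':
        return segments[1]
    raise ValueError(f'not a profile URL: {profile_url}')
```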

[2023-11-7 17:02:20 Tuesday]
Solved, hahaha. See here: https://www.Douyin.com/video/7298386922798468406

Origin blog.csdn.net/qq_50969362/article/details/134225161