【Experiment】Scraping m3u8 Video

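A couple of days ago I found a video site that had a show I wanted (https://m.jlszyy.org/play/2873-0-0.html). I wanted to scrape it, but it turned out to be less simple than I expected.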

The page contains a blob video address:

<video ... src="blob:https://jiexi.wysgtx.com/6764e0e3-2c88-44a4-bfea-341c32fb1697"></video>

This is not a direct link to an mp4 file, so the video cannot be fetched from this address.

After a day of digging through all kinds of material, I finally found a way. The analysis and the solution follow.

【Analysis】

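Press F12 to open the developer tools, click Network, select XHR, then force-refresh the page to see what actually gets loaded.

Focus on the first two index.m3u8 requests and compare their response contents.

[Screenshots: responses of the first and second index.m3u8]

As can be seen, the first index.m3u8 holds the address of the second m3u8, and the second m3u8 holds the addresses of the ts media segments.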
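
As a rough illustration of that two-level structure, the sketch below pulls the URI lines out of a playlist and resolves them against the playlist's own URL. The playlist text and the function name `playlist_uris` are made up for illustration; they are assumptions, not the site's actual response.

```python
from urllib.parse import urljoin


def playlist_uris(playlist_url, playlist_text):
    """Return the absolute URLs listed in an m3u8 playlist, in order.

    Lines starting with '#' are tags or comments; every other non-empty
    line is a URI, relative to the playlist's own URL unless absolute.
    """
    uris = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            uris.append(urljoin(playlist_url, line))
    return uris


# Hypothetical first-level playlist: it points at a second playlist.
master_text = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000
800k/hls/index.m3u8
"""
master_url = 'https://sina.com-h-sina.com/20180906/18237_09096b18/index.m3u8'
second_level = playlist_uris(master_url, master_text)
print(second_level)
```

Applying the same function to the text of the second playlist would return the ts segment URLs in playback order.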

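For what m3u8 and ts media segments are, see [1]; I won't go into it here.

For now it is enough to know that concatenating these segments in order yields the complete video I want.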

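The quickest approach is to copy the address of each video's second m3u8 (the one that lists all the ts segment addresses) and download it with ffmpeg.

On CentOS, the following command downloaded the video successfully (see [2]). Taking episode 23 as an example:

ffmpeg -i https://sina.com-h-sina.com/20180906/18237_09096b18/800k/hls/index.m3u8 023.mp4

If all you want is to download the video, you can stop here and skip the rest of the article. (On Windows there is a tool called M3U8 Downloader, also built on ffmpeg and simpler to use; see [3].)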
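
If the same command has to be run for many episodes, it can be wrapped in a few lines of Python. This is only a sketch: `ffmpeg_command` and `download_with_ffmpeg` are names invented here, and it assumes ffmpeg is installed and on the PATH. It builds exactly the command shown above and adds no extra ffmpeg options.

```python
import shutil
import subprocess


def ffmpeg_command(m3u8_url, output_path):
    """Build the ffmpeg command line shown above as an argument list."""
    return ['ffmpeg', '-i', m3u8_url, output_path]


def download_with_ffmpeg(m3u8_url, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    if shutil.which('ffmpeg') is None:
        raise RuntimeError('ffmpeg was not found on PATH')
    subprocess.run(ffmpeg_command(m3u8_url, output_path), check=True)


# Only build and print the command here; nothing is downloaded.
cmd = ffmpeg_command(
    'https://sina.com-h-sina.com/20180906/18237_09096b18/800k/hls/index.m3u8',
    '023.mp4',
)
print(' '.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if a URL contains characters such as `&` or `?`.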

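----------------------------- [ Fancy divider ] -----------------------------

At first I did not use ffmpeg correctly (I passed it the address of the first m3u8) and no video was downloaded, so I gave up on this handy tool and ended up writing my own scraper......

The m3u8 URL can be read from the request Headers.

My plan was to collect the address of the first m3u8 for every episode of the series (36 episodes), parse it and fetch the second m3u8, and then fetch all the ts files. (The 36 URLs could probably be gathered by a script rather than by hand, which is one way this approach could be improved. I did not do that optimization.)

Without further ado, here is the runnable Python code (final version: parses the m3u8 and downloads with multiple processes):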

from multiprocessing import Pool
import requests
import re
import os

## 36 URLs, one per episode
url_list=[
    'https://sina.com-h-sina.com/20180906/18215_c0fc8873/' ,
    'https://sina.com-h-sina.com/20180906/18216_80a7de01/' ,
    'https://sina.com-h-sina.com/20180906/18217_365c8b79/' ,
    'https://sina.com-h-sina.com/20180906/18218_f02026af/' ,
    'https://sina.com-h-sina.com/20180906/18219_7aca4012/' ,
    'https://sina.com-h-sina.com/20180906/18220_ed8429ba/' ,
    'https://cdn.youku-letv.net/20181130/11218_5aec4df0/' ,
    'https://cdn.youku-letv.net/20181130/11217_64a3d965/' ,
    'https://sina.com-h-sina.com/20180906/18223_5f07955a/' ,
    'https://sina.com-h-sina.com/20180906/18224_0f7d62f7/' ,
    'https://sina.com-h-sina.com/20180906/18225_70b7a36b/' ,
    'https://sina.com-h-sina.com/20180906/18226_e51fc698/' ,
    'https://sina.com-h-sina.com/20180906/18227_66271ec7/' ,
    'https://sina.com-h-sina.com/20180906/18228_026dee76/' ,
    'https://sina.com-h-sina.com/20180906/18229_371750e8/' ,
    'https://sina.com-h-sina.com/20180906/18230_997b5e5f/' ,
    'https://sina.com-h-sina.com/20180906/18231_88b8a296/' ,
    'https://sina.com-h-sina.com/20180906/18232_f20b2ca2/' ,
    'https://sina.com-h-sina.com/20180906/18233_3352578d/' ,
    'https://sina.com-h-sina.com/20180906/18234_347f6882/' ,
    'https://sina.com-h-sina.com/20180906/18235_44565aa7/' ,
    'https://cdn.youku-letv.net/20181130/11203_138f5930/' ,
    'https://sina.com-h-sina.com/20180906/18237_09096b18/' ,
    'https://sina.com-h-sina.com/20180906/18238_154e2c99/' ,
    'https://sina.com-h-sina.com/20180906/18239_e9fe736e/' ,
    'https://cdn.youku-letv.net/20181130/11199_aaf119a6/' ,
    'https://sina.com-h-sina.com/20180906/18241_e8d0a446/' ,
    'https://sina.com-h-sina.com/20180906/18242_ccfe1975/' ,
    'https://sina.com-h-sina.com/20180906/18243_316cafde/' ,
    'https://sina.com-h-sina.com/20180906/18244_c67bb56b/' ,
    'https://sina.com-h-sina.com/20180906/18245_f7707700/' ,
    'https://sina.com-h-sina.com/20180906/18246_01688d4e/' ,
    'https://sina.com-h-sina.com/20180906/18247_8bfc042c/' ,
    'https://sina.com-h-sina.com/20180906/18248_e85bfdc0/' ,
    'https://sina.com-h-sina.com/20180906/18249_b750a51f/' ,
    'https://cdn.youku-letv.net/20181130/11189_acdfba0e/'
]

## [Note] For 正阳门下, the first m3u8 points to another m3u8 rather than directly to the ts media segments
index='index.m3u8'
savefile_path='E:/Downloads/ZhengYangMenXia/'



def download_task(n):
    i = int(n)
    print('Task %d start!' % i)
    os.makedirs(savefile_path, exist_ok=True)   ## make sure the output folder exists
    data=requests.get(url_list[i]+index)

    lines=data.text.splitlines()            ## splitlines() also copes with \r\n line endings
    m3u8=lines[2]                           ## the third line of the first m3u8 is the path of the real m3u8
    m3u8_directory=re.sub(r"index\.m3u8$","", m3u8) ## keep the folder part of the path; used below

    data = requests.get(url_list[i] + m3u8)
    '''
    # Save the m3u8 file. Commented out, so the m3u8 content is not stored.
    file=open(savefile_path+str(i)+'//'+index, 'w')
    file.write(data.text)
    file.close()
    '''
    lines=data.text.splitlines()
    pattern=r".*\.ts"                       ## lines containing .ts are segment names
    # Write the ts segments (binary) one after another into a single file.
    # The segments are appended unchanged, so the result is an MPEG-TS stream with an .mp4 name.
    with open(os.path.join(savefile_path, '正阳门下-' + str(i).zfill(3) + '.mp4'), 'wb') as video:
        for line in lines:
            if re.match(pattern,line):
                print("Task %d Downloading %s..." % (i, line))
                data = requests.get(url_list[i]+m3u8_directory+line)
                video.write(data.content)
    print('Task %d done!' % i)



if __name__=='__main__':                    ## A multiprocessing program must have this main guard.
    pool=Pool(4)                            ## download with 4 processes
    for i in range(0,len(url_list)):
        pool.apply_async(download_task, args=(i,))
        print('Task %d has been submitted' % i)
    print('Waiting for all subprocesses done...')
    pool.close()
    pool.join()
    print('All subprocesses done.')
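
One weakness of the program above is that `requests.get` is called without a timeout or any retry, so a single stalled or failed segment can hang a worker or leave a hole in the video. The sketch below shows a small retry wrapper. The names `get_with_retry` and `flaky_fetch` are invented for illustration, and the fetch function is passed in so the example runs without network access.

```python
import time


def get_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url); on an exception, wait and try again.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == attempts:
                raise
            print('Attempt %d for %s failed (%s), retrying' % (attempt, url, exc))
            time.sleep(delay)


# Hypothetical fetch function that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError('temporary failure')
    return b'segment-bytes'


result = get_with_retry(flaky_fetch, 'https://example.invalid/000.ts', delay=0)
print(result, len(calls))
```

In the downloader, the fetch argument could be something like `lambda u: requests.get(u, timeout=10)`; the timeout value is a guess and would need tuning.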

This is how the code came together:

【1】Collect the data, in the form of a URL list:

## 36 URLs, one per episode
url_list=[
    'https://sina.com-h-sina.com/20180906/18215_c0fc8873/' ,
    'https://sina.com-h-sina.com/20180906/18216_80a7de01/' ,
    'https://sina.com-h-sina.com/20180906/18217_365c8b79/' ,
    'https://sina.com-h-sina.com/20180906/18218_f02026af/' ,
    'https://sina.com-h-sina.com/20180906/18219_7aca4012/' ,
    'https://sina.com-h-sina.com/20180906/18220_ed8429ba/' ,
    'https://cdn.youku-letv.net/20181130/11218_5aec4df0/' ,
    'https://cdn.youku-letv.net/20181130/11217_64a3d965/' ,
    'https://sina.com-h-sina.com/20180906/18223_5f07955a/' ,
    'https://sina.com-h-sina.com/20180906/18224_0f7d62f7/' ,
    'https://sina.com-h-sina.com/20180906/18225_70b7a36b/' ,
    'https://sina.com-h-sina.com/20180906/18226_e51fc698/' ,
    'https://sina.com-h-sina.com/20180906/18227_66271ec7/' ,
    'https://sina.com-h-sina.com/20180906/18228_026dee76/' ,
    'https://sina.com-h-sina.com/20180906/18229_371750e8/' ,
    'https://sina.com-h-sina.com/20180906/18230_997b5e5f/' ,
    'https://sina.com-h-sina.com/20180906/18231_88b8a296/' ,
    'https://sina.com-h-sina.com/20180906/18232_f20b2ca2/' ,
    'https://sina.com-h-sina.com/20180906/18233_3352578d/' ,
    'https://sina.com-h-sina.com/20180906/18234_347f6882/' ,
    'https://sina.com-h-sina.com/20180906/18235_44565aa7/' ,
    'https://cdn.youku-letv.net/20181130/11203_138f5930/' ,
    'https://sina.com-h-sina.com/20180906/18237_09096b18/' ,
    'https://sina.com-h-sina.com/20180906/18238_154e2c99/' ,
    'https://sina.com-h-sina.com/20180906/18239_e9fe736e/' ,
    'https://cdn.youku-letv.net/20181130/11199_aaf119a6/' ,
    'https://sina.com-h-sina.com/20180906/18241_e8d0a446/' ,
    'https://sina.com-h-sina.com/20180906/18242_ccfe1975/' ,
    'https://sina.com-h-sina.com/20180906/18243_316cafde/' ,
    'https://sina.com-h-sina.com/20180906/18244_c67bb56b/' ,
    'https://sina.com-h-sina.com/20180906/18245_f7707700/' ,
    'https://sina.com-h-sina.com/20180906/18246_01688d4e/' ,
    'https://sina.com-h-sina.com/20180906/18247_8bfc042c/' ,
    'https://sina.com-h-sina.com/20180906/18248_e85bfdc0/' ,
    'https://sina.com-h-sina.com/20180906/18249_b750a51f/' ,
    'https://cdn.youku-letv.net/20181130/11189_acdfba0e/'
]

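【2】First download a single episode as a trial (this is the step that confirmed it: download the ts files, concatenate them, and you get a complete video).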

# coding=utf8
import requests

url='https://sina.com-h-sina.com/20180906/18215_c0fc8873/800k/hls/f997a136b73'

## Segments are named <prefix>000.ts ... <prefix>674.ts
for i in range(0,675):
    data = requests.get(url + str(i).zfill(3) + '.ts')
    with open("E:/Downloads/tmp/%03d.ts" % i, 'wb') as f:   ## binary mode
        f.write(data.content)
    print('ts %03d OK' % i)
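
The trial script leaves the segments on disk as separate files (000.ts, 001.ts, ...); the concatenation step itself is not shown. A minimal sketch of one way to do it follows, assuming the zero-padded file names used above so that sorting by name gives playback order. `concat_segments` is a name invented here, and the demo uses tiny fake segments in a temporary folder rather than real video data.

```python
import glob
import os
import shutil
import tempfile


def concat_segments(segment_dir, output_path):
    """Append every .ts file in segment_dir to output_path, in name order."""
    paths = sorted(glob.glob(os.path.join(segment_dir, '*.ts')))
    with open(output_path, 'wb') as out:
        for path in paths:
            with open(path, 'rb') as seg:
                shutil.copyfileobj(seg, out)   # copy in chunks, not all at once
    return len(paths)


# Demo with three fake two-byte "segments".
with tempfile.TemporaryDirectory() as tmp:
    for i, payload in enumerate([b'aa', b'bb', b'cc']):
        with open(os.path.join(tmp, '%03d.ts' % i), 'wb') as f:
            f.write(payload)
    merged = os.path.join(tmp, 'merged.bin')
    count = concat_segments(tmp, merged)
    with open(merged, 'rb') as f:
        merged_bytes = f.read()

print(count, merged_bytes)
```

Without zero padding, a name sort would put 10.ts before 2.ts, so the fixed-width names matter here.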

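【3】Loop over the list: fetch each video's first m3u8, parse out the address of the second m3u8, fetch that, and finally parse it and fetch all the ts files.

This was also the first version I wrote: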

import requests
import re
import os

url_list=[
    'https://sina.com-h-sina.com/20180906/18215_c0fc8873/' ,
    'https://sina.com-h-sina.com/20180906/18216_80a7de01/' ,
    'https://sina.com-h-sina.com/20180906/18217_365c8b79/' ,
    'https://sina.com-h-sina.com/20180906/18218_f02026af/' ,
    'https://sina.com-h-sina.com/20180906/18219_7aca4012/' ,
    'https://sina.com-h-sina.com/20180906/18220_ed8429ba/' ,
    'https://cdn.youku-letv.net/20181130/11218_5aec4df0/' ,
    'https://cdn.youku-letv.net/20181130/11217_64a3d965/' ,
    'https://sina.com-h-sina.com/20180906/18223_5f07955a/' ,
    'https://sina.com-h-sina.com/20180906/18224_0f7d62f7/' ,
    'https://sina.com-h-sina.com/20180906/18225_70b7a36b/' ,
    'https://sina.com-h-sina.com/20180906/18226_e51fc698/' ,
    'https://sina.com-h-sina.com/20180906/18227_66271ec7/' ,
    'https://sina.com-h-sina.com/20180906/18228_026dee76/' ,
    'https://sina.com-h-sina.com/20180906/18229_371750e8/' ,
    'https://sina.com-h-sina.com/20180906/18230_997b5e5f/' ,
    'https://sina.com-h-sina.com/20180906/18231_88b8a296/' ,
    'https://sina.com-h-sina.com/20180906/18232_f20b2ca2/' ,
    'https://sina.com-h-sina.com/20180906/18233_3352578d/' ,
    'https://sina.com-h-sina.com/20180906/18234_347f6882/' ,
    'https://sina.com-h-sina.com/20180906/18235_44565aa7/' ,
    'https://cdn.youku-letv.net/20181130/11203_138f5930/' ,
    'https://sina.com-h-sina.com/20180906/18237_09096b18/' ,
    'https://sina.com-h-sina.com/20180906/18238_154e2c99/' ,
    'https://sina.com-h-sina.com/20180906/18239_e9fe736e/' ,
    'https://cdn.youku-letv.net/20181130/11199_aaf119a6/' ,
    'https://sina.com-h-sina.com/20180906/18241_e8d0a446/' ,
    'https://sina.com-h-sina.com/20180906/18242_ccfe1975/' ,
    'https://sina.com-h-sina.com/20180906/18243_316cafde/' ,
    'https://sina.com-h-sina.com/20180906/18244_c67bb56b/' ,
    'https://sina.com-h-sina.com/20180906/18245_f7707700/' ,
    'https://sina.com-h-sina.com/20180906/18246_01688d4e/' ,
    'https://sina.com-h-sina.com/20180906/18247_8bfc042c/' ,
    'https://sina.com-h-sina.com/20180906/18248_e85bfdc0/' ,
    'https://sina.com-h-sina.com/20180906/18249_b750a51f/' ,
    'https://cdn.youku-letv.net/20181130/11189_acdfba0e/'
]
## [Note] For 正阳门下, the first m3u8 points to another m3u8 rather than directly to the ts media segments
index='index.m3u8'
savefile_path='E:/Downloads/ZhengYangMenXia/'

for i in range(0,len(url_list)):
    episode_dir=savefile_path+str(i)+'/'
    os.makedirs(episode_dir, exist_ok=True) ## one folder per episode; created if missing
    data=requests.get(url_list[i]+index)

    lines=data.text.splitlines()            ## splitlines() also copes with \r\n line endings
    m3u8=lines[2]                           ## the third line of the first m3u8 is the path of the real m3u8
    m3u8_directory=re.sub(r"index\.m3u8$","", m3u8) ## keep the folder part of the path; used below

    data = requests.get(url_list[i] + m3u8)
    '''
    # Save the m3u8 file. Commented out, so it is not stored.
    file=open(episode_dir+index, 'w')
    file.write(data.text)
    file.close()
    '''
    lines=data.text.splitlines()
    pattern=r".*\.ts"                       ## lines containing .ts are segment names
    # Write the ts segments (binary) one after another into a single file
    with open(episode_dir + '正阳门下-' + str(i).zfill(3) + '.mp4', 'wb') as video:
        for line in lines:
            if re.match(pattern,line):
                print("Downloading %s..." % line)
                data = requests.get(url_list[i]+m3u8_directory+line)
                video.write(data.content)

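【4】Downloading in a single process was not fast enough, so I switched to multiple processes. That is how the first runnable listing above came about. (See [5].)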
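
One thing worth knowing about `pool.apply_async`: an exception raised inside a task is only re-raised when `.get()` is called on the returned result object, so in the multi-process listing a failed episode passes silently. The sketch below shows one way to surface failures. Note that it swaps in `concurrent.futures.ThreadPoolExecutor` (threads rather than the processes used in the article), which is a reasonable fit for network-bound downloads but is a substitution, not the original method. `fake_download` and `run_all` are names invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(i):
    """Hypothetical stand-in for download_task: episode 2 fails."""
    if i == 2:
        raise IOError('episode %d failed' % i)
    return i


def run_all(task, count, workers=4):
    """Run task(0..count-1) concurrently; return (done, failed) index lists."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, i): i for i in range(count)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                future.result()      # re-raises any exception from the task
                done.append(i)
            except Exception as exc:
                print('Task %d failed: %s' % (i, exc))
                failed.append(i)
    return sorted(done), sorted(failed)


done, failed = run_all(fake_download, 5)
print(done, failed)
```

The list of failed indices can then be retried or reported instead of silently producing missing or truncated files.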

【References】

[1] Wikipedia's introduction to m3u8: https://wikipedia.hk.wjbk.site/baike-M3U

[2] Downloading video with the ffmpeg command on Linux: https://dzt666.cn/index.php/archives/203/

[3] Downloading with M3U8 Downloader on Windows: https://blog.csdn.net/yjclsx/article/details/88948860

[4] An introduction to m3u8 and ts media segments: https://www.jianshu.com/p/4f6db32fc76a

[5] Python multiprocessing: https://www.liaoxuefeng.com/wiki/1016959663602400/1017628290184064

[6] A reference that gave me some ideas: https://blog.csdn.net/angry_mills/article/details/82705595

Reposted from www.cnblogs.com/i-am-normal/p/11624225.html