Freeloading all the way? A simple crawler for downloading Bilibili videos

Background

As a big Bilibili fan, I have to watch a video every night before bed and every morning after getting up. I usually watch on my phone, but yesterday I was on my computer and wanted to download a video, only to find that the web version has no offline caching feature. So I reluctantly wrote a video downloader myself. (I actually wrote this last night and even went to the trouble of recording a video about it and posting it to Bilibili. Today I was told that it violated the content rules, which left me helpless.)

Step-by-step implementation

Method One

I started from the video's page, pressed F12, and looked for the video URL.

It turned out to be a blob: URL rather than a plain file address. There are plenty of blog posts online about "decrypting" blob URLs, but they all end up fetching a large number of .ts segments and stitching them together. Is there no way to get the video directly? I opened the Network panel to see what was being loaded while the video played and found many requests for .m4s files.

The responses to these requests are unreadable binary, which is clearly the video data. So all we need to do is request such a URL and save the response to a local file. That leaves the question of where the URL comes from, and it turns out that it is embedded in the page itself: pressing Ctrl+F in the page source and searching for m4s finds it.

From there, a regular expression is enough to pull the URL out. At first I saved the file with the .m4s extension, but my computer could not open it, so I save it with the .flv extension instead (only the file name changes, not the data).
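As an alternative to matching the URL with a regular expression, the embedded object can be parsed as JSON. The following is only a sketch: the page is known to contain `window.__playinfo__={...}` with `baseUrl` fields, but the `data` / `dash` / `video` / `audio` layout used here is an assumption about the page at the time of writing, and the function names are made up for illustration.

```python
import json
import re


def extract_playinfo(html):
    """Return the window.__playinfo__ object embedded in the page as a dict."""
    match = re.search(r'window\.__playinfo__\s*=\s*', html)
    if match is None:
        raise ValueError("window.__playinfo__ not found in page")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (e.g. the closing </script> tag).
    playinfo, _ = json.JSONDecoder().raw_decode(html, match.end())
    return playinfo


def first_stream_urls(playinfo):
    """Return (video_url, audio_url); audio_url is None if no audio is listed.

    ASSUMPTION: the streams live under data -> dash -> video / audio.
    """
    dash = playinfo["data"]["dash"]
    video_url = dash["video"][0]["baseUrl"]
    audio = dash.get("audio") or []
    audio_url = audio[0]["baseUrl"] if audio else None
    return video_url, audio_url
```

Parsing the whole object avoids depending on the order of fields in the page, and it exposes the audio stream separately if the page lists one.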

Method Two

The overall idea is the same as above, except that this time I found an API in the Network panel.

So the second implementation is to request this API, extract the url from the JSON it returns, then request that url and save the response to a file.
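The extraction step of this second method could be sketched as follows. Since the exact layout of the API response is not reproduced here, this hypothetical helper does not assume one: it simply walks the decoded JSON and collects every string stored under a key named `url`.

```python
def collect_urls(node, key="url"):
    """Recursively collect every string stored under `key` in a JSON-like value."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                # Keep descending into nested dicts and lists.
                found.extend(collect_urls(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_urls(item, key))
    return found
```

In practice the input would be the result of `requests.get(api_url, headers=...).json()`, where `api_url` stands for the address seen in the Network panel and the headers have to be adapted to that host.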

Code

The code below implements the first method. The second is just as simple and you can try it yourself; just pay attention to parameters such as the request headers.

import requests
import re

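def get_html(url):
    # Fetch the video page; headers1 is defined in the __main__ block below
    return requests.get(url,headers=headers1).text

def parse(html):
    # Use the page title as the file name. The stream is really m4s,
    # but my computer could not open that, so flv is used instead.
    video_name=re.findall(r'<span class="tit">(.*?)</span>',html,re.S)[0]+'.flv'
    print("Crawling "+video_name+"...")
    # The first baseUrl inside window.__playinfo__ is the video stream (no audio)
    video_url=re.findall(r'window.__playinfo__={.*?"baseUrl":"(.*?)".*?}',html,re.S)[0]
    return video_url,video_name

def download(videourl,video_name):
    # Stream the response in chunks so the whole video is never held in memory
    response=requests.get(videourl,headers=headers2,stream=True)
    response.raise_for_status()
    # The with block closes the file itself, so no explicit f.close() is needed
    with open(video_name,'wb') as f:
        for chunk in response.iter_content(chunk_size=1024*1024):
            f.write(chunk)
    print("Video download complete!")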


if __name__ == '__main__':
    avid=input("Enter the id of the video to crawl: ")
    base_url=f'https://www.bilibili.com/video/av{avid}'
    # Headers for the page request
    headers1={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        'Host': 'www.bilibili.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # 'br' is left out: requests can only decode Brotli if an extra package is installed
        'Accept-Encoding':'gzip, deflate',
        'Accept-Language':'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }

    # Headers for the video request. No hard-coded Host here: the CDN host
    # differs between videos, and requests fills it in from the URL.
    headers2={
        'Accept-Encoding':'identity',
        'Accept-Language':'zh-CN,zh;q=0.9',
        'Origin':'https://www.bilibili.com',
        'Referer':base_url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    html=get_html(base_url)
    videourl,videoname=parse(html)
    download(videourl,videoname)
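One thing the script does not handle is that a video title may contain characters that are not allowed in file names, in which case opening the output file fails. A minimal sketch of a fix, using a hypothetical helper `safe_filename` that is not part of the original code:

```python
import re


def safe_filename(title, extension=".flv"):
    """Turn a video title into a file name that is safe on common file systems."""
    # Replace characters that Windows forbids in file names, plus line breaks and tabs.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Trailing dots and spaces are also problematic on Windows.
    cleaned = cleaned.strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "video") + extension
```

It could be applied to the title matched in `parse` before `'.flv'` is appended.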

Result

Shortcomings and improvements: the downside is that the downloaded video has no sound. I still need to work out how to crawl the audio track and merge it with the video. Beyond that, one could match every video id on the home page and loop over them to crawl all of the videos, and multithreading could also go on the to-do list.
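For the missing sound, one possible approach (not what the script currently does) is to download the audio stream to a second file and let the external tool ffmpeg combine the two without re-encoding. The sketch below assumes that ffmpeg is installed and on the PATH; the function names and file names are placeholders.

```python
import shutil
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Build the ffmpeg call that copies both streams into one output file."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # first input: the silent video
        "-i", audio_path,  # second input: the audio track
        "-c", "copy",      # copy the streams as they are, no re-encoding
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg to merge the two files; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_merge_command(video_path, audio_path, output_path), check=True)
```

A call such as `merge("video.m4s", "audio.m4s", "output.mp4")` would then produce a single file with sound.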

In closing

Freeloading is only a joke, and crawling in a loop is not advisable either: do not put too much pressure on the server. It is not easy for Bilibili uploaders to make their videos, and we can learn a lot from their rich, high-quality content. So no more "next time for sure", and no joining the freeloaders!


Origin blog.csdn.net/shelgi/article/details/104228656