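Contents

I. Preface
II. Approach
III. Workflow

Environment
Preparation
1.1 Scraping a single page of comments
1.3 Filtering the data
1.4 Scraping everything
1.5 Data analysis
Keyword cloud
Word cloud

Full code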

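I. Preface

Recently, the controversy over Bilibili uploader ~~cram阿强~~ (假吃强, "Fake-Eating Qiang") and his fake eating videos, cyberbullying of passers-by, egging fans on to insult people, and lying to viewers has kept escalating. I was put off by these antics for quite a while myself. As a humble code porter I can't vent by swearing, so I decided to use Python to scrape the comments on one of 假吃强's videos and analyze them, to see what the public actually thinks of him.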

What else is there to say? I can only hope that 阿强 ...

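II. Approach

From what I found, Bilibili serves video comments as JSON. So the plan is: scrape the comments and save them locally, process the JSON data, then run text analysis (word segmentation and frequency counting) on the comment text and turn the result into visualizations.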

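III. Workflow

Environment

IDE: PyCharm

API testing tool: Postman

Preparation

The video to scrape is:

This is 假吃强's most representative video, with 105,469 comments in total.

Scraping the comments of one video requires the following parameters:

pn: the page number of the comment page

oid: the video's oid

On the video page, press F12 to open the developer tools, switch to the Network tab, and select the following file:

It shows a URL, which is the endpoint that returns the video's comments. It should not be used as-is, though, because some of its parameters are unnecessary.
The URL gives this video's oid as 583574337. To fetch the first page, set pn=1 and change the endpoint URL to: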

https://api.bilibili.com/x/v2/reply?=&=&pn=1&type=1&oid=583574337&sort=2&_=1595903965207
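The same URL can also be assembled in code instead of by hand. The sketch below is only an illustration: `build_reply_url` is a hypothetical helper name, the timestamp-like `_` parameter is left out on the assumption that it is optional, and `type=1` and `sort=2` are simply copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above
REPLY_API = "https://api.bilibili.com/x/v2/reply"


def build_reply_url(oid, pn, sort=2):
    """Return the comment endpoint URL for one page of one video.

    Hypothetical helper for illustration; type=1 and sort=2 are copied
    from the URL in the article without further interpretation.
    """
    query = urlencode({"pn": pn, "type": 1, "oid": oid, "sort": sort})
    return "%s?%s" % (REPLY_API, query)


if __name__ == "__main__":
    print(build_reply_url("583574337", 1))
```

Building the query with `urlencode` keeps the parameter names and values in one place, so only `pn` has to change when moving to the next page.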

Test it with Postman:

As you can see, it works. As an aside, 假吃强's talent for deleting comments really is outstanding.

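1.1 Scraping a single page of comments

With the endpoint and oid=583574337 known, all that is left is to loop over the page numbers, changing the pn parameter each time, to collect every comment.

Code

import json

import requests


# Fetch one page of comments for the given video and save the raw JSON
def get_comment(oid, pn):
    url = 'https://api.bilibili.com/x/v2/reply?=&=&pn=%s&type=1&oid=%s&sort=2' % (pn, oid)
    res = requests.get(url)
    # Parse the response body into a dict
    data = res.json()
    file_name = '%s-%s.json' % (oid, pn)
    # Write the file as UTF-8
    with open(file_name, "w", encoding='utf-8') as fp:
        # indent sets the JSON indentation; ensure_ascii=False keeps Chinese characters unescaped
        fp.write(json.dumps(data, indent=4, ensure_ascii=False))

if __name__ == '__main__':
    # Arguments are (oid, pn): the video's oid first, then the page number
    get_comment('583574337', '1')

Run result

The first page of comments on this 假吃强 video has now been scraped and saved to a local JSON file.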

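1.3 Filtering the data

The JSON file obtained above contains a lot of useless information, so the next step is to extract the comments and save them to a .txt file.

import json


def filter_data(file):
    # Load the JSON file into a dict
    with open(file, 'r', encoding='utf8') as fp:
        data = json.load(fp)
    # Fall back to an empty list if the page carries no comment list
    comment_list = data["data"]["replies"] or []
    for i in comment_list:
        # Text of the top-level comment
        first_comment = i["content"]["message"]
        # Strip line breaks so each comment stays on one line
        first_comment = first_comment.replace('\n', '').replace('\r', '')
        print(first_comment)

As you can see, this code only picks up the top-level comments and misses the replies to them, so it is changed to: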

def filter_data(file):
    # Load <file>.json into a dict; the with block closes the file on exit
    with open(file + '.json', 'r', encoding='utf8') as fp:
        data = json.load(fp)["data"]
    # Fall back to an empty list if the page carries no comment list
    comment_list = data["replies"] or []

    with open(file + '.txt', 'w', encoding='utf-8') as f:
        for i in comment_list:
            # Text of the top-level comment
            first_comment = i["content"]["message"]
            # Strip line breaks so each comment stays on one line
            first_comment = first_comment.replace('\n', '').replace('\r', '')
            f.write(first_comment + '\n')
            # Replies to this comment; the field is not always a list, so check first
            comments = i.get("replies")
            if isinstance(comments, list):
                for j in comments:
                    c = j["content"]["message"]
                    c = c.replace('\n', '').replace('\r', '')
                    f.write(c + '\n')

Note that the replies collected by the code above are only the top three replies by popularity for each comment.

The first page of comments on this 假吃强 video, together with the top three replies to each comment, has now been scraped and written to a file.
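The extraction step can also be separated from file handling, which makes it easy to try on a small hand-written sample. The following is a minimal sketch under the assumption that a page has the shape used above (`data` → `replies` → `content` → `message`, with replies nested one level under each comment); `extract_messages` and `clean` are hypothetical names, not part of the original code.

```python
def clean(text):
    """Remove line breaks so that one comment maps to one output line."""
    return text.replace("\n", "").replace("\r", "")


def extract_messages(page):
    """Return the text of every top-level comment followed by its inlined replies.

    Hypothetical helper; assumes the page layout used in this article.
    Missing or null "data"/"replies" fields are treated as empty.
    """
    messages = []
    data = page.get("data") or {}
    for comment in data.get("replies") or []:
        messages.append(clean(comment["content"]["message"]))
        for reply in comment.get("replies") or []:
            messages.append(clean(reply["content"]["message"]))
    return messages


if __name__ == "__main__":
    # Made-up sample in the same shape as a saved page
    sample = {
        "data": {
            "replies": [
                {
                    "content": {"message": "first\ncomment"},
                    "replies": [{"content": {"message": "a reply"}, "replies": None}],
                },
                {"content": {"message": "second comment"}, "replies": None},
            ]
        }
    }
    for line in extract_messages(sample):
        print(line)
```

Returning a list instead of writing directly to a file leaves the caller free to print the comments, append them to a .txt file, or feed them straight into the analysis step.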

1.4 Scraping everything

Scrape all 2334 pages of comments and write them to a txt file.

import json
import os

import requests


def get_json(oid, pn):
    url = 'https://api.bilibili.com/x/v2/reply?=&=&pn=%s&type=1&oid=%s&sort=2' % (pn, oid)
    res = requests.get(url)
    # Parse the response body into a dict
    data = res.json()
    file_name = '%s-%s.json' % (oid, pn)
    # Make sure the output directory exists before writing into it
    os.makedirs('json', exist_ok=True)
    # Write the file as UTF-8
    with open('json/' + file_name, "w", encoding='utf-8') as fp:
        # indent sets the JSON indentation; ensure_ascii=False keeps Chinese characters unescaped
        fp.write(json.dumps(data, indent=4, ensure_ascii=False))
    print("page " + pn + " scraped")


def filter_data(oid, pn):
    # Load the saved page into a dict; the with block closes the file on exit
    with open('json/%s-%s.json' % (oid, pn), 'r', encoding='utf8') as fp:
        data = json.load(fp)["data"]
    # Fall back to an empty list if the page carries no comment list
    comment_list = data["replies"] or []

    # Append, so that every page ends up in the same all.txt
    with open('all.txt', 'a', encoding='utf-8') as f:
        for i in comment_list:
            # Text of the top-level comment
            first_comment = i["content"]["message"]
            # Strip line breaks so each comment stays on one line
            first_comment = first_comment.replace('\n', '').replace('\r', '')
            f.write(first_comment + '\n')
            # Replies to this comment; the field is not always a list, so check first
            comments = i.get("replies")
            if isinstance(comments, list):
                for j in comments:
                    c = j["content"]["message"]
                    c = c.replace('\n', '').replace('\r', '')
                    f.write(c + '\n')


if __name__ == '__main__':
    for i in range(1, 2335):
        j = str(i)
        get_json('583574337', j)
        filter_data('583574337', j)
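The upper bound 2335 in the loop above is hard-coded. One possible alternative is to derive the page count from the first response. The sketch below assumes that each page carries a `data.page` object with `count` (total top-level comments) and `size` (comments per page); the article does not confirm those field names, so check a saved JSON file before relying on them. `pages_from_response` is a hypothetical helper.

```python
import math


def pages_from_response(first_page):
    """Return how many pages are needed to cover all top-level comments.

    Assumption (not confirmed by the article): the response contains
    data.page.count and data.page.size.
    """
    page_info = first_page["data"]["page"]
    count = page_info["count"]
    size = page_info["size"]
    if size <= 0:
        raise ValueError("page size must be positive")
    # Round up so that a partially filled last page is still fetched
    return math.ceil(count / size)


if __name__ == "__main__":
    # Made-up numbers, only to show the rounding
    fake_first_page = {"data": {"page": {"count": 45, "size": 20}}}
    print(pages_from_response(fake_first_page))
```

With such a helper the main loop could read `range(1, pages_from_response(first_page) + 1)`, which keeps working if comments are added or deleted between runs.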

After a long wait...

Phew, that was not easy, but it finally finished.
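A crawl of a couple of thousand requests can fail halfway through on a single network hiccup. As a hedged sketch (not something the original script does), each request could be wrapped in a small retry helper with a growing pause between attempts. `fetch_with_retry` is a hypothetical name, and it takes any zero-argument callable so it can be tried without touching the network.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it succeeds or `retries` attempts have been made.

    Sleeps delay * attempt_number seconds between attempts and re-raises
    the last error if every attempt fails. Catching Exception is a
    simplification; a real crawler would catch only network errors.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            # No need to wait after the final failed attempt
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))
    raise last_error


if __name__ == "__main__":
    attempts = []

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path
        attempts.append(1)
        if len(attempts) < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    print(fetch_with_retry(flaky, retries=3, delay=0))
```

In the main loop this could be used as `fetch_with_retry(lambda: get_json('583574337', j))`, optionally with a short `time.sleep` between pages to go easier on the server.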

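Originally I wrote each JSON file to its own txt file, but since the text analysis needs everything in one place, I changed the code to write all comments into a single txt file.

...That many?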

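1.5 Data analysis

Keyword cloud

import wordcloud


def analyze_txt():
    # Maps each character to the number of times it occurs
    num = {}
    # Characters excluded from the count (common function words, plus 阿 and 强)
    word = ['的', '得', '不', '人', '了', '我', '是', '你', '这', '就', '有', '一', '在', '还', '他', '么', '阿', '强']
    with open('all.txt', 'r', encoding='UTF-8') as text:
        for line in text:
            for i in line:
                # Count only Chinese characters that are not excluded
                if u'\u4e00' <= i <= u'\u9fa5' and i not in word:
                    # Start at 0 for a character seen for the first time
                    num[i] = num.get(i, 0) + 1
    # Draw the cloud
    wc = wordcloud.WordCloud(
        # A font with Chinese glyphs is required
        font_path='simsun.ttc',
        max_words=1000,
        max_font_size=2000,
        # Canvas size and background color
        width=1000,
        height=880,
        background_color="white"
    )
    wc.generate_from_frequencies(num)
    wc.to_file("jcqbs.jpg")

Sigh, 假吃 shows up without its 强, because 强 is excluded from the count.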
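The counting loop can be written more compactly with `collections.Counter` and tried on a few lines of text without the scraped file or the wordcloud package. This is a minimal sketch; `count_chinese_chars` is a hypothetical name, and it uses the same character range check as the code above.

```python
from collections import Counter


def count_chinese_chars(lines, excluded=()):
    """Count Chinese characters in an iterable of strings, skipping excluded ones.

    Hypothetical helper; uses the same U+4E00..U+9FA5 range check as
    analyze_txt.
    """
    skip = set(excluded)
    counts = Counter()
    for line in lines:
        counts.update(
            ch for ch in line
            if '\u4e00' <= ch <= '\u9fa5' and ch not in skip
        )
    return counts


if __name__ == "__main__":
    sample = ["假吃强 doge", "假吃"]
    # 强 is excluded here, mirroring the exclusion list above
    print(count_chinese_chars(sample, excluded=["强"]).most_common())
```

A `Counter` is a dict subclass, so it can be handed to `generate_from_frequencies` in the same way as the plain dict used above. Dropping 强 from the exclusion list is all it takes to let it show up again.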

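Word cloud

import jieba
import matplotlib.pyplot as plt
import wordcloud


def analyze_word():
    # Read the whole file; the with block closes it afterwards
    with open('all.txt', 'r', encoding='UTF-8') as fp:
        f = fp.read()

    # Segment with jieba and join with spaces; wordcloud cannot split Chinese text into words by itself
    cut_text = " ".join(jieba.cut(f))

    wc = wordcloud.WordCloud(
        # Set a Chinese font, otherwise the glyphs render as empty boxes; this is the usual Windows font path and can be swapped for another font
        font_path="C:/Windows/Fonts/simfang.ttf",
        # Canvas size and background color
        background_color="white", width=1000, height=880).generate(cut_text)
    wc.to_file("jcqbl.jpg")

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")

    plt.show()

However big 阿强 gets, doge still draws more attention.

And it came out green. Could that mean...?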


I tweaked the program a little; I was nearly played by the "best actor".
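The article does not say what was changed in the program. Purely as an illustration of one kind of adjustment that is often useful at this stage (and not a reconstruction of the author's fix), the segmented tokens can be filtered before they reach the word cloud. In the sketch below the token list stands in for the output of `jieba.cut`, and `word_frequencies` and its parameters are hypothetical.

```python
from collections import Counter


def word_frequencies(tokens, stop_words=(), min_length=2):
    """Count tokens, dropping stop words, blanks and very short tokens.

    Hypothetical helper; `tokens` can be any iterable of strings, for
    example the result of jieba.cut(text).
    """
    stop = set(stop_words)
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) >= min_length and token not in stop:
            counts[token] += 1
    return counts


if __name__ == "__main__":
    # Hand-written tokens standing in for real segmentation output
    tokens = ["阿强", "的", "doge", " ", "阿强", "视频"]
    print(word_frequencies(tokens, stop_words=["视频"]))
```

The resulting mapping can be passed to `generate_from_frequencies`, as in the keyword cloud, instead of calling `generate` on the joined string.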

Full code

import json
import os

import jieba
import matplotlib.pyplot as plt
import requests
import wordcloud


def get_json(oid, pn):
    url = 'https://api.bilibili.com/x/v2/reply?=&=&pn=%s&type=1&oid=%s&sort=2' % (pn, oid)
    res = requests.get(url)
    # Parse the response body into a dict
    data = res.json()
    file_name = '%s-%s.json' % (oid, pn)
    # Make sure the output directory exists before writing into it
    os.makedirs('json', exist_ok=True)
    # Write the file as UTF-8
    with open('json/' + file_name, "w", encoding='utf-8') as fp:
        # indent sets the JSON indentation; ensure_ascii=False keeps Chinese characters unescaped
        fp.write(json.dumps(data, indent=4, ensure_ascii=False))
    print("page " + pn + " scraped")


def filter_data(oid, pn):
    # Load the saved page into a dict; the with block closes the file on exit
    with open('json/%s-%s.json' % (oid, pn), 'r', encoding='utf8') as fp:
        data = json.load(fp)["data"]
    # Fall back to an empty list if the page carries no comment list
    comment_list = data["replies"] or []

    # Append, so that every page ends up in the same all.txt
    with open('all.txt', 'a', encoding='utf-8') as f:
        for i in comment_list:
            # Text of the top-level comment
            first_comment = i["content"]["message"]
            # Strip line breaks so each comment stays on one line
            first_comment = first_comment.replace('\n', '').replace('\r', '')
            f.write(first_comment + '\n')
            # Replies to this comment; the field is not always a list, so check first
            comments = i.get("replies")
            if isinstance(comments, list):
                for j in comments:
                    c = j["content"]["message"]
                    c = c.replace('\n', '').replace('\r', '')
                    f.write(c + '\n')


def analyze_txt():
    # Maps each character to the number of times it occurs
    num = {}
    # Characters excluded from the count (common function words, plus 阿 and 强)
    word = ['的', '得', '不', '人', '了', '我', '是', '你', '这', '就', '有', '一', '在', '还', '他', '么', '阿', '强']
    with open('all.txt', 'r', encoding='UTF-8') as text:
        for line in text:
            for i in line:
                # Count only Chinese characters that are not excluded
                if u'\u4e00' <= i <= u'\u9fa5' and i not in word:
                    # Start at 0 for a character seen for the first time
                    num[i] = num.get(i, 0) + 1
    # Draw the cloud
    wc = wordcloud.WordCloud(
        # A font with Chinese glyphs is required
        font_path='simsun.ttc',
        max_words=1000,
        max_font_size=2000,
        # Canvas size and background color
        width=1000,
        height=880,
        background_color="white"
    )
    wc.generate_from_frequencies(num)
    wc.to_file("jcqbs.jpg")


def analyze_word():
    # Read the whole file; the with block closes it afterwards
    with open('all.txt', 'r', encoding='UTF-8') as fp:
        f = fp.read()

    # Segment with jieba and join with spaces; wordcloud cannot split Chinese text into words by itself
    cut_text = " ".join(jieba.cut(f))

    wc = wordcloud.WordCloud(
        # Set a Chinese font, otherwise the glyphs render as empty boxes; this is the usual Windows font path and can be swapped for another font
        font_path="C:/Windows/Fonts/simfang.ttf",
        # Canvas size and background color
        background_color="white", width=1000, height=880).generate(cut_text)
    wc.to_file("jcqbl.jpg")

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")

    plt.show()

if __name__ == '__main__':
    # Left empty in the original; call get_json / filter_data for each page, then analyze_txt() or analyze_word()
    pass

Data analysis reveals the true face of Bilibili uploader 假吃强 (Cram阿强): the comments