Douyu Live-Stream Danmaku Crawler

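Douyu's danmaku (bullet-comment) server does not work like an ordinary web page: clients talk to it over a socket connection, using a danmaku protocol that Douyu defines.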

The process is as follows:

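Send a room login request to the Douyu danmaku server and wait for its reply; send a join-group request and wait for its reply; after that the server keeps pushing danmaku messages. The server requires a heartbeat packet every 45 s (the code in this post sends one every 40 s to stay inside that window).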
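
As a rough sketch of what those requests look like on the wire: the layout here is inferred from the `sendmsg` function in this post's source code, not from the official protocol document. Each message is UTF-8 text ending in `/\x00`, preceded by a 12-byte header: the length written twice as 4-byte little-endian integers, the 2-byte message type 689, and two zero bytes. The helper names `pack` and `handshake_messages` and the `guest` credentials are made up for illustration.

```python
import struct

CLIENT_MSG_TYPE = 689  # message type that the post's sendmsg() puts in every packet


def pack(body: str) -> bytes:
    """Frame one text message the same way the post's sendmsg() does."""
    payload = body.encode('utf-8')
    # The length covers the payload plus 8 header bytes
    # (second length field, 2-byte type, and two zero bytes).
    length = len(payload) + 8
    header = struct.pack('<IIHBB', length, length, CLIENT_MSG_TYPE, 0, 0)
    return header + payload


def handshake_messages(roomid: str) -> list:
    """Return the login packet and the join-group packet, in sending order."""
    # Placeholder credentials: the post says username and password can be anything.
    login = 'type@=loginreq/username@=guest/password@=guest/roomid@={}/\x00'.format(roomid)
    join = 'type@=joingroup/rid@={}/gid@=-9999/\x00'.format(roomid)
    return [pack(login), pack(join)]


packets = handshake_messages('102411')
print(packets[0][:12].hex())  # the 12 header bytes of the login packet
```

Building the header with `struct.pack` makes the field widths explicit, which is easier to check against the protocol document than a hand-written byte string.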

For the details of the protocol, search for the Douyu danmaku server protocol (斗鱼弹幕服务协议).

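I picked 8 rooms and looked up each room id (for a few rooms the id does not appear in the address bar; the browser tab title usually shows the streamer name and room id), then gave each one a streamer name (anything you like; it is only used to name the output file).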

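Each room's crawler runs as its own process; once all 8 processes have been started, the main process exits. Because inter-process communication is a hassle, each process instead checks on every loop iteration whether a file named '运行.txt' exists in the current directory. Delete that file and the child processes stop. Before the next run, create the file again by hand.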
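
The flag-file idea can be tried in isolation. The following is a minimal sketch, not the author's code: `run_until_flag_removed` is a hypothetical helper, and `run.txt` in a temporary directory stands in for '运行.txt'.

```python
import os
import tempfile
import time


def run_until_flag_removed(flag_path, work, poll_interval=0.0):
    """Call work(i) repeatedly while flag_path exists; return the number of calls."""
    iterations = 0
    while os.path.exists(flag_path):
        work(iterations)
        iterations += 1
        time.sleep(poll_interval)
    return iterations


# Demo: the work function deletes the flag on its third call, which ends the loop.
with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, 'run.txt')  # hypothetical stand-in for the post's flag file
    open(flag, 'w').close()

    def work(i):
        if i == 2:
            os.remove(flag)

    count = run_until_flag_removed(flag, work)
    print(count)
```

In the post's code the check runs once per loop iteration and `recv` blocks, so a process only notices the missing file after the next chunk of data arrives.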

The source code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu Apr 26 15:15:38 2018

@author: 蚂蚁不在线

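Multi-process mode: each process handles the danmaku of one live room. Supply the room id and the room URL; the two are usually the same, but a few rooms have URLs made of letters.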
"""
import socket
import time
import datetime
import multiprocessing
import re
import os
#### Send a message
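def sendmsg(client,msgstr):
    msg=msgstr.encode('utf-8')
    ## Length field: message body plus 8 header bytes
    data_length=len(msg)+8
    ## Message type 689, written as 2 little-endian bytes followed by two zero bytes
    code=689
    msg=int.to_bytes(data_length,4,'little')*2+int.to_bytes(code,2,'little')+b'\x00\x00'+msg
    ## sendall keeps sending until the whole packet has been written
    client.sendall(msg)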

#### Send a heartbeat once the deadline t0 has passed; return the next deadline
def keeplive(t,t0,client):
    if int(time.time())>t0:
        msg='type@=keeplive/tick@=' + str(int(time.time())) + '/\x00'
        sendmsg(client,msg)
        print('Running for',int(time.time())-t,'s')
        return t0+40
    else:
        return t0
       
def start(roomid,zbname):
    #### Connect to the danmaku server
    client=socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    host=socket.gethostbyname("openbarrage.douyutv.com")
    port=8601
    client.connect((host,port))
    #### Request to join the room; username and password can be anything
    msg='type@=loginreq/username@=蚂蚁不在线/password@=就看看/roomid@={}/\x00'.format(roomid)
    sendmsg(client,msg)
    client.recv(1024)
    #### Join the group that receives all danmaku
    msg_more='type@=joingroup/rid@=%s/gid@=-9999/\x00'%roomid
    sendmsg(client,msg_more)
    client.recv(1024)
    #### Start time, and the time at which the next heartbeat is due
    t=int(time.time())
    t0=int(time.time())+40
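    flg=True
    buf=b''  # received bytes that have not been parsed yet
    while flg:
        #### Keep running only while '运行.txt' exists in the current directory
        flg=os.path.exists('运行.txt')
        data=client.recv(1024)
        if not data:
            break  # the server closed the connection
        buf+=data
        mlist=re.findall(b'type@=(.+?)/\x00',buf)
        #### Keep any incomplete trailing message for the next recv
        end=buf.rfind(b'/\x00')
        if end!=-1:
            buf=buf[end+2:]
        for i in mlist:
            msgd={}
            j=b'type@='+i
            m=j.split(b'/')
            for n0 in m:
                n=n0.split(b'@=')
                try:
                    #### Undo the escaping: '@S' stands for '/', '@A' stands for '@'
                    msgd[n[0].decode()]=n[1].replace(b'@S',b'/').replace(b'@A',b'@').decode()
                except (IndexError,UnicodeDecodeError):
                    pass
            #### Record chat messages
            if msgd.get('type')=='chatmsg':
                try:
                    line=','.join([msgd['type'],msgd['uid'],msgd['nn'],msgd['txt'],msgd['level'],
                                   datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')])
                    with open('%s.txt'%(roomid+'_'+zbname),'a',encoding='utf-8') as f:
                        f.write(line+'\n')
                except (KeyError,OSError):
                    pass  # skip messages with missing fields or failed writes
        t0=keeplive(t,t0,client)
    client.close()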
if __name__=='__main__':
    roomid=['102411','606118','1126960','475252','846805','288016','12313','71017']
    zbname=['神超','大司马','余小c','孙悟空','赏金术士','lpl','叶音符','冯提莫']
    for i in range(len(roomid)):
        p1=multiprocessing.Process(target=start,args=(roomid[i],zbname[i],))
        p1.start()
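
The parsing step inside `start` can also be pulled out and tried offline, without a connection. The following is a minimal sketch that mirrors the regular expression and the `@S` / `@A` unescaping used above; the function name `parse_messages` and the sample packet are made up for illustration.

```python
import re


def parse_messages(data: bytes) -> list:
    """Split raw bytes into messages and decode each into a dict of str -> str."""
    messages = []
    for raw in re.findall(b'type@=(.+?)/\x00', data):
        fields = {}
        for item in (b'type@=' + raw).split(b'/'):
            key, sep, value = item.partition(b'@=')
            if not sep:
                continue  # not a key@=value pair
            # Undo the escaping: '@S' stands for '/', '@A' stands for '@'.
            # The order matters: '@S' has to be replaced first.
            value = value.replace(b'@S', b'/').replace(b'@A', b'@')
            fields[key.decode('utf-8', 'replace')] = value.decode('utf-8', 'replace')
        messages.append(fields)
    return messages


# A made-up packet: 12 placeholder header bytes followed by one chatmsg body.
sample = (b'\x00' * 12 +
          'type@=chatmsg/uid@=1/nn@=tester/txt@=1@S2 @Ahome/level@=5/\x00'.encode('utf-8'))
print(parse_messages(sample))
```

Keeping the parser as a pure function of bytes makes it possible to check the escaping rules against recorded data before pointing the crawler at a live room.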

Reposted from www.cnblogs.com/offline-ant/p/9235504.html