Scraping Douban Movie Comments

It's been a while since I last wrote a scraper, so when I was suddenly asked to scrape Douban today I was a bit lost. But after looking over my old scraping code, I copied the same pattern and put this together. bs4 and requests, forever the GOATs!

A quick analysis

Target URL: https://movie.douban.com/subject/26588308/comments

  • Paging only changes the start parameter, and each page shows 20 comments (see the URL-building sketch after the screenshot below)
  • The screenshot below shows the URL of the second page, so start for the first page is 0

[Screenshot: URL of the second page]
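
To make the pattern concrete, here is a minimal sketch that builds the first few page URLs from the start offset (the query parameters are copied from the code below):

base = "https://movie.douban.com/subject/26588308/comments"
for page in range(3):
    start = page * 20  # each page shows 20 comments
    print(base + '?start=' + str(start) + '&limit=20&status=P&sort=new_score')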

  • The comment text sits in span tags whose class attribute is short (see the parsing sketch after the screenshot below)

[Screenshot: comment text inside span tags with class "short"]
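
Here is a minimal, self-contained sketch of just the extraction step; the sample HTML is made up to stand in for a real response page:

from bs4 import BeautifulSoup

# Made-up sample HTML standing in for a real Douban comments page.
sample_html = '''
<div class="comment"><span class="short">sample comment 1</span></div>
<div class="comment"><span class="short">sample comment 2</span></div>
'''
soup = BeautifulSoup(sample_html, 'lxml')
for each in soup.find_all(name="span", attrs={"class": "short"}):
    print(each.text)  # prints the text of every span with class "short"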

Code

import urllib.request
from bs4 import BeautifulSoup
import time

absolute = "https://movie.douban.com/subject/26588308/comments"
headers = {
    # pretend to be a normal browser so the request is not rejected
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
}
comment_list = []  # all scraped comments end up here


# Parse the HTML and collect the comments
def get_data(html):
    soup = BeautifulSoup(html, 'lxml')
    if soup.string is not None:
        # the page is a single bare string, i.e. blank: signal the caller to stop
        return 0
    else:
        # grab every span tag whose class attribute is "short"
        for each in soup.find_all(name="span", attrs={"class": "short"}):
            comment_list.append(each.text)

# Fetch one page of HTML
def get_html(base_url, start):
    url = base_url + '?start=' + str(start) + '&limit=20&status=P&sort=new_score'
    print(url)
    try:
        request = urllib.request.Request(url=url, headers=headers)
        html = urllib.request.urlopen(request).read().decode("UTF-8")
        return get_data(html)  # 0 means the page was blank
    except Exception as result:
        print("error:", result)
        return 0

# Write the comments to a file, one numbered comment per line
def save_txt(data):
    with open("comments.txt", "w", newline='', encoding="utf-8") as f:
        for j, comment in enumerate(data, start=1):
            f.write('(' + str(j) + ')' + comment)
            f.write("\n")


if __name__ == '__main__':
    i = 0  # start offset, advances by 20 per page
    for j in range(0, 10000000):  # effectively "loop until we run out of pages"
        flag = get_html(absolute, i)
        time.sleep(2)  # be polite and pause between requests
        i += 20
        if flag == 0:  # a blank page (or an error) means we are done
            break
    save_txt(comment_list)
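
Since the intro mentions requests, here is a sketch of the same fetch written with requests instead of urllib; it reuses headers and get_data from the script above and assumes nothing beyond the standard requests API:

import requests

# Sketch: drop-in alternative to get_html using the requests library.
def get_html_requests(base_url, start):
    url = base_url + '?start=' + str(start) + '&limit=20&status=P&sort=new_score'
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on HTTP 4xx/5xx
        return get_data(response.text)
    except Exception as result:
        print("error:", result)
        return 0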

Result screenshots

[Screenshot: console output from the crawl, ending with the error described below]

  • The error above happens because the page after the last scraped one is blank and cannot be reached even by a normal user. Since the message is printed by our own except handler, it doubles as a sign that scraping finished. A more explicit way to tell "end of pages" apart from real failures is sketched below.
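
For instance, one small tweak (my own suggestion, not part of the original code) is to catch HTTP errors separately, so the end of the data is reported as a normal condition rather than a generic error:

import urllib.error
import urllib.request

# Sketch: treat an HTTP error as "no more pages", everything else as a real failure.
def fetch_page(url):
    try:
        request = urllib.request.Request(url=url, headers=headers)
        return urllib.request.urlopen(request).read().decode("UTF-8")
    except urllib.error.HTTPError as err:
        print("reached the end (HTTP", err.code, ")")
        return None  # the caller stops paging when it sees None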

[Screenshot: the scraped comments]


Reposted from blog.csdn.net/weixin_48931875/article/details/121496380