Scraping Baidu Tieba with Python - 代码天地

Other 2021-03-06 06:03:09

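# -*- coding: utf-8 -*-
# Python 3 version: the reload(sys) / sys.setdefaultencoding hack is Python 2 only
# and is no longer needed, because str is Unicode by default.
import json
from multiprocessing.dummy import Pool as ThreadPool
from threading import Lock

import requests
from lxml import etree

'''Delete content.txt before re-running: the file is opened in append mode, so repeated runs keep piling up content.'''

# Several worker threads share one output file, so writes are serialized with a lock.
write_lock = Lock()

def towrite(contentdict):
    # Build the whole record first and write it once, so replies from
    # different threads do not interleave in the file.
    record = ('Reply time: ' + str(contentdict['topic_reply_time']) + '\n'
              + 'Reply content: ' + str(contentdict['topic_reply_content']) + '\n'
              + 'Author: ' + str(contentdict['user_name']) + '\n\n')
    with write_lock:
        f.write(record)

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # @class is compared as an exact string, so the trailing space from the
    # original selector is kept.
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    for each in content_field:
        # lxml already decodes HTML entities in attribute values, so data-field
        # can be parsed as JSON directly. Stripping the double quotes first
        # would make it invalid JSON.
        reply_info = json.loads(each.xpath('@data-field')[0])
        author = reply_info['author']['user_name']
        texts = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content "]/text()')
        # A reply without a text node (for example, image only) yields an empty list.
        content = texts[0] if texts else ''
        reply_time = reply_info['content']['date']
        print(content)
        print(reply_time)
        print(author)
        # Use a fresh dict per reply instead of reusing one across iterations.
        item = {
            'user_name': author,
            'topic_reply_content': content,
            'topic_reply_time': reply_time,
        }
        towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('content.txt', 'a', encoding='utf-8')
    page = []
    # Pages 1 to 20 of the thread.
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)

    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()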
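
The spider reads each reply's metadata from the data-field attribute, which holds JSON. The offline sketch below illustrates why that attribute can go straight to json.loads once an HTML parser has handled it. It is a minimal example under stated assumptions: the HTML snippet is a made-up stand-in for one reply (its values are hypothetical and real Tieba markup may differ), DataFieldCollector and parse_replies are illustrative names that are not part of the script, and the standard library's html.parser is used in place of lxml so that it runs without network access or third-party packages. Both parsers decode entities such as &quot; in attribute values.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-in for one reply block; the quotes inside data-field
# are escaped as &quot; in the raw page source.
SAMPLE_HTML = (
    '<div class="l_post l_post_bright " '
    'data-field="{&quot;author&quot;:{&quot;user_name&quot;:&quot;demo_user&quot;},'
    '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-11 16:52&quot;}}">'
    '</div>'
)


class DataFieldCollector(HTMLParser):
    """Collect the parsed data-field attribute of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'data-field' and value:
                # The parser has already turned &quot; back into ",
                # so the value is plain JSON at this point.
                self.fields.append(json.loads(value))


def parse_replies(html_text):
    """Return (user_name, date) pairs for each reply found in html_text."""
    collector = DataFieldCollector()
    collector.feed(html_text)
    collector.close()
    return [(field['author']['user_name'], field['content']['date'])
            for field in collector.fields]


replies = parse_replies(SAMPLE_HTML)
print(replies)
```

Because the entities are decoded by the parser, no manual replacement of quote characters is needed before calling json.loads; removing the quotes would leave a string that is not valid JSON.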

Reposted from blog.csdn.net/luoxiping1/article/details/79871407
