Crawl All Blog Posts

Crawl the content of every blog post and convert it to PDF format.

import re

import requests
from bs4 import BeautifulSoup
import pdfkit  # for converting the saved HTML pages to PDF (not called in this listing)


def getPagehtml(url):  # fetch the HTML source of a page
    response = requests.get(url)
    return response.text
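
# Note: if CSDN returns an incomplete page for the default requests User-Agent,
# a hedged variant of getPagehtml with a browser-style header can help
# (the header value below is an assumption, not part of the original post):
# def getPagehtml(url):
#     headers = {'User-Agent': 'Mozilla/5.0'}
#     response = requests.get(url, headers=headers)
#     response.encoding = response.apparent_encoding
#     return response.text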


def createurl(text):  # extract every article URL from the page source
    '''
    Matches article links of the form:
    <a href="https://blog.csdn.net/qq_41911569/article/details/83034422" target="_blank"><span class="article-type type-1">原</span>爬取猫眼电影</a>
    :param text: page source returned by getPagehtml
    :return: list of article URLs
    '''
    pattern = r'<a href="(https://blog.csdn.net/qq_41911569/article/.*?)" target="_blank">'
    return re.findall(pattern,text)

# quick check: fetch the first list page and print the extracted article URLs
url = 'https://blog.csdn.net/qq_41911569'
text = getPagehtml(url)
print(createurl(text))


def get_blog_content(i, url):  # fetch a single article by URL and save its content to a local file
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    # grab the <head> tag so the saved file keeps the original meta/CSS information
    head = soup.head
    # article title
    title = soup.find_all(class_="title-article")[0].get_text()
    # article body
    content = soup.find_all(class_="article_content")[0]
    # write the article to a local HTML file
    with open('/home/kiosk/Desktop/python笔记/python_stack/day26/bs/westos%d.html' % i, 'w', encoding='utf-8') as f:
        f.write(str(head))
        f.write('<h1>%s</h1>\n\n' % title)
        f.write(str(content))

def main():
    # the article list spans 3 pages, e.g. https://blog.csdn.net/qq_41911569/article/list/3
    article_url = []
    for i in range(3):
        url = 'https://blog.csdn.net/qq_41911569/article/list/%d' % (i+1)
        text = getPagehtml(url)
        article_url.append(createurl(text))
    # flatten the per-page URL lists into a single list
    article_url = [j for i in article_url for j in i]

    # print(article_url)
    # de-duplicate the URLs and download each article
    for i, v in enumerate(set(article_url)):
        get_blog_content(i,v)


if __name__ == '__main__':
    main()
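
The listing above only saves each article as a local HTML file; pdfkit is imported but never called. A minimal sketch of the missing HTML-to-PDF step, assuming wkhtmltopdf is installed on the system and the files were written to the directory used above (the glob pattern and the output filename are assumptions):

import glob
import pdfkit

# collect the saved article pages and merge them into one PDF
html_files = sorted(glob.glob('/home/kiosk/Desktop/python笔记/python_stack/day26/bs/westos*.html'))
pdfkit.from_file(html_files, 'blog.pdf')  # pdfkit.from_file accepts a list of input files

pdfkit is only a thin wrapper around the wkhtmltopdf command-line tool, so that binary has to be available on PATH for the conversion to work.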

Result:
[screenshot of the result]

Reposted from blog.csdn.net/qq_41911569/article/details/83044467