Python Crawler: Scraping Article Content from 学习啦

Disclaimer: this code is for technical discussion only. If it infringes on anyone's rights, contact me and I will take it down!

It all started when I tried to freeload an essay for a politics class assignment, and the site told me that copying required scanning a QR code or paying.
Me: ???


Well, sorry.
When a hooligan knows kung fu,
nobody can stop him.

import requests
from requests.exceptions import HTTPError, ConnectionError
import re

def GetHTML(url, path):
    """Download the page and save the raw HTML to path."""
    try:
        res = requests.get(url)
        res.raise_for_status()
        # Let requests sniff the real charset from the body; the site
        # serves simplified-Chinese pages, typically GB2312/GBK.
        res.encoding = res.apparent_encoding
        # Write with GBK so DataWash can read the file back with the
        # same codec; a mismatch here would garble the Chinese text.
        with open(path, "w", encoding="gbk", errors="ignore") as MyFile:
            MyFile.write(res.text)
    except HTTPError:
        print("HTTP Error!")
    except ConnectionError:
        print("Failed to connect!")
def DataWash(path):
    """Keep the text paragraphs (lines containing </p> but no img),
    then strip HTML tags and entities."""
    mid = []
    final = []
    with open(path, "r", encoding="gbk") as ReadFile:
        MyLines = ReadFile.readlines()
        for ML in MyLines:
            # Paragraph lines carry the article text; image lines are noise.
            if re.search("img", ML) is None and re.search("</p>", ML) is not None:
                mid.append(ML)
        for i in mid:
            i = re.sub("<.+?>", " ", i)   # drop HTML tags
            i = re.sub("&.+?;", " ", i)   # drop entities such as &nbsp;
            final.append(i.strip())
    return final
def SaveFile(final, path):
    """Overwrite path with the cleaned paragraphs, one per line."""
    with open(path, "w", encoding="gbk", errors="ignore") as FinalFile:
        for i in final:
            i = i.strip()
            if i:  # skip lines that are only whitespace
                FinalFile.write(i)
                FinalFile.write("\n")

if __name__ == '__main__':
    url = input('学习啦 article URL: ')
    path = input('Output file path: ')
    GetHTML(url, path)        # raw HTML is saved to path...
    final = DataWash(path)    # ...then cleaned...
    SaveFile(final, path)     # ...and path is overwritten with plain text
    print('Done')
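The two `re.sub` calls in `DataWash` do the actual cleaning. A minimal sketch of what they produce on a made-up paragraph (the sample HTML below is hypothetical, not taken from the site):

```python
import re

# Hypothetical paragraph, shaped like the lines DataWash keeps.
sample = '<p style="text-indent: 2em;">Hello&nbsp;world, &ldquo;quoted&rdquo; text.</p>'

no_tags = re.sub("<.+?>", " ", sample)  # drop the <p>...</p> tags
clean = re.sub("&.+?;", " ", no_tags)   # drop entities such as &nbsp;

print(clean.split())  # ['Hello', 'world,', 'quoted', 'text.']
```

Note that `&.+?;` is a blunt instrument: it would also eat a literal ampersand followed by a later semicolon on the same line. The standard library's `html.unescape` is the more robust way to handle entities.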


The result: the article comes out as clean plain text, with the tags and images stripped.
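One detail worth keeping consistent: `GetHTML` saves the page to disk and `DataWash` reads it back with `"gbk"`, so both ends must use the same codec or the Chinese text turns into mojibake. A minimal round-trip sketch (the file name and sample text are made up):

```python
import os
import tempfile

text = "学习啦的文章内容"  # sample simplified-Chinese text
path = os.path.join(tempfile.gettempdir(), "xuexila_demo.txt")

# Writing and reading with the same codec round-trips cleanly;
# GBK covers simplified Chinese, which is what the site serves.
with open(path, "w", encoding="gbk") as f:
    f.write(text)
with open(path, "r", encoding="gbk") as f:
    restored = f.read()

print(restored == text)  # prints True
os.remove(path)
```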


Reposted from blog.csdn.net/weixin_43249758/article/details/102651438