Python Text Data Processing

Category: Other | Published: 2018-07-10 16:13:47

1. Basic Text Operations

text1 = 'Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991.'
# Number of characters
print(len(text1))

# Split into words on single spaces
text2 = text1.split(' ')
print('Word count:', len(text2))
# Words longer than 3 characters
print([w for w in text2 if len(w) > 3])
# Title-cased words (first letter uppercase, the rest lowercase)
print([w for w in text2 if w.istitle()])
# Words ending with the letter s
print([w for w in text2 if w.endswith('s')])
# Count unique words
text3 = 'TO be or not to be'
text4 = text3.split(' ')
print('Word count:', len(text4))
print('Unique word count:', len(set(text4)))
# Unique word count, ignoring case
print(len({w.lower() for w in text4}))
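
Splitting on spaces leaves punctuation attached to words (for example `programming,` and `1991.` in `text1`), which skews any per-word test or count. The sketch below is one possible way to normalize tokens before counting them; the sample sentence and the variable names are made up for illustration and are not part of the original post.

```python
import string
from collections import Counter

# Hypothetical sample sentence, chosen so that punctuation sticks to some words.
text = 'TO be or not to be, that is the question.'

# Lowercase each token and strip punctuation from both ends.
words = [w.strip(string.punctuation).lower() for w in text.split()]
# Drop tokens that were pure punctuation and are now empty.
words = [w for w in words if w]

# Count how often each normalized word occurs.
counts = Counter(words)
print(counts['to'], counts['be'], len(counts))
```

`str.strip(string.punctuation)` only trims the ends of a token, so hyphenated words such as `high-level` stay intact.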

2. Text Cleaning

text5 = '            A quick brown fox jumped over the lazy dog.  '
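# Splitting on a single space yields empty strings for the leading and trailing spaces
print(text5.split(' '))
print(text5)
# strip() removes leading and trailing whitespace
text6 = text5.strip()
print(text6)
print(text6.split(' '))
# Remove the trailing newline; rstrip() returns a new string, so keep the result
text7 = 'This is a line\n'
text7 = text7.rstrip()
print(text7)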
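
Two details of the cleaning step are easy to miss: `split(' ')` and `split()` behave differently on repeated spaces, and string methods such as `rstrip()` return a new string instead of modifying the original. The following minimal sketch illustrates both; the variable names are illustrative only.

```python
text = '   A quick brown fox jumped over the lazy dog.  '

# split(' ') produces an empty string for every extra space.
by_space = text.split(' ')
# split() with no argument splits on runs of whitespace and drops empty strings.
by_whitespace = text.split()

print(by_space[:4])
print(by_whitespace)

# Strings are immutable: rstrip() returns a new string and leaves `line` unchanged.
line = 'This is a line  \n'
only_newline = line.rstrip('\n')  # removes just the trailing newline
all_trailing = line.rstrip()      # removes all trailing whitespace
print(repr(line), repr(only_newline), repr(all_trailing))
```

When only the line terminator should go and trailing spaces matter, `rstrip('\n')` is the safer choice.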

3. Regular Expressions

text8 = '"Ethics are built right into the ideals and objectives of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
print(text8)
text9 = text8.split(' ')
print(text9)
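# Find specific tokens
# Tokens starting with #
print([w for w in text9 if w.startswith('#')])
# Tokens starting with @ (this also returns the lone '@')
print([w for w in text9 if w.startswith('@')])
# Find tokens by the pattern of the characters after @
# Rule: @ followed by one or more letters, digits, or underscores
import re
print([w for w in text9 if re.search(r'@[A-Za-z0-9_]+', w)])
text10 = 'ouagadougou'
# All vowels
print(re.findall(r'[aeiou]', text10))
# All characters that are not vowels
print(re.findall(r'[^aeiou]', text10))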
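
As an alternative to splitting first and filtering afterwards, `re.findall` can pull the matches straight out of the full string. This is a sketch of that approach using the same tweet text; the variable names `tweet`, `mentions`, `hashtags`, and `handles` are illustrative only.

```python
import re

tweet = ('"Ethics are built right into the ideals and objectives of the '
         'United Nations" #UNSG @ NY Society for Ethical Culture '
         'bit.ly/2guVelr @UN @UN_Women')

# @ or # followed by one or more letters, digits, or underscores.
mentions = re.findall(r'@[A-Za-z0-9_]+', tweet)
hashtags = re.findall(r'#[A-Za-z0-9_]+', tweet)

# With a capture group, findall returns only the captured part (no leading @).
handles = re.findall(r'@([A-Za-z0-9_]+)', tweet)

print(mentions)
print(hashtags)
print(handles)
```

The lone `@` before `NY` is skipped because no word character follows it. Note that this simple pattern would also match the domain part of an e-mail address, so it may need tightening for other inputs.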

Reposted from blog.csdn.net/happy5205205/article/details/80913360
