Reading a Web Page and Segmenting Its Text with Python - 代码天地

Category: Other | 2018-07-02 20:39:43

Code:

import requests
from bs4 import BeautifulSoup
import jieba

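# Fetch the HTML
url = "http://finance.ifeng.com/a/20180328/16049779_0.shtml"
res = requests.get(url, timeout=10)
res.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
res.encoding = 'utf-8'  # decode the response body as UTF-8
content = res.text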

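# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
div = soup.find(id='main_content')
if div is None:
    raise SystemExit("no element with id 'main_content' found")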

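# Write the article text to a file
filename = 'news.txt'
with open(filename, 'w', encoding='utf-8') as file_object:
    # Write one line per <p> tag; findChildren() would also return nested tags and duplicate their text
    for line in div.find_all('p'):
        file_object.write(line.get_text() + '\n')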

# Try the jieba segmenter on sample sentences
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg_list))

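# Segment the saved article line by line and write the result
with open(filename, 'r', encoding='utf-8') as file_object:
    with open('cut_news.txt', 'w', encoding='utf-8') as file_cut_object:
        for line in file_object:
            # Strip the newline first so it is not emitted as a token
            seg_list = jieba.cut(line.strip(), cut_all=False)
            file_cut_object.write('/'.join(seg_list) + '\n')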
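The paragraph-extraction step can be tried on its own, without network access, by feeding BeautifulSoup an inline HTML string. The following is a minimal sketch, assuming the article body sits in an element with id `main_content` as in the script above. The helper name `extract_paragraphs` and the sample HTML are made up for illustration. It uses `find_all("p")` rather than an argument-less `findChildren()`, since the latter returns every descendant tag and would yield the text of nested tags more than once.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html, container_id="main_content"):
    """Return the text of each non-empty <p> inside the container.

    `extract_paragraphs` is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # soup.find returns None when no element matches
        return []
    paragraphs = []
    for p in container.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return paragraphs


# Made-up HTML used only for illustration
sample_html = """
<html><body>
  <div id="main_content">
    <p>First <strong>bold</strong> paragraph.</p>
    <p></p>
    <p>Second paragraph.</p>
  </div>
  <div id="sidebar"><p>Unrelated text.</p></div>
</body></html>
"""

paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Returning an empty list when the container is missing lets the caller decide whether that is an error, instead of failing with an `AttributeError` on `None`.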

Scraping result (news.txt): [screenshot not preserved]

Segmentation result (cut_news.txt): [screenshot not preserved]

Reposted from blog.csdn.net/u013288190/article/details/79736198
