Parsing a paragraph in Python: detecting sentences without punctuation

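Given a passage that is missing punctuation between sentences, how can we split it into sentences? For example, suppose we need to split the following text: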

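The steps include: increasing the efficiency of mobile networks, data centers, data transmission, and spectrum allocation; reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer; making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment. If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.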

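Existing natural language processing toolkits such as NLTK do not handle this well, because they rely mainly on punctuation to identify sentence boundaries. Checking for capital letters alone is not reliable either, because some sentences do not begin with a capital letter.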
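To see why punctuation-driven splitting falls short, here is a minimal sketch that uses a regular expression as a stand-in for such a splitter. This is not NLTK's actual algorithm; `split_on_punctuation` and the sample string are made up for illustration.

```python
import re

def split_on_punctuation(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Three sentences, but only the last one ends with a period.
sample = 'The plan is simple It has three steps Each one is cheap.'
parts = split_on_punctuation(sample)
print(len(parts))  # 1: the whole text comes back as a single "sentence"
```

Because the only period sits at the very end, the splitter finds no boundary inside the text.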

2. Solution
We can approach the problem as follows:

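  1. Split the text into words.
  2. For each word, check whether it starts with an uppercase letter or is followed by a punctuation mark.
  3. If either condition holds, treat the text before the word as a complete sentence.
  4. Otherwise, add the word to the current sentence.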

The following Python code implements this algorithm:

paragraph = 'Steps toward this goal include: Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.'

# Capitalized words that do not start a sentence. This set is an assumption
# made for this sample text; extend it for other inputs.
PROPER_NOUNS = {'Internet'}
SENTENCE_END = ('.', '!', '?')

sentences = []
current = []  # words of the sentence being built
for word in paragraph.split():
    # A capitalized word (other than a known proper noun) starts a new sentence.
    if current and word[0].isupper() and word not in PROPER_NOUNS:
        sentences.append(' '.join(current))
        current = []
    current.append(word)
    # Sentence-ending punctuation closes the current sentence.
    if word.endswith(SENTENCE_END):
        sentences.append(' '.join(current))
        current = []
if current:
    sentences.append(' '.join(current))
print(sentences)

Output:

['Steps toward this goal include:', 'Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation', 'Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer', 'Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment', 'If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.']
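To try the capitalization heuristic on other text, it can be wrapped in a small function. The following is only a sketch: `split_sentences` and its `proper_nouns` parameter are hypothetical names, and the caller has to supply the set of capitalized words that should not trigger a split.

```python
def split_sentences(text, proper_nouns=frozenset()):
    """Split text before capitalized words that are not listed as proper nouns."""
    sentences, current = [], []
    for word in text.split():
        # Start a new sentence at a capitalized word, unless it is a known proper noun.
        if current and word[0].isupper() and word not in proper_nouns:
            sentences.append(' '.join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(' '.join(current))
    return sentences

demo = 'The plan is simple It relies on the Internet Each step is cheap'
result = split_sentences(demo, proper_nouns={'Internet'})
print(result)
# ['The plan is simple', 'It relies on the Internet', 'Each step is cheap']
```

Without the `proper_nouns` argument, the same input is also split before `Internet`, which shows how much the heuristic depends on that list. The lookup is an exact match, so a proper noun with punctuation attached (for example `Internet,`) would not be recognized by this sketch.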


Reposted from blog.csdn.net/D0126_/article/details/143160935