Splitting an article with Python

A beginner's log with very simple, basic content; not sure how much it will be updated. The goal is to solve English reading matching questions automatically.

Knowledge involved:
  • Python
  • the pandas and numpy libraries
  • applying tf-idf
Preparation:
  • several ready-made English reading passages and their answers (not machine learning; the answers are just for checking my own accuracy)
  • I use jupyter_notebook
cell 1 18/10/17 Splitting the article
import numpy as np
import pandas as pd
symbol=[".","'",'']
presymbol=[("'s"," is"),("'re"," are"),("n't"," not"),("'ve"," have"),("'m"," am"),  # contractions
           ("'",""),(","," "),('"',""),("?","."),("!",".")]

text=[]
with open("1.txt") as file:
    text=file.read().strip().lower()
    for j in presymbol:
        text=text.replace(j[0],j[1])
    paragraphs=text.split("\n")
    
raw_passage=[]
sentence=[]
for i in paragraphs:
    if i!="":   # i is one paragraph
        temp_paragraph=[]
        for j in i.split("."):
            if j!="":
                sentence=[ word for word in j.split(" ") if (word not in symbol and word!="")]
                if sentence!=[]:
                    print (sentence)
                    temp_paragraph.append(sentence)
        raw_passage.append(temp_paragraph)

The split produces three nested lists, one level per unit: paragraph, then sentence, then word. Contractions and some punctuation need a bit of preprocessing first; very simple.
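As a sanity check, the same splitting steps can be run on a tiny made-up sample (the two-paragraph string below is hypothetical, not the post's 1.txt):

```python
# Minimal sketch of the cell-1 splitting logic on a hypothetical sample
sample = "i am happy. it works!\nshe is here."

presymbol = [("'s", " is"), ("'re", " are"), ("n't", " not"),
             ("'ve", " have"), ("'m", " am"),
             ("'", ""), (",", " "), ('"', ""), ("?", "."), ("!", ".")]

text = sample.strip().lower()
for old, new in presymbol:
    text = text.replace(old, new)

raw_passage = []
for paragraph in text.split("\n"):          # level 1: paragraphs
    if paragraph:
        temp_paragraph = []
        for sent in paragraph.split("."):   # level 2: sentences
            words = [w for w in sent.split(" ") if w]   # level 3: words
            if words:
                temp_paragraph.append(words)
        raw_passage.append(temp_paragraph)

print(raw_passage)
# → [[['i', 'am', 'happy'], ['it', 'works']], [['she', 'is', 'here']]]
```

The `!` is first normalized to `.`, so "it works!" becomes its own sentence in the first paragraph.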

cell 2 18/10/18 Collecting the words that appear
appeared_word=set()
for rp in raw_passage:
    for rs in rp:
        for word in rs:
            appeared_word.add(word)

Very simple and basic; it is a separate cell only so that appeared_word can be inspected directly at the bottom.

cell 3 18/10/18 Initializing the DataFrame and counting words
# index starts at 1 so row i+1 holds paragraph i's counts; columns must be a
# list, not a set (newer pandas rejects sets), so sort for a stable order
word_count = pd.DataFrame(np.zeros((len(raw_passage), len(appeared_word)), dtype=int),
                          index=range(1, len(raw_passage) + 1),
                          columns=sorted(appeared_word))
for i, rp in enumerate(raw_passage):
    for rs in rp:
        for word in rs:
            word_count.loc[i + 1, word] += 1
# prepend a totals row at index 0; pd.concat replaces the removed DataFrame.append,
# and summing the DataFrame directly avoids hard-coding the vocabulary size
totals = word_count.sum(axis=0).to_frame().T
totals.index = [0]
words_data = pd.concat([totals, word_count])
Since I am still a beginner, this part took a long time of looking up pandas functions; there may be simpler ways to write it.
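One simpler alternative: build one collections.Counter per paragraph and let pandas align the columns, so no zero matrix or vocabulary size is needed up front. A minimal sketch with a hypothetical two-paragraph passage already split into word lists:

```python
from collections import Counter
import pandas as pd

# Hypothetical input in the paragraph -> sentence -> word shape from cell 1
raw_passage = [[["the", "cat"], ["the", "dog"]],
               [["a", "cat"]]]

# One Counter per paragraph; DataFrame-from-dicts aligns columns automatically
rows = [Counter(word for sent in rp for word in sent) for rp in raw_passage]
word_count = (pd.DataFrame(rows, index=range(1, len(raw_passage) + 1))
                .fillna(0).astype(int))

# Prepend a totals row at index 0, summing whatever columns exist
totals = word_count.sum(axis=0).to_frame().T
totals.index = [0]
words_data = pd.concat([totals, word_count])
```

Words missing from a paragraph come back as NaN, hence the `fillna(0).astype(int)`.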


To reduce the influence of words like "and" during text matching, tf-idf can be used.
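A minimal numpy/pandas sketch of the idea, on a hypothetical count matrix (rows = paragraphs, columns = words): a word like "and" that occurs in every paragraph gets idf = log(1) = 0, so its tf-idf weight vanishes and it no longer dominates the matching.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: 3 paragraphs, 3 words
word_count = pd.DataFrame([[2, 1, 0],
                           [1, 0, 1],
                           [1, 1, 1]],
                          columns=["and", "cat", "dog"], index=[1, 2, 3])

# tf: term frequency, normalized within each paragraph
tf = word_count.div(word_count.sum(axis=1), axis=0)

# idf: log(N / number of paragraphs containing the word)
n_paragraphs = len(word_count)
df = (word_count > 0).sum(axis=0)
idf = np.log(n_paragraphs / df)

tfidf = tf * idf
# "and" appears in all 3 paragraphs -> idf = log(3/3) = 0 -> whole column is 0
```

This is the plain tf-idf definition; real libraries (e.g. scikit-learn's TfidfVectorizer) apply extra smoothing and normalization on top of it.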

Reprinted from www.cnblogs.com/bot-noob-121/p/9807712.html