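A beginner's log with very simple, basic content. I'm not sure how much I will update it. The goal is to automatically solve English reading-comprehension matching questions.
Topics involved:
- python
- the pandas and numpy libraries
- applying tf-idf
What you need:
- Several ready-made English reading passages with their answers (this is not machine learning; the answers are only for measuring accuracy myself)
- I personally use jupyter_notebook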
cell 1 18/10/17 Splitting the passage
import numpy as np
import pandas as pd
symbol=[".","'",'']
presymbol=[("\'s"," is"),('\'re'," are"),('n\'t'," not"),('\'ve',' have'),("\'m",' am')##contractions
,("\'",""),(","," "),('\"',""),("?","."),("!",".")]
text=[]
with open("1.txt") as file:
    text=file.read().strip().lower()
for j in presymbol:
    text=text.replace(j[0],j[1])
paragraphs=text.split("\n")
raw_passage=[]
sentence=[]
for i in paragraphs:
    if i!="": ##i is one paragraph
        temp_paragraph=[]
        for j in i.split("."):
            if j!="":
                sentence=[ word for word in j.split(" ") if (word not in symbol and word!="")]
                if sentence!=[]:
                    print (sentence)
                    temp_paragraph.append(sentence)
        raw_passage.append(temp_paragraph)
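The passage is split into three nested lists: paragraph, sentence, word. Contractions and some punctuation have to be preprocessed first. Very simple.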
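As a quick check of the splitting step, here is a minimal sketch that applies the same replacement rules to an in-memory string instead of `1.txt`. The helper name `split_passage` and the sample text are made up for illustration, and it uses `str.split()` with no argument (which splits on any whitespace) rather than `split(" ")`, so it is not a line-for-line copy of the cell above.

```python
# Same replacement rules as cell 1, in the same order (order matters:
# contractions are expanded before the remaining apostrophes are removed).
PRESYMBOL = [("'s", " is"), ("'re", " are"), ("n't", " not"), ("'ve", " have"),
             ("'m", " am"), ("'", ""), (",", " "), ('"', ""), ("?", "."), ("!", ".")]

def split_passage(text):
    """Hypothetical helper: return a paragraph -> sentence -> word nested list."""
    text = text.strip().lower()
    for old, new in PRESYMBOL:
        text = text.replace(old, new)
    passage = []
    for paragraph in text.split("\n"):
        # Split into sentences on ".", then into words on whitespace.
        sentences = [s.split() for s in paragraph.split(".")]
        sentences = [s for s in sentences if s]  # drop empty sentences
        if sentences:  # skip blank lines between paragraphs
            passage.append(sentences)
    return passage

sample = "I'm here, aren't you?\n\nIt's late. We've eaten!"
result = split_passage(sample)
print(result)
```

Note that the `'s` rule also turns possessives into " is" (for example "Tom's book" becomes "tom is book"), which is a limitation of this simple approach.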
cell 2 18/10/18 Collecting the words that appear
appeared_word=set()
for rp in raw_passage:
    for rs in rp:
        for word in rs:
            appeared_word.add(word)
Very simple and basic. It is a separate cell only so that I can put appeared_word on the last line and inspect it directly.
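The same vocabulary can also be built in one expression. The sketch below uses a small made-up `raw_passage` so it runs on its own; in the notebook the real one from cell 1 would be used instead.

```python
# Toy stand-in for raw_passage (paragraph -> sentence -> word).
raw_passage = [[["the", "cat", "sat"]], [["the", "dog", "ran"], ["a", "cat", "ran"]]]

# A set comprehension flattens all three levels at once.
appeared_word = {word for rp in raw_passage for rs in rp for word in rs}

# A set has no defined order, so sort it when a stable display is wanted.
print(sorted(appeared_word))
```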
cell 3 18/10/18 Initializing the DataFrame and counting words
vocab=sorted(appeared_word) ##fixed column order (a set has no defined order)
word_count=pd.DataFrame(np.zeros((len(raw_passage),len(vocab)),dtype=int),index=range(1,len(raw_passage)+1),columns=vocab)
for i,rp in enumerate(raw_passage):
    for rs in rp:
        for word in rs:
            word_count.loc[i+1,word]+=1 ##row i+1 is paragraph i+1
total=word_count.sum(axis=0).to_frame(0).T ##row 0 holds the totals over all paragraphs
words_data=pd.concat([total,word_count]) ##DataFrame.append was removed in pandas 2.0
Since I'm still a beginner, I spent a long time looking up pandas functions for this part; there may be a simpler way to write it.
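One possibly simpler way, sketched here under the assumption that only per-paragraph counts plus a totals row are needed: count each paragraph with `collections.Counter` and let pandas align the words into columns. The toy `raw_passage` is made up so the block runs on its own, and putting the totals in row 0 is a choice made for this sketch.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for raw_passage (paragraph -> sentence -> word).
raw_passage = [[["the", "cat", "sat"]], [["the", "dog", "ran"], ["a", "cat", "ran"]]]

# One Counter per paragraph, counting words across all of its sentences.
counts = [Counter(word for rs in rp for word in rs) for rp in raw_passage]

# pandas turns the keys into columns; words missing from a paragraph
# come out as NaN, so fill them with 0 and convert back to int.
word_count = pd.DataFrame(counts, index=range(1, len(counts) + 1)).fillna(0).astype(int)
word_count = word_count.sort_index(axis=1)  # alphabetical column order

# Row 0 holds the totals over all paragraphs.
total = word_count.sum(axis=0).to_frame(0).T
words_data = pd.concat([total, word_count])
print(words_data)
```

This avoids pre-allocating a zero matrix and hard-coding the vocabulary size.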
To reduce the influence of common words such as "and" during text matching, tf-idf can be used.
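A minimal sketch of the idea, treating each paragraph as one document. The post does not say which tf-idf variant it uses, so the formula here is an assumption: tf is the word's count divided by the paragraph length, and idf is log(N / df), where N is the number of paragraphs and df is the number of paragraphs containing the word. The sample paragraphs are made up.

```python
import math
from collections import Counter

# Each paragraph flattened to a single word list (toy data).
paragraphs = [
    ["the", "cat", "sat", "and", "the", "cat", "slept"],
    ["the", "dog", "ran", "and", "barked"],
    ["a", "bird", "sang", "and", "flew"],
]

n_docs = len(paragraphs)
# Document frequency: in how many paragraphs does each word appear?
df = Counter(word for words in paragraphs for word in set(words))

def tf_idf(words):
    """Return {word: tf * idf} for one paragraph (assumed variant, see above)."""
    counts = Counter(words)
    return {w: (c / len(words)) * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = [tf_idf(words) for words in paragraphs]
for i, w in enumerate(weights, start=1):
    print(i, sorted(w.items(), key=lambda kv: -kv[1])[:3])
```

With this variant, a word that appears in every paragraph (here "and") gets weight 0, while a word concentrated in one paragraph (here "cat") gets a high weight, which is the effect wanted for matching.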