KNN and Rocchio Text Classification
- Problem description
Using the documents in the training set (Doc1-Doc7), build a KNN text classifier and a Rocchio text classifier, then classify the test set (Doc8-Doc9). The tf-idf feature vectors are extracted with the 'ltn' weighting scheme.
- Vectorizing the text
First, represent each document as a raw term-frequency vector according to the document-term matrix:
# train set
d1 = [2, 0, 4, 3, 0, 1, 0, 2]
d2 = [0, 2, 4, 0, 2, 3, 0, 0]
d3 = [4, 0, 1, 3, 0, 1, 0, 1]
d4 = [0, 1, 0, 2, 0, 0, 1, 0]
d5 = [0, 0, 2, 0, 0, 4, 0, 0]
d6 = [1, 1, 0, 2, 0, 1, 1, 3]
d7 = [2, 1, 3, 4, 0, 2, 0, 2]
# test set
d8 = [3, 1, 0, 4, 1, 0, 2, 1]
d9 = [0, 0, 3, 0, 1, 5, 0, 1]
Then, following the class assignments given in the problem, record the documents in their category lists:
# category
c1 = [d1, d2, d5]
c2 = [d3, d4, d6, d7]
c1_id = [0, 1, 4]
c2_id = [2, 3, 5, 6]
As required, the 't' variant is used for idf: idf = log10(N/df), where N is the number of training documents and df is the term's document frequency.
1. Counting gives each term's df (document frequency):
df:
[4, 4, 5, 5, 1, 6, 2, 4]
2. Using the formula and the document frequencies (df), the idf of each term is:
import math

def get_t_idf(df):
    N = len(c1 + c2)  # total number of training documents
    return math.log10(N / df)
idf:
[0.24303804868629444, 0.24303804868629444, 0.146128035678238, 0.146128035678238, 0.8450980400142568, 0.06694678963061322, 0.5440680443502757, 0.24303804868629444]
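The df and idf values above can be reproduced with a short self-contained sketch; the raw count vectors are copied from the document-term table above:

```python
import math

# Raw term-frequency vectors for the 7 training documents (d1-d7)
docs = [
    [2, 0, 4, 3, 0, 1, 0, 2],
    [0, 2, 4, 0, 2, 3, 0, 0],
    [4, 0, 1, 3, 0, 1, 0, 1],
    [0, 1, 0, 2, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 4, 0, 0],
    [1, 1, 0, 2, 0, 1, 1, 3],
    [2, 1, 3, 4, 0, 2, 0, 2],
]

# df: number of documents in which each term appears at least once
df = [sum(1 for d in docs if d[j] > 0) for j in range(len(docs[0]))]

# 't' idf variant: idf = log10(N / df)
N = len(docs)
idf = [math.log10(N / f) for f in df]

print(df)   # [4, 4, 5, 5, 1, 6, 2, 4]
print(idf)
```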
3. Apply the 'l' formula (tf = 1 + log10(tf), with a count of 0 mapped to 0) to turn each raw-count vector into a log-weighted tf vector:
def get_l_tf(tf):
    if tf == 0:
        return 0.0
    else:
        return 1 + math.log10(tf)
tf1 [1.3010299956639813, 0.0, 1.6020599913279625, 1.4771212547196624, 0.0, 1.0, 0.0, 1.3010299956639813]
tf2 [0.0, 1.3010299956639813, 1.6020599913279625, 0.0, 1.3010299956639813, 1.4771212547196624, 0.0, 0.0]
tf3 [1.6020599913279625, 0.0, 1.0, 1.4771212547196624, 0.0, 1.0, 0.0, 1.0]
tf4 [0.0, 1.0, 0.0, 1.3010299956639813, 0.0, 0.0, 1.0, 0.0]
tf5 [0.0, 0.0, 1.3010299956639813, 0.0, 0.0, 1.6020599913279625, 0.0, 0.0]
tf6 [1.0, 1.0, 0.0, 1.3010299956639813, 0.0, 1.0, 1.0, 1.4771212547196624]
tf7 [1.3010299956639813, 1.0, 1.4771212547196624, 1.6020599913279625, 0.0, 1.3010299956639813, 0.0, 1.3010299956639813]
Multiplying the tf weights by the idf weights gives the ltn tf-idf vector for each document:
import numpy as np

def doc2_ltn_vec(doc, t_idf_array):
    l_tf_array = []
    for tf in doc:
        l_tf_array.append(get_l_tf(tf))
    l_tf_array = np.array(l_tf_array)
    t_idf_array = np.array(t_idf_array)
    # element-wise product of the l-weighted tf vector and the idf vector
    return l_tf_array * t_idf_array
Document vectors:
Doc1: [0.3161997914285121, 0.0, 0.23410587957145018, 0.21584882741075853, 0.0, 0.06694678963061322, 0.0, 0.3161997914285121]
Doc2: [0.0, 0.3161997914285121, 0.23410587957145018, 0.0, 1.0994978993353877, 0.09888852589862468, 0.0, 0.0]
Doc3: [0.38936153417072983, 0.0, 0.146128035678238, 0.21584882741075853, 0.0, 0.06694678963061322, 0.0, 0.24303804868629444]
Doc4: [0.0, 0.24303804868629444, 0.0, 0.1901169576248441, 0.0, 0.0, 0.5440680443502757, 0.0]
Doc5: [0.0, 0.0, 0.1901169576248441, 0.0, 0.0, 0.10725277321505515, 0.0, 0.0]
Doc6: [0.24303804868629444, 0.24303804868629444, 0.0, 0.1901169576248441, 0.0, 0.06694678963061322, 0.5440680443502757, 0.35899666742011765]
Doc7: [0.3161997914285121, 0.24303804868629444, 0.21584882741075853, 0.23410587957145018, 0.0, 0.08709978142283419, 0.0, 0.3161997914285121]
Likewise, the tf-idf vectors for Doc8 and Doc9 are:
Doc8 [0.35899666742011765, 0.24303804868629444, 0.0, 0.23410587957145018, 0.8450980400142568, 0.0, 0.7078488453819499, 0.24303804868629444]
Doc9 [0.0, 0.0, 0.21584882741075853, 0.0, 0.8450980400142568, 0.11374058746900548, 0.0, 0.24303804868629444]
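As a check, the whole ltn weighting can be replayed end to end in a few lines; the sketch below recomputes the Doc1 and Doc9 vectors from the raw counts and the df values derived above:

```python
import math

def l_tf(tf):
    # 'l' variant: 1 + log10(tf), with zero counts staying zero
    return 0.0 if tf == 0 else 1 + math.log10(tf)

def ltn_vec(doc, idf):
    # ltn: l-weighted tf times t-weighted idf, no normalization
    return [l_tf(t) * w for t, w in zip(doc, idf)]

N = 7
df = [4, 4, 5, 5, 1, 6, 2, 4]
idf = [math.log10(N / f) for f in df]

d1 = [2, 0, 4, 3, 0, 1, 0, 2]
d9 = [0, 0, 3, 0, 1, 5, 0, 1]
v1 = ltn_vec(d1, idf)
v9 = ltn_vec(d9, idf)
print(v1)  # matches the Doc1 vector listed above
```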
- KNN
Take k = 3.
Cosine similarity is used to find the k training documents most similar to the test vector:
from sklearn.metrics.pairwise import cosine_similarity

def get_k_neighbour(test_vec, train_vec_list, k):
    sim_dict = {}
    for i in range(len(train_vec_list)):
        train_vec = train_vec_list[i]
        sim_dict[i] = cosine_similarity([test_vec], [train_vec])
    # keep the k training documents with the highest similarity
    sorted_dict = sorted(sim_dict.items(), key=lambda d: d[1], reverse=True)[:k]
    return sorted_dict
top 3 nearest neighbour for doc8: ['Doc6 score:0.7047938452340027', 'Doc2 score:0.6969507473943418', 'Doc4 score:0.6343414204722811']
class number: 2
top 3 nearest neighbour for doc9: ['Doc2 score:0.9265785110426313', 'Doc1 score:0.2674857884080646', 'Doc5 score:0.2672478501538577']
class number: 1
The documents most similar to Doc8 are Doc6, Doc2 and Doc4, and those most similar to Doc9 are Doc2, Doc1 and Doc5, with the cosine scores shown above. From
c1 = [d1, d2, d5]
c2 = [d3, d4, d6, d7]
we see that Doc1, Doc2 and Doc5 belong to c1, while Doc4 and Doc6 belong to c2.
Summing the neighbours' scores per class and picking the class with the highest total:
def get_class(k_neighbour, class_2_doc_id_list):
    class_num = len(class_2_doc_id_list)
    score_list = [0] * class_num
    for n in k_neighbour:
        for i in range(class_num):
            if n[0] in class_2_doc_id_list[i]:
                # cosine_similarity returns a 2D array, so n[1][0][0] is the scalar score
                score_list[i] += n[1][0][0]
    print(score_list)
    return np.argmax(score_list) + 1
Doc8 scores 0.6969507473943418 for c1 and 1.3391352657062838 for c2, so it is assigned to c2.
Doc9 scores 1.4613121496045536 for c1 and 0 for c2, so it is assigned to c1.
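The KNN decision can also be replayed without sklearn; the sketch below uses a hand-rolled cosine similarity as a stand-in for `cosine_similarity` and reproduces the class assignments above with a similarity-weighted vote:

```python
import math

def l_tf(tf):
    return 0.0 if tf == 0 else 1 + math.log10(tf)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

train = [
    [2, 0, 4, 3, 0, 1, 0, 2],
    [0, 2, 4, 0, 2, 3, 0, 0],
    [4, 0, 1, 3, 0, 1, 0, 1],
    [0, 1, 0, 2, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 4, 0, 0],
    [1, 1, 0, 2, 0, 1, 1, 3],
    [2, 1, 3, 4, 0, 2, 0, 2],
]
test = {'d8': [3, 1, 0, 4, 1, 0, 2, 1], 'd9': [0, 0, 3, 0, 1, 5, 0, 1]}
classes = [[0, 1, 4], [2, 3, 5, 6]]  # doc ids of c1 and c2

N = len(train)
df = [sum(1 for d in train if d[j] > 0) for j in range(8)]
idf = [math.log10(N / f) for f in df]
vec = lambda d: [l_tf(t) * w for t, w in zip(d, idf)]
train_vecs = [vec(d) for d in train]

def knn_class(doc, k=3):
    # rank training docs by cosine similarity, keep the top k
    sims = sorted(enumerate(cosine(vec(doc), v) for v in train_vecs),
                  key=lambda p: p[1], reverse=True)[:k]
    # similarity-weighted vote: sum the neighbours' scores per class
    totals = [sum(s for i, s in sims if i in c) for c in classes]
    return totals.index(max(totals)) + 1

print(knn_class(test['d8']))  # 2
print(knn_class(test['d9']))  # 1
```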
- Rocchio
From
c1 = [d1, d2, d5]
c2 = [d3, d4, d6, d7]
the centroid of each class is obtained by averaging the tf-idf vectors of the documents in that class:
def get_centroid(train_vec_list, class_2_doc_id_list):
    centroid_list = []
    for i in range(len(class_2_doc_id_list)):
        centroid = np.zeros(train_vec_list[0].shape[0])
        for doc_id in class_2_doc_id_list[i]:
            centroid += train_vec_list[doc_id]
        # the centroid is the mean of the class's document vectors
        centroid = centroid / len(class_2_doc_id_list[i])
        print("centroid" + str(i + 1) + ": ", list(centroid))
        centroid_list.append(centroid)
    return centroid_list
centroid1: [0.1053999304761707, 0.1053999304761707, 0.21944290558924817, 0.07194960913691952, 0.3664992997784626, 0.09102936291476436, 0.0, 0.1053999304761707]
centroid2: [0.23714984357138408, 0.18227853651472084, 0.09049421577224914, 0.2075471555579742, 0.0, 0.05524834017101515, 0.27203402217513784, 0.22955862688373108]
Using cosine similarity, compare the new document with the two centroids and find the more similar one:
def predict_with_cosine(test, centroid_list):
    sim_list = []
    test_vec = doc2_ltn_vec(test, t_idf_array)
    for c in centroid_list:
        sim_list.append(cosine_similarity([test_vec], [c]))
    return np.argmax(sim_list) + 1
The document is assigned to the class whose centroid it is most similar to.
sim: [array([[0.70476959]]), array([[0.66561148]])]
d8 belongs to class1
sim: [array([[0.89955079]]), array([[0.17194883]])]
d9 belongs to class1
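The Rocchio step can likewise be replayed with plain Python, reusing the same ltn vectors. Note that Rocchio assigns Doc8 to class 1 even though KNN put it in class 2, so the two classifiers disagree on that document:

```python
import math

def l_tf(tf):
    return 0.0 if tf == 0 else 1 + math.log10(tf)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

train = [
    [2, 0, 4, 3, 0, 1, 0, 2],
    [0, 2, 4, 0, 2, 3, 0, 0],
    [4, 0, 1, 3, 0, 1, 0, 1],
    [0, 1, 0, 2, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 4, 0, 0],
    [1, 1, 0, 2, 0, 1, 1, 3],
    [2, 1, 3, 4, 0, 2, 0, 2],
]
classes = [[0, 1, 4], [2, 3, 5, 6]]  # doc ids of c1 and c2

N = len(train)
df = [sum(1 for d in train if d[j] > 0) for j in range(8)]
idf = [math.log10(N / f) for f in df]
vec = lambda d: [l_tf(t) * w for t, w in zip(d, idf)]
train_vecs = [vec(d) for d in train]

# each centroid is the component-wise mean of its class's document vectors
centroids = []
for c in classes:
    members = [train_vecs[i] for i in c]
    centroids.append([sum(col) / len(c) for col in zip(*members)])

def rocchio_class(doc):
    sims = [cosine(vec(doc), cen) for cen in centroids]
    return sims.index(max(sims)) + 1

print(rocchio_class([3, 1, 0, 4, 1, 0, 2, 1]))  # d8 -> 1
print(rocchio_class([0, 0, 3, 0, 1, 5, 0, 1]))  # d9 -> 1
```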