Chinese text similarity calculation with text2vec (solving a simple matching problem)


This post solves the following problem: given a corpus of reference sentences, take a piece of text entered by a user and work out which sentence in the corpus is most similar to it.
The data and code for this article are available on GitHub: https://github.com/UserXiaohu/chinese_similarity

Data format and content

  1. The corpus, which serves as the reference data to compare against. Its format is as follows (base_content.csv):
 ,key_text
0,我今天用了支付宝买了东西
1,我今天用了微信买了东西
2,今天上班遇到一个漂亮的女孩,她长的很好看。
3,今天上班遇到一个帅气的男孩,他长的很帅气。
4,早上过马路遇到一个老奶奶,我扶她过了马路
5,早上过马路遇到一个老爷爷,我扶他过了马路
  2. The file to be compared, which records the user input strings to match against the corpus. Its format is as follows (demo.csv); a short script that recreates both sample files follows this listing:
text
遇到一个女孩,他很漂亮
老奶奶摔倒了,我扶她起来
早晨用微信买了个包子
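
For reference, here is a minimal pandas sketch that recreates both sample files (the third demo line is the input that appears in the run results at the end of this post):

import pandas as pd

# Reference corpus: written with the index, so key_text ends up as column 1 when re-read.
corpus = [
    "我今天用了支付宝买了东西",
    "我今天用了微信买了东西",
    "今天上班遇到一个漂亮的女孩,她长的很好看。",
    "今天上班遇到一个帅气的男孩,他长的很帅气。",
    "早上过马路遇到一个老奶奶,我扶她过了马路",
    "早上过马路遇到一个老爷爷,我扶他过了马路",
]
pd.DataFrame(corpus, columns=["key_text"]).to_csv("base_content.csv", index=True)

# User inputs: written without the index, so the text ends up as column 0 when re-read.
inputs = ["遇到一个女孩,他很漂亮", "老奶奶摔倒了,我扶她起来", "早晨用微信买了个包子"]
pd.DataFrame(inputs, columns=["text"]).to_csv("demo.csv", index=False)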

Code flow and design

Load and read data

Here we load both the corpus and the file to be compared. I defined a small reading helper as follows.

"""
加载初始数据信息
str:文件传输路径
index:所需真实值索引列表
"""
def init_data(str, index):
  dream_data = pd.read_csv(str)
  return dream_data.values[:, index]
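
For example, with the files above, the corpus text sits in column 1 (column 0 is the unnamed index pandas wrote) and the demo inputs sit in column 0:

key_arr = init_data('base_content.csv', 1)   # corpus sentences
demo_arr = init_data('demo.csv', 0)          # user inputs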

Processing text data

Processing the text mainly means removing stop words and single characters (a single character is kept if it appears in the keyword list). The stop-word and keyword lists are loaded once at module level rather than inside the function.

"""
初始化获取停用词表
"""
stop = open('stop_word.txt', 'r+', encoding='utf-8')
stopword = stop.read().split("\n")
key = open('key_word.txt', 'r+', encoding='utf-8')
keyword = key.read().split("\n")

"""
对文本内容进行过滤
1。过滤停用词
2。结合关键词/字过滤
"""
rd = key.read().split("\n")

def strip_word(seg):
  # 打开写入关键词的文件
  jieba.load_userdict("./key_word.txt")
  # print("去停用词:\n")
  wordlist = []
  # 获取关键字
  keywords = jieba.analyse.extract_tags(seg, topK=5, withWeight=False, allowPOS=('n'))
  # 遍历分词表
  for key in jieba.cut(seg):
      # print(key)
      # 去除停用词,去除单字且不在关键词库,去除重复词
      if not (key.strip() in stopword) and (len(key.strip()) > 1 or key.strip() in keyword) and not (
              key.strip() in wordlist):
          wordlist.append(key)
          # print(key)
  # 停用词去除END
  stop.close()
  return ''.join(wordlist)
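
A quick sanity check of the filter (the exact tokens kept depend on the contents of stop_word.txt and key_word.txt):

print(strip_word("今天上班遇到一个漂亮的女孩,她长的很好看。"))
# With a typical stop-word list this prints something like: 上班遇到漂亮女孩好看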

Text data comparison

Compare the input text against every entry in the corpus and record a similarity score for each.

"""
通过text2vec词向量模型计算
出来两段处理后的文本相似度
"""


def similarity_calculation(str_arr, str_2):
  sim = Similarity()
  str_2 = strip_word(str_2)
  result = []
  for item in str_arr:
      #这里可以将base提前处理好导出备用,以达到优化目的
      item = strip_word(item)
      result.append(sim.get_score(item, str_2))
  return result
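
As the comment in the loop suggests, both the model and the stripped corpus can be prepared once instead of on every query. A minimal sketch of that optimization (the names here are mine; it assumes key_arr has already been loaded with init_data):

sim = Similarity()                                   # load the model once
base_stripped = [strip_word(s) for s in key_arr]     # preprocess the corpus once

def similarity_calculation_cached(query):
    query = strip_word(query)
    return [sim.get_score(item, query) for item in base_stripped]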

Complete code

from text2vec import Similarity
import jieba
import heapq
import pandas as pd

"""
初始化获取停用词表
"""
stop = open('stop_word.txt', 'r+', encoding='utf-8')
stopword = stop.read().split("\n")
key = open('key_word.txt', 'r+', encoding='utf-8')
keyword = key.read().split("\n")

"""
加载初始数据信息
str:文件传输路径
index:所需真实值索引列表
"""


def init_data(str, index):
  dream_data = pd.read_csv(str)
  return dream_data.values[:, index]

"""
对文本内容进行过滤
1。过滤停用词
2。结合关键词/字过滤
"""
rd = key.read().split("\n")

def strip_word(seg):
  # 打开写入关键词的文件
  jieba.load_userdict("./key_word.txt")
  # print("去停用词:\n")
  wordlist = []
  # 获取关键字
  keywords = jieba.analyse.extract_tags(seg, topK=5, withWeight=False, allowPOS=('n'))
  # 遍历分词表
  for key in jieba.cut(seg):
      # print(key)
      # 去除停用词,去除单字且不在关键词库,去除重复词
      if not (key.strip() in stopword) and (len(key.strip()) > 1 or key.strip() in keyword) and not (
              key.strip() in wordlist):
          wordlist.append(key)
          # print(key)
  # 停用词去除END
  stop.close()
  return ''.join(wordlist)



"""
通过text2vec词向量模型计算
出来两段处理后的文本相似度
"""


def similarity_calculation(str_arr, str_2):
  sim = Similarity()
  str_2 = strip_word(str_2)
  result = []
  for item in str_arr:
      #这里可以将base提前处理好导出备用,以达到优化目的
      item = strip_word(item)
      result.append(sim.get_score(item, str_2))
  return result

"""
将用户细节文本描述
转换为关键词文本
"""

def deal_init_data(text_data):
  text_arr = []
  for item in text_data:
      # 做关键词提取
      text_arr.append(strip_word(item))
  key_words = pd.DataFrame(text_arr, columns=['key_text'])
  key_words.to_csv('base_content.csv', sep=',', header=True, index=True)
  return key_words
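
# Optional one-off preprocessing (not called in main below): convert a file of
# raw sentences into the keyword corpus. 'raw_content.csv' is a hypothetical name.
#   raw = init_data('raw_content.csv', 1)
#   deal_init_data(raw)  # writes base_content.csv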
  


if __name__ == '__main__':
    # Read the corpus keywords to compare against.
    key_arr = init_data('base_content.csv', 1)
    # Read the user inputs to be compared.
    demo_arr = init_data('demo.csv', 0)
    # Compare every input against the corpus.
    for index, item in enumerate(demo_arr):
        result = similarity_calculation(key_arr, item)
        # Take the three highest scores and map each score back to its corpus index
        # (result.index returns the first occurrence, so exact ties share an index).
        re1 = map(result.index, heapq.nlargest(3, result))
        re2 = heapq.nlargest(3, result)
        print("Input:", item)
        for i, val in enumerate(list(re1)):
            print(i + 1, ". Match:", key_arr[val], ", similarity:", re2[i])

Note: text2vec must be installed first (pip install text2vec); jieba and pandas are also required.

Run results:

Input: 遇到一个女孩,他很漂亮
1 . Match: 今天上班遇到一个漂亮的女孩,她长的很好看。 , similarity: 0.9014007595177241
2 . Match: 今天上班遇到一个帅气的男孩,他长的很帅气。 , similarity: 0.8269612688009047
3 . Match: 早上过马路遇到一个老奶奶,我扶她过了马路 , similarity: 0.7627239914767185

Input: 老奶奶摔倒了,我扶她起来
1 . Match: 早上过马路遇到一个老奶奶,我扶她过了马路 , similarity: 0.8301321157294765
2 . Match: 早上过马路遇到一个老爷爷,我扶他过了马路 , similarity: 0.8151511340475789
3 . Match: 今天上班遇到一个帅气的男孩,他长的很帅气。 , similarity: 0.6141663077445291

Input: 早晨用微信买了个包子
1 . Match: 我今天用了微信买了东西 , similarity: 0.7883183438706154
2 . Match: 我今天用了支付宝买了东西 , similarity: 0.7377135198420246
3 . Match: 早上过马路遇到一个老爷爷,我扶他过了马路 , similarity: 0.7074578367274513

Source: blog.csdn.net/m0_47220500/article/details/106059669