[Python] Automatic WeChat question answering based on text matching


Abstract

  Traditional rule-based and template-based question-answering systems are inefficient and cover few questions when the number of questions is large. To address this, this paper proposes a design for a question-answering robot based on text matching. The user's question is segmented into words, converted into a feature vector, and compared for similarity against a prepared question data set; the answer paired with the most similar question is then sent to the user. In the author's tests the robot answered questions accurately and with good coverage, which suggests it can be applied to intelligent customer service, intelligent consulting, and similar fields.

Keywords: question answering robot, text matching, word segmentation, feature extraction, similarity matching


1. Introduction

  With the continuous development of Internet technology, artificial intelligence has come into wide use in many fields. Question-answering robots are one such application, with broad prospects in areas such as intelligent customer service and intelligent consultation. Traditional question-answering systems match questions against rules and templates. Because those rules and templates must be maintained by hand, such systems have narrow coverage and low efficiency, which limits their practical use. How to design an efficient, adaptive, and accurate question-answering robot has therefore become an active research topic.
  This paper proposes a design for a question-answering robot based on text matching. The user's question is converted into a vector through word segmentation and feature extraction and is then compared for similarity against a prepared question data set. The most similar question is found, and its corresponding answer is sent to the user. The implementation uses the Python programming language and several third-party libraries.


2. Related work

  Traditional question-answering systems match questions against rules and templates. Because the rules and templates must be written by hand, these systems are inefficient and apply to a limited range of scenarios. In recent years, with the continuous development of natural language processing, question answering based on text matching has become an active research topic. Text matching techniques fall into two types: vocabulary matching and semantic matching. Vocabulary matching relies on string matching techniques such as regular expressions and Levenshtein distance. It is simple and fast, but it ignores semantic information and is therefore prone to matching errors.
  Semantic matching pays more attention to meaning: natural language processing is used to analyze the semantics of the input question, which is then matched against a pre-processed question bank. The commonly used semantic matching techniques are methods based on word vectors and methods based on deep learning.
  This code represents each question as a bag-of-words count vector. Because these vectors count words rather than encode meaning, the approach is closer to vocabulary matching than to semantic matching. The input question and the questions in the pre-processed question bank are expressed as vectors, the similarity between them is calculated, and the best-matching question and its corresponding answer are selected. Chinese word segmentation is performed with the jieba word segmentation tool, feature extraction with CountVectorizer, and the similarity is calculated with np.dot.
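  As a concrete illustration of vocabulary-based matching, the sketch below computes Levenshtein distance with the standard dynamic-programming recurrence. The function name `levenshtein` is chosen here for illustration and is not part of the code in this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    # prev[j] is the distance between the already-processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

  Two questions that mean the same thing but are worded differently can still be far apart under this measure, which is the weakness of vocabulary matching described above.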


3. Method description

  The code is divided into three steps: data preprocessing, feature extraction, and text matching. The steps are as follows:

1. Data preprocessing

  First, the question data and answer data are read in. Line i of the question file corresponds to line i of the answer file. Each question is segmented with the jieba word segmentation tool, and the resulting tokens are joined into a single space-separated string to prepare for feature extraction.
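  The preprocessing step can be sketched as a few small helpers. This is a minimal sketch under stated assumptions: the names `load_lines`, `segment`, and `load_qa` are hypothetical, blank lines are skipped (the article's code keeps them), and a length check is added so that a mismatch between the two files is reported early.

```python
def load_lines(path):
    """Read a UTF-8 text file and return its non-empty lines, stripped."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def segment(text, cut=None):
    """Return the tokens of `text` joined by single spaces.

    `cut` is any function mapping a string to a list of tokens. When it is
    omitted, jieba.lcut is used, as in the article's code.
    """
    if cut is None:
        import jieba  # lazy import: only needed for the default tokenizer
        cut = jieba.lcut
    # Drop whitespace-only tokens so the result has exactly one space between tokens
    return " ".join(tok for tok in cut(text) if tok.strip())


def load_qa(question_path, answer_path, cut=None):
    """Load parallel question/answer files and segment the questions."""
    questions = load_lines(question_path)
    answers = load_lines(answer_path)
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers; "
            "line i of one file must correspond to line i of the other"
        )
    return [segment(q, cut) for q in questions], answers
```

  With jieba installed, `load_qa('questions.txt', 'answers.txt')` would return the segmented questions and the raw answers in matching order.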

2. Feature extraction

  CountVectorizer is used for feature extraction. It builds a vocabulary from the segmented questions and represents each question as a vector of term counts.
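  One detail is worth checking when CountVectorizer is applied to segmented Chinese text: its default `token_pattern` only keeps tokens of two or more word characters, so single-character words are dropped from the vocabulary. The sketch below compares the default with a pattern that keeps them. The sample sentences are illustrative and are not taken from the article's data, and the alternative pattern is a suggestion rather than what the article's code uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-segmented questions (tokens separated by spaces), as produced in step 1.
# Illustrative sample data, not from the article.
questions = ["我 想 添加 对话", "你 叫 什么 名字"]

# Default settings: a token must contain at least two word characters.
default_vec = CountVectorizer()
default_vec.fit(questions)

# Assumption: a pattern that also keeps single-character tokens.
full_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
full_vec.fit(questions)

print(sorted(default_vec.vocabulary_))  # two-character words only
print(sorted(full_vec.vocabulary_))     # all eight tokens
```

  Whether single-character words help matching depends on the question set; keeping them enlarges the vocabulary and lets short questions produce non-zero vectors.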

3. Text matching

  The input question is transformed into a count vector using the same vocabulary as the questions in the pre-processed question bank. The dot product between the input vector and each stored question vector is then calculated, and the question with the highest score, together with its corresponding answer, is taken as the best match.
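  The matching step can be sketched with plain NumPy arrays. The function `best_match` and the toy count matrix are hypothetical. The `normalize=True` branch uses cosine similarity, which is a variation and not what the article's code does; it is included because a raw dot product on counts tends to favour long questions that repeat many words.

```python
import numpy as np


def best_match(query_vec, question_matrix, normalize=False):
    """Return (index of the best-matching question, score per question).

    With normalize=False the score is the raw dot product of term counts.
    With normalize=True the score is cosine similarity.
    """
    q = np.asarray(query_vec, dtype=float)
    X = np.asarray(question_matrix, dtype=float)
    scores = X @ q
    if normalize:
        norms = np.linalg.norm(X, axis=1) * np.linalg.norm(q)
        # Leave the score at 0 where a vector is all zeros (no known words)
        scores = np.divide(scores, norms, out=np.zeros_like(scores), where=norms > 0)
    return int(np.argmax(scores)), scores


# Toy count matrix: row 0 is a short question, row 1 a long one
X = [[1, 1, 0, 0],
     [2, 2, 2, 2]]
query = [1, 1, 0, 0]  # identical to row 0

print(best_match(query, X)[0])                  # 1: the long question wins
print(best_match(query, X, normalize=True)[0])  # 0: the exact match wins
```

  Note that when the input shares no vocabulary with any stored question, every score is zero and `argmax` returns index 0, so the first answer in the file would be sent. A minimum-score threshold is one possible guard against this.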


4. Code

The complete code is as follows:

import jieba
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from wxauto import WeChat

questionTXT = 'questions.txt'
answerTXT = 'answers.txt'

# Name of the WeChat contact to reply to
wxchat = 'User'

# Data preprocessing
questions = []
# Read the question data
with open(questionTXT, "r", encoding="utf-8") as f:
    for line in f:
        questions.append(line.strip())

questions = [jieba.lcut(q) for q in questions]
questions = [' '.join(q) for q in questions]

answers = []
# Read the answer data
with open(answerTXT, "r", encoding="utf-8") as f:
    for line in f:
        answers.append(line.strip())

# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(questions)

# Attach to the running WeChat client
wx = WeChat()

# Get the session list
wx.GetSessionList()

# Open the chat window for the target contact
wx.ChatWith(wxchat)

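# Poll for new messages from the target contact and reply to each one
while True:
    last = wx.GetLastMessage  # accessed as a property, as in the original code
    if last[0] == wxchat:
        msgs1 = last[1]
        # Segment the message the same way as the stored questions
        msgs = ' '.join(jieba.lcut(msgs1))
        test_X = vectorizer.transform([msgs])
        # Dot product of term counts against every stored question;
        # dense arrays are used because np.dot is not reliable on sparse matrices
        sims = np.dot(test_X.toarray(), X.toarray().T)
        index = int(sims.argmax())
        if answers[index] == '好的,请告诉我你想添加什么':
            # This answer ("OK, please tell me what you want to add") starts
            # the flow for teaching the bot a new question/answer pair
            wx.SendMsg(answers[index])
            while True:
                last = wx.GetLastMessage
                if last[0] == wxchat:
                    # Parse the user's follow-up message, expected as "Q:<question>A:<answer>"
                    msgs1 = last[1]
                    start_index = msgs1.find("Q:")
                    end_index = msgs1.find("A:")
                    if start_index == -1 or end_index < start_index:
                        # "Your input format is wrong, please tell me again."
                        wx.SendMsg('你的格式输入有误,请重新告诉我。')
                        break
                    new_question = msgs1[start_index + 2:end_index].strip()
                    new_answer = msgs1[end_index + 2:].strip()
                    # Append each entry on its own line so both files stay aligned
                    with open(questionTXT, "a", encoding="utf-8") as file:
                        file.write("\n" + new_question)
                    with open(answerTXT, "a", encoding="utf-8") as file:
                        file.write("\n" + new_answer)
                    questions.append(' '.join(jieba.lcut(new_question)))
                    answers.append(new_answer)
                    # Refit so the vocabulary includes the new question
                    vectorizer = CountVectorizer()
                    X = vectorizer.fit_transform(questions)
                    # "OK, I have recorded this dialogue."
                    wx.SendMsg('好的,我已经录入了此对话')
                    break
        else:
            wx.SendMsg(answers[index])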
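The "Q:...A:..." parsing used when the user teaches the bot a new pair can also be isolated into a small function that is easy to test without WeChat. This is a sketch under assumptions: the name `parse_qa_message` is hypothetical, and rejecting empty questions or answers is an added check that the code above does not perform.

```python
def parse_qa_message(message):
    """Split a message of the form 'Q:<question>A:<answer>'.

    Returns (question, answer), or None when the format is not satisfied.
    """
    start = message.find("Q:")
    if start == -1:
        return None
    # Only accept an "A:" marker that comes after the "Q:" marker
    end = message.find("A:", start + 2)
    if end == -1:
        return None
    question = message[start + 2:end].strip()
    answer = message[end + 2:].strip()
    if not question or not answer:
        return None
    return question, answer


print(parse_qa_message("Q:你好吗A:我很好"))  # ('你好吗', '我很好')
print(parse_qa_message("hello"))             # None
```

Only the ASCII markers "Q:" and "A:" are recognized; a message typed with full-width colons would be rejected as malformed.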

5. Experimental results

  This code interacts with the user through the WeChat client. After the user sends a question, the program finds the most similar stored question and replies with its corresponding answer. In the author's tests, the system answered the user's questions well and replied quickly.
  The test results are shown in the figure below:

[Figure: test results; image not included]


6. Summary

This code implements a question-answering system based on bag-of-words vectors. It answers user questions automatically through three steps: data preprocessing, feature extraction, and text matching. However, it is only a simple question-and-answer matching mechanism and still has many shortcomings, such as the inability to understand the user's intent and the inability to hold a multi-turn dialogue. In the future, the system could be improved by incorporating deep learning and other techniques.

Origin blog.csdn.net/weixin_57807777/article/details/129103346