C#: Principles and Implementation Code for Full-Text Retrieval, Without Lucene

 Document Management Series Technical Articles

Core technologies and difficulties of document management systems: https://blog.csdn.net/beijinghorn/article/details/122426112
PB-level full-text retrieval (distributed) solution - HyperSearch: https://blog.csdn.net/beijinghorn/article/details/122377760

Overview

 Full-text search is the core function of document management systems. 

There are many ways to implement full-text retrieval, but building an "inverted index" over the text is the mainstream approach and offers high efficiency. Lucene, and Elasticsearch (ES) built on top of it, are the best-known examples; although they started long ago and parts of their technology are dated, they still have good room for development. This article implements full-text retrieval based on an inverted index with very little code, so you can experience the idea through a small toy.
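Before walking through the steps, it helps to picture the core data structure. In its simplest positional form, an inverted index maps each term (in this article, a single character) to the list of (document id, position) pairs where it occurs. A minimal illustration in C#; the names below are mine, not the article's:

// Illustrative only: the shape of a positional inverted index.
// e.g. "full" -> [ (doc 0, pos 12), (doc 3, pos 7), ... ]
Dictionary<string, List<(int DocId, int Position)>> invertedIndex =
    new Dictionary<string, List<(int DocId, int Position)>>();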

1. Basic process of full-text retrieval

Implementing full-text retrieval requires three core steps:

(1) Construction: obtain the data records and build an inverted index over their text;

(2) Usage: for each search term, fetch its postings from the inverted index and intersect the sets;

(3) Display: build the displayed results from the final index entries;

For beginners, understanding these principles is enough to grasp full-text retrieval; a minimal sketch of the three steps follows.
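The sketch below implements the three steps for in-memory documents with single-character terms. It is a simplified stand-in for the Windows Forms program later in the article; all names are illustrative, not the article's.

using System;
using System.Collections.Generic;

static class TinyFullTextDemo
{
    // Step 1 (construction): build a positional inverted index over the documents.
    static Dictionary<char, List<(int Doc, int Pos)>> BuildIndex(string[] docs)
    {
        var index = new Dictionary<char, List<(int Doc, int Pos)>>();
        for (int d = 0; d < docs.Length; d++)
        {
            for (int p = 0; p < docs[d].Length; p++)
            {
                char c = char.ToLower(docs[d][p]);
                if (char.IsWhiteSpace(c) || char.IsPunctuation(c)) continue;
                if (!index.TryGetValue(c, out var postings))
                {
                    postings = new List<(int Doc, int Pos)>();
                    index[c] = postings;
                }
                postings.Add((d, p));
            }
        }
        return index;
    }

    // Step 2 (usage): fetch the postings of each query character and intersect them,
    // keeping only hits whose positions are consecutive within the same document.
    static List<(int Doc, int Pos)> Search(Dictionary<char, List<(int Doc, int Pos)>> index, string query)
    {
        var result = new List<(int Doc, int Pos)>();
        for (int i = 0; i < query.Length; i++)
        {
            if (!index.TryGetValue(char.ToLower(query[i]), out var postings))
                return new List<(int Doc, int Pos)>();          // a term is missing: no match at all
            if (i == 0) { result.AddRange(postings); continue; }
            var next = new List<(int Doc, int Pos)>();
            foreach (var p in postings)
                foreach (var r in result)
                    if (p.Doc == r.Doc && p.Pos == r.Pos + 1) next.Add(p);
            result = next;
            if (result.Count == 0) break;
        }
        return result;
    }

    static void Main()
    {
        string[] docs = { "full-text retrieval demo", "another document" };
        var index = BuildIndex(docs);
        // Step 3 (display): report which document contains the query and where it ends.
        foreach (var hit in Search(index, "text"))
            Console.WriteLine($"doc {hit.Doc}, ending at position {hit.Pos}");
    }
}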

But for application software, a more complete process is needed.

2. Advanced process of full-text search

The advanced process is divided into a "system construction period" and an "application period".

System construction period:

(1) Obtain the original data records and build an inverted index over their text;

(2) Save all index information to files so the application can load it later (a minimal persistence sketch appears after the application-period steps below);

System application period:

(1) Read index information;

(2) Add, delete, or modify data records, and add, delete, or modify the corresponding index information synchronously or asynchronously, saving it back to the index file. To keep the system responsive, the asynchronous form is generally used. Large-scale systems implement distributed storage and distributed search through application servers.

(3) Segment the search query into terms and apply grammatical processing;

(4) For each search term, fetch its postings from the inverted index and perform intersection, union, and difference operations on the sets;

(5) Build the result display from the final index entries.
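For step (2) of the construction period, one straightforward way to persist the index is a plain text file with one line per term listing its (document id, position) postings, read back into memory at startup. A minimal sketch, assuming the index shape from the earlier example; the file format and names are my own, not the article's (a real system would also need to escape the separator characters):

using System.Collections.Generic;
using System.IO;

static class IndexFile
{
    // Save: one line per term, e.g.  "t<TAB>0,5;0,8;1,3"
    public static void Save(string path, Dictionary<char, List<(int Doc, int Pos)>> index)
    {
        using var w = new StreamWriter(path);
        foreach (var entry in index)
            w.WriteLine(entry.Key + "\t" +
                string.Join(";", entry.Value.ConvertAll(p => p.Doc + "," + p.Pos)));
    }

    // Load: parse the same format back into the in-memory structure.
    public static Dictionary<char, List<(int Doc, int Pos)>> Load(string path)
    {
        var index = new Dictionary<char, List<(int Doc, int Pos)>>();
        foreach (string line in File.ReadLines(path))
        {
            string[] parts = line.Split('\t');
            if (parts.Length != 2 || parts[0].Length == 0) continue;
            var postings = new List<(int Doc, int Pos)>();
            foreach (string pair in parts[1].Split(';'))
            {
                string[] dp = pair.Split(',');
                postings.Add((int.Parse(dp[0]), int.Parse(dp[1])));
            }
            index[parts[0][0]] = postings;
        }
        return index;
    }
}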

3. Demonstration source code of the experimental full-text retrieval system

Here is the complete source code.

using System;
using System.IO;
using System.Text;
using System.Collections;
using System.Collections.Generic;
using System.Windows.Forms;

namespace WindowsFormsApp3
{
    public partial class Form1 : Form
    {
        string sourceFolder = String.Empty;                          // folder containing the .txt source files
        Hashtable hashIndex = new Hashtable();                       // term (single character) -> List<IndexInfo>
        List<DocumentInfo> documentList = new List<DocumentInfo>();  // document id -> file name
        public Form1()
        {
            InitializeComponent();
        }
        private void Form1_Load(object sender, EventArgs e)
        {
        }
        private void button1_Click(object sender, EventArgs e)
        {
            sourceFolder = Path.Combine(Application.StartupPath, @"Text"); // index every file in the Text subfolder next to the executable
            DirectoryInfo root = new DirectoryInfo(sourceFolder);
            FileInfo[] xfiles = root.GetFiles();
            hashIndex.Clear();
            documentList.Clear();
            StringBuilder sb = new StringBuilder();
            foreach (FileInfo xfile in xfiles)
            {
                DocumentInfo dx = new DocumentInfo(documentList.Count, xfile.FullName);
                documentList.Add(dx);
                CreateIndex(dx);
                sb.AppendLine(dx.Filename + "<br>");
            }
            button1.Enabled = false;
            button2.Visible = true;
            textBox1.Visible = true;
            sb.AppendLine("索引创建完成!<br>");
            webBrowser1.DocumentText = sb.ToString();
        }
        private void CreateIndex(DocumentInfo docInfo)
        {
            // Build the index for one text file: populate the inverted index table.
            string buf = File.ReadAllText(docInfo.Filename);
            buf = buf.Replace('\u3000', ' ').ToLower(); // normalize full-width spaces to regular spaces (one-for-one, so positions stay aligned), then lower-case
            for (int i = 0; i < buf.Length; i++)
            {
                string bs = buf.Substring(i, 1).Trim();   // one character is one indexing term
                if (bs.Length == 0) continue;             // skip whitespace
                if (Char.IsPunctuation(bs[0])) continue;  // skip punctuation
                if (hashIndex.ContainsKey(bs))
                {
                    List<IndexInfo> index = (List<IndexInfo>)hashIndex[bs];
                    index.Add(new IndexInfo(docInfo.Id, i));
                }
                else
                {
                    List<IndexInfo> index = new List<IndexInfo>();
                    index.Add(new IndexInfo(docInfo.Id, i));
                    hashIndex.Add(bs, index);
                }
            }
        }
        private void button2_Click(object sender, EventArgs e)
        {
            string queryString = textBox1.Text.Trim().ToLower();
            if (queryString.Length == 0) return;
            List<IndexInfo> result = new List<IndexInfo>();
            for (int i = 0; i < queryString.Length; i++)
            {
                string bs = queryString.Substring(i, 1).Trim();
                if (bs.Length == 0) continue;
                if (!hashIndex.ContainsKey(bs)) { result.Clear(); break; } // a query character that appears nowhere means no match at all
                List<IndexInfo> rx = (List<IndexInfo>)hashIndex[bs];
                if (result.Count == 0)
                {
                    result.AddRange(rx);
                }
                else
                {
                    // Slow when the data set is large; brute-force matching is used here only to illustrate the principle.
                    List<IndexInfo> ry = new List<IndexInfo>();
                    for (int j = 0; j < rx.Count; j++)
                    {
                        for (int k = 0; k < result.Count; k++)
                        {
                            if (result[k].Id == rx[j].Id)
                            {
                                if (rx[j].Position == (result[k].Position + 1))
                                {
                                    ry.Add(rx[j]);
                                }
                            }
                        }
                    }
                    result = ry;
                    if (result.Count == 0) { break; }
                }
            }
            if (result.Count == 0) { webBrowser1.DocumentText = "No results!"; return; }
            // Sort hits by position, ascending
            result.Sort(delegate (IndexInfo a, IndexInfo b) { return Comparer<int>.Default.Compare(a.Position, b.Position); });
            webBrowser1.DocumentText = ResultShow(queryString,result);
        }
        private string ResultShow(string queryString, List<IndexInfo> result)
        {
            int left_span = 16;
            int span_length = 256;
            Hashtable hx = new Hashtable();
            StringBuilder sb = new StringBuilder();
            foreach (IndexInfo record in result)
            {
                if (hx.ContainsKey(record.Id) == false)
                {
                    DocumentInfo dx = documentList.Find(t => t.Id == record.Id);
                    string buf = File.ReadAllText(dx.Filename).Replace('\u3000', ' '); // same whitespace normalization as CreateIndex so positions line up
                    sb.AppendLine("<a href=''><h2>" + dx.Filename.Substring(sourceFolder.Length + 1) + "</h2></a>");
                    int s1 = record.Position < left_span ? 0 : record.Position - left_span;
                    int s2 = Math.Min(s1 + span_length, buf.Length); // end of the snippet to display
                    sb.AppendLine(buf.Substring(s1, s2 - s1) + "<br><br>");
                    hx.Add(record.Id, true);
                }
            }
            return sb.ToString().Replace(queryString, "<font color=red>" + queryString + "</font>");
        }
    }
    public class DocumentInfo
    {
        public int Id { get; set; } = 0;
        public string Filename { get; set; } = String.Empty;
        public DocumentInfo(int id, string filename) { Id = id; Filename = filename; }
    }
    public class IndexInfo
    {
        public int Id { get; set; } = 0;
        public int Position { get; set; } = 0;
        public IndexInfo(int id, int position) { Id = id; Position = position; }
    }
}

The demo supports searching Chinese text (indexed character by character) and English words or fragments, case-insensitively, because everything is lower-cased. If the ToLower() calls are removed, searches become case-sensitive.

One more note:

After reading the code you may ask: how can full-text search work without word segmentation?

The author's view: in an era when new words appear every day, when information is increasingly fragmented, and when storage, memory, and computing power are cheap and abundant, word segmentation adds little value. In particular, he regards research and development of word segmentation for double-byte Unicode scripts such as Chinese as a waste of time and paper.

 ——————————————————————

POWER BY 315SOFT.COM &
TRUFFER.CN
