Principle of inverted index


1. Introduction

The inverted index arose from the practical need to look up records by the value of one of their attributes. Each entry in such an index table contains an attribute value together with the addresses of all records that have that value. Because the entries are organized by attribute value rather than by record, so that the attribute value determines the positions of the records instead of the records determining the value, it is called an inverted index. A file indexed this way is called an inverted index file, or inverted file for short.

An inverted file (inverted index) indexes the words of a document or document collection, storing the locations where those words occur in a document or group of documents. It is one of the most commonly used mechanisms for indexing documents and document collections.

Building an inverted index is a key step for a search engine. An inverted index is generally organized as a keyword followed by its frequency (the number of times it appears) and its locations (which articles or web pages it appears in, along with related information such as date and author). It is effectively an index over the hundreds of billions of web pages on the Internet, much like a book's table of contents and tabs: a reader who wants a chapter on a certain topic can jump straight to the relevant pages, with no need to scan the book page by page from first to last.

 

2. Lucene inverted index principle

Lucene is a high-performance, open-source Java full-text search engine toolkit. It is not a complete full-text search engine but rather a full-text search architecture, providing a complete query engine and indexing engine and a partial text-analysis engine. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text retrieval engine on top of it.

Lucene uses an inverted-file index structure. The structure and the corresponding generation algorithm are as follows:

Suppose there are two articles, 1 and 2:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.

 

<1> Get keywords

Since Lucene indexes and queries by keyword, we first need to extract the keywords of these two articles, which usually requires the following steps:

a. What we have is the article content, i.e., a string. We must first find all the words in the string, that is, perform word segmentation. English is easy to segment because words are separated by spaces; Chinese characters run together and require special word-segmentation handling.
b. Words such as "in", "once", and "too" carry no real meaning, and Chinese function words such as "的" (de) and "是" (shi) usually carry no concrete meaning either. Such words do not represent concepts and can be filtered out.
c. A user searching for "He" usually also wants articles containing "he" and "HE", so all words are converted to lowercase.
d. A user searching for "live" usually also wants articles containing "lives" and "lived", so "lives" and "lived" are reduced to the stem "live".
e. Punctuation usually does not represent a concept and can likewise be filtered out.

In Lucene, the steps above are performed by the Analyzer class. After this processing:

All keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2 are: [he] [live] [shanghai]
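Steps a through e can be sketched in a few lines of Python. This is a toy illustration, not Lucene's actual Analyzer: the stop-word list and stemming table below are assumptions chosen just for these two sentences, and the function name `analyze` is hypothetical.

```python
import re

# Toy stop-word list and stemming table, chosen for this example only.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Apply steps a-e: tokenize, lowercase, stem, drop stop words."""
    tokens = re.findall(r"[a-zA-Z]+", text)            # a + e: split on non-letters, discarding punctuation
    tokens = [t.lower() for t in tokens]               # c: normalize case
    tokens = [STEMS.get(t, t) for t in tokens]         # d: crude stemming
    return [t for t in tokens if t not in STOP_WORDS]  # b: remove stop words

print(analyze("Tom lives in Guangzhou, I live in Guangzhou too."))
# -> ['tom', 'live', 'guangzhou', 'i', 'live', 'guangzhou']
print(analyze("He once lived in Shanghai."))
# -> ['he', 'live', 'shanghai']
```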

 

<2> Create an inverted index

With the keywords in hand, we can build the inverted index. The correspondence above runs from "article number" to "all keywords in the article". The inverted index reverses this relationship into "keyword" to "all article numbers containing that keyword".

Articles 1 and 2 then become:

Keyword          Article No.
guangzhou        1
he               2
i                1
live             1,2
shanghai         2
tom              1
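The inversion itself is a small dictionary-building loop. The sketch below uses plain Python dictionaries as a stand-in for the index structure; the variable names are my own.

```python
# Keyword lists produced by the analysis step, keyed by article number.
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

# Invert the mapping: keyword -> set of article numbers containing it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

for term in sorted(index):
    print(term, sorted(index[term]))
```

Printing the terms in sorted order reproduces the table above, e.g. `live [1, 2]`.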

Usually it is not enough to know which articles a keyword appears in; we also need to know how many times and where it appears within each article. There are generally two kinds of position:

a. Character position, i.e., recording which character of the article the word starts at (the advantage is that the keyword can be located quickly when highlighting matches);

b. Keyword position, i.e., recording which keyword of the article the word is (the advantages are that it saves index space and makes phrase queries fast); this is what Lucene records.

After adding the "occurrence frequency" and "occurrence location" information, our index structure becomes:

Keyword Article No. [Frequency of Appearance] Location of Appearance
guangzhou            1[2]                     3,6
he                   2[1]                     1
i                    1[1]                     4
live                 1[2]                     2,5
                     2[1]                     2
shanghai             2[1]                     3
tom                  1[1]                     1

Take the "live" line as an example: live appears twice in article 1 and once in article 2, so what does the position list "2,5,2" mean? It has to be interpreted together with the article numbers and frequencies. Since live appears twice in article 1, "2,5" gives the two positions where live appears in article 1; since it appears once in article 2, the remaining "2" means that live is the second keyword of article 2.
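The enriched structure can be built by recording, for each keyword, the positions at which it occurs in each article; the frequency then falls out as the length of each position list. This is a minimal sketch with invented variable names, not Lucene's on-disk format.

```python
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

# keyword -> {article number -> [keyword positions, counted from 1]}
index = {}
for doc_id, terms in docs.items():
    for pos, term in enumerate(terms, start=1):
        index.setdefault(term, {}).setdefault(doc_id, []).append(pos)

# The frequency is simply the length of each position list.
print(index["live"])                                   # -> {1: [2, 5], 2: [2]}
print({d: len(p) for d, p in index["live"].items()})   # -> {1: 2, 2: 1}
```

Flattening `[2, 5]` and `[2]` into "2,5,2" is exactly why the frequencies are needed to decode the position list.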

The above is the core of Lucene's index structure. Note that the keywords are stored in sorted (character) order rather than in a B-tree structure, so Lucene can locate a keyword quickly with a binary search.

 

<3> Implementation

In the implementation, Lucene saves the three columns above as a dictionary file (term dictionary), a frequency file (frequencies), and a position file (positions). The dictionary file stores not only each keyword but also pointers into the frequency and position files, through which the keyword's frequency and position information can be found.
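The three-file split can be imitated with a sorted dictionary holding offsets into flat frequency and position arrays. This is a simplified sketch: the names mirror Lucene's files, but the layout is invented for illustration (a real format would also store each term's document count alongside the offsets).

```python
postings = {
    "guangzhou": {1: [3, 6]},
    "he": {2: [1]},
    "i": {1: [4]},
    "live": {1: [2, 5], 2: [2]},
    "shanghai": {2: [3]},
    "tom": {1: [1]},
}

dictionary = []   # (term, offset into frequencies, offset into positions)
frequencies = []  # flat list of (article number, frequency) pairs
positions = []    # flat list of position lists, parallel to frequencies

for term in sorted(postings):
    dictionary.append((term, len(frequencies), len(positions)))
    for doc_id, pos_list in sorted(postings[term].items()):
        frequencies.append((doc_id, len(pos_list)))
        positions.append(pos_list)

# Look up "live": find its dictionary entry, then follow the two offsets.
# (The "+ 2" hardcodes live's document count for brevity.)
entry = next(e for e in dictionary if e[0] == "live")
print(frequencies[entry[1]:entry[1] + 2])  # -> [(1, 2), (2, 1)]
print(positions[entry[2]:entry[2] + 2])    # -> [[2, 5], [2]]
```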

Lucene uses the concept of a field to express where information is located (e.g., in the title, in the body, in the URL). During indexing, field information is also recorded in the dictionary file; each keyword carries field information, because every keyword must belong to one or more fields.

 

<4> Compression algorithm

To reduce the size of the index file, Lucene also uses compression techniques for the index.

First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix> relative to the previous one. For example, if the previous word is "arab" and the current word is "arabic", then "arabic" is compressed to <4, ic>.
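This prefix compression is easy to sketch; the helper names below are my own, and real term dictionaries add further refinements (such as blocks with periodic full terms).

```python
def prefix_compress(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, remaining suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out

def prefix_decompress(entries):
    """Rebuild the original terms from the (prefix length, suffix) pairs."""
    out, prev = [], ""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

print(prefix_compress(["arab", "arabic"]))  # -> [(0, 'arab'), (4, 'ic')]
```

The scheme works because the dictionary is sorted, so adjacent terms tend to share long prefixes.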

Second, numbers are compressed heavily: a number is stored only as its difference from the previous value, which shortens the number and thus reduces the bytes needed to store it. For example, if the current article number is 16389 (3 bytes uncompressed) and the previous article number is 16382, then after compression only the difference, 7, is stored (using a single byte).
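Delta encoding of a sorted list of article numbers looks like this; the function names are illustrative, and in practice the small deltas would then be written with a variable-byte encoding so that values under 128 really do fit in one byte.

```python
def delta_encode(doc_ids):
    """Store each article number as the difference from the previous one."""
    prev, out = 0, []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def delta_decode(deltas):
    """Recover the original article numbers by running sums."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

print(delta_encode([16382, 16389]))  # -> [16382, 7]
```

Because postings lists are sorted, the deltas are small even when the absolute numbers are large.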

 

<5> Why use an index

Finally, by walking through a query against this index, we can explain why building an index is worthwhile.

Suppose we query the word "live". Lucene first binary-searches the dictionary, finds the word, reads out all the article numbers by following the pointer into the frequency file, and returns the result. The dictionary is usually very small, so the entire process takes on the order of milliseconds.
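The lookup step can be sketched with a binary search over the sorted dictionary; here `terms` and `doc_ids` are parallel lists standing in for the dictionary file and its postings pointers, and `lookup` is a hypothetical helper.

```python
import bisect

terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]  # sorted dictionary
doc_ids = [[1], [2], [1], [1, 2], [2], [1]]                  # postings per term

def lookup(word):
    """Binary-search the sorted dictionary, then follow the pointer to the postings."""
    i = bisect.bisect_left(terms, word)
    if i < len(terms) and terms[i] == word:
        return doc_ids[i]
    return []

print(lookup("live"))  # -> [1, 2]
```

A binary search touches only about log2(N) dictionary entries, which is why even a huge term dictionary can be probed almost instantly.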

By contrast, an ordinary sequential-matching algorithm without an index must string-match against the content of every article. This process is quite slow, and when the number of articles is large the time is often unbearable.
