Object-oriented Python (Part 2): How to implement a search engine?

We assume that the sample corpus already exists on the local disk. For convenience, we will only search five files, whose contents are shown below:


# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Let's first define the SearchEngineBase base class.


class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')

def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

SearchEngineBase is meant to be inherited; each subclass represents a different engine (algorithm). Every engine must implement the process_corpus() and search() methods, which correspond to the indexer and the retriever we just mentioned. The main() function provides the searcher and the user interface, so a simple wrapper gives us a working program.
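As an aside (not part of the original lesson), the standard library's abc module can express the same "interface" idea so that a forgotten override fails at instantiation rather than at the first call; a minimal sketch:

```python
from abc import ABC, abstractmethod

class AbstractSearchEngine(ABC):
    # A variant of SearchEngineBase: abc turns the "raise Exception"
    # pattern into an error at instantiation time rather than call time.
    @abstractmethod
    def process_corpus(self, id, text):
        ...

    @abstractmethod
    def search(self, query):
        ...

class Incomplete(AbstractSearchEngine):
    pass  # forgets to implement the abstract methods

try:
    Incomplete()  # raises TypeError before any method is ever called
except TypeError as e:
    print('TypeError:', e)
```

The `raise Exception` approach in the article works the same way in spirit, but only complains when the missing method is actually called.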

Looking at this code more closely:

  • The add_corpus() function reads the file's content, uses the file path as the ID, and passes both to process_corpus().
  • process_corpus() must process the text and save the processed result, keyed by the file path as its ID. The processed result is called an index.
  • search() takes a query, processes it, retrieves results through the index, and returns them.

Next, let's implement the most basic working search engine:


class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)


########## Output ##########


simple
found 0 result(s):
little
found 2 result(s):
1.txt
2.txt

Let's take a look at this code:

SimpleEngine is a subclass of SearchEngineBase that implements the process_corpus() and search() interfaces; at the same time, it inherits the add_corpus() function (though overriding it is also possible if you want to), so main() can call it directly.

In the new constructor, self.__id_to_texts = {} initializes its own private variable: the dictionary that maps file names to file contents.

The process_corpus() function simply inserts the file's content into the dictionary. Note that the ID must be unique; otherwise, new content with the same ID will overwrite the old content.
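The overwrite behavior can be seen in two lines (the IDs here are made up for illustration):

```python
index = {}
index['1.txt'] = 'old content'
index['1.txt'] = 'new content'  # same ID: the old value is silently replaced
print(len(index), index['1.txt'])  # 1 new content
```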

search() directly enumerates the dictionary, looking for the query string in each text. If it is found, the ID is appended to the result list, which is returned at the end.

You see, isn't it simple? The whole process follows object-oriented thinking. Here are a few questions for you to think over as a small review:

  • Are you now clearer about the order and manner in which the constructors of the parent class and subclass are called?
  • How is a function overridden during inheritance?
  • How does the base class act as an interface? (Try deleting an overridden function in the subclass, or changing its parameters, and see what errors are raised.)
  • How are methods and variables connected to each other?

Okay, let's get back to the topic of search engines.

I believe you can also see that this implementation, while simple, is obviously very inefficient: it uses a lot of space, because the indexing step does nothing to compress the text; and each retrieval takes a lot of time, because every file in the index has to be scanned again. If the amount of text in the corpus is n, then both the time complexity and the space complexity here are O(n).

Moreover, there is another problem: the query here can only be a single word, or several consecutive words. If you want to search for multiple words that are scattered in different places in an article, our simple engine is powerless.

How should it be optimized at this time?

The most straightforward idea is to split the corpus into words, so that for each article we only need to store the set of its words. According to Zipf's law, in a natural-language corpus the frequency of a word is inversely proportional to its rank in the frequency table, following a power-law distribution. Therefore, splitting the corpus into words can greatly improve our storage and search efficiency.

Bag of Words and Inverted Index

Let's first implement a search model called Bag of Words. Please look at the following code:


import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results
    
    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        # Use a regular expression to strip punctuation and newlines
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Generate a list of all words
        word_list = text.split(' ')
        # Remove empty strings
        word_list = filter(None, word_list)
        # Return a set of the words
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)


########## Output ##########


i have a dream
found 3 result(s):
1.txt
2.txt
3.txt
freedom children
found 1 result(s):
5.txt

Let's first understand a concept: the BOW model, short for bag-of-words model. It is one of the most common and simplest models in the NLP field.

It assumes that a text, without considering grammar, syntax, paragraphs, or the order in which words appear, can be treated simply as a collection of its words. Correspondingly, we replaced id_to_texts with id_to_words, so we only need to store these words instead of the full article, and we don't need to consider their order.

The process_corpus() function calls the static method parse_text_to_words() to break an article into a bag of words, and stores the resulting set in the dictionary.

The search() function is slightly more complicated. Here we assume that the desired result is that every search keyword must appear in the same article. So we also break the query into a set, and then check, for each article in the index, whether all the query's words appear in it. The static method query_match() is responsible for this check.

Note that both of these functions are stateless: they do not touch the object's private variables (there is no self parameter), and the same input always produces the same output. Therefore they are declared static, which also makes them easy to reuse from other classes.
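To see what "static" buys us, such a helper can be called straight off the class with no instance; a standalone sketch (the class name Tokenizer is mine, not the article's):

```python
import re

class Tokenizer:
    # A standalone copy of the article's static helper, to show that a
    # @staticmethod is called directly on the class, with no instance
    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w ]', ' ', text).lower()
        return set(filter(None, text.split(' ')))

print(Tokenizer.parse_text_to_words('I have a Dream!'))
```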

However, even so, every query still has to traverse all the IDs. Although this saves a lot of time compared with the simple model, the web has hundreds of millions of pages, and traversing all of them for every query is still far too expensive. How can we optimize at this point?

You may have noticed that each query usually contains only a few words, a dozen or so at most. Could we start from there?

Furthermore, the bag-of-words model does not consider word order, but some users want the words to appear in order, or want the matched words to be close together in the text. For such queries the bag-of-words model is powerless.

On these two points, can we do better? Obviously we can; please look at the next piece of code.


import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)
        
        # If any query word is missing from the inverted index, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []
        
        result = []
        while True:
            
            # First, collect the element at the current position of every inverted list
            current_ids = []
            
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]
                
                # We have reached the end of one inverted list; end the search
                if current_index >= len(current_inverted_list):
                    return result

                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are equal, every query word appears in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue
            
            # Otherwise, advance the pointer of the smallest element by one
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # Use a regular expression to strip punctuation and newlines
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Generate a list of all words
        word_list = text.split(' ')
        # Remove empty strings
        word_list = filter(None, word_list)
        # Return a set of the words
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)


########## Output ##########


little
found 2 result(s):
1.txt
2.txt
little vicious
found 1 result(s):
2.txt

First of all, I want to emphasize that you are not required to fully understand this algorithm; the implementation involves some knowledge beyond this chapter. But please don't back away because of that. This example shows how object-oriented programming can isolate the complexity of an algorithm while leaving the interface and the rest of the code unchanged.

Let's look at this code next. As you can see, the new model keeps the previous interface and still only modifies the three functions __init__(), process_corpus(), and search().

This is in fact how teams in large companies cooperate. With a reasonable layered design, the logic of each layer only needs to handle its own concerns. While we iterate on the search-engine kernel, the main() function and the user interface remain unchanged. Of course, if the company hires new front-end engineers to modify the user interface, the newcomers don't need to worry much about the back end either; they only need to get the data interaction right.

Continuing with the code, you may have noticed the term Inverted Index at the top. The inverted index model is a very classic search-engine technique, which I will briefly introduce next.

An inverted index, as the name suggests, means that this time we turn things around and keep a word -> ids dictionary. Then everything suddenly becomes clear: when searching, we only need to fetch the inverted list of each query word separately and find their common elements. Those common elements, i.e. the IDs, are exactly the query results we want. This way we avoid the embarrassment of traversing the entire index.
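The "find the common elements" step can be sketched with plain set intersection (a simplification of the merge the article uses; the toy index below is hand-built for illustration):

```python
# A toy inverted index (word -> list of IDs), hand-built for illustration
inverted_index = {
    'little': ['1.txt', '2.txt'],
    'vicious': ['2.txt'],
}

def search_by_intersection(query_words):
    # Intersect the posting lists; only IDs containing every word survive
    result = set(inverted_index[query_words[0]])
    for word in query_words[1:]:
        result &= set(inverted_index[word])
    return sorted(result)

print(search_by_intersection(['little', 'vicious']))  # ['2.txt']
```

Set intersection is fine for small posting lists; the merge-based version below scales better when the lists are long and sorted.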

process_corpus() builds the inverted index. Note that the code here is quite streamlined; in an industrial setting, a unique-ID generator would be needed to tag each article with a distinct ID, and the inverted lists should also be sorted by that unique ID.

As for the search() function, you can probably guess what it does. It fetches the inverted lists for all of query_words; if any list does not exist, some query word appears in no article, so it returns an empty result directly. Once it has all the lists, it runs a "merge K sorted arrays" algorithm to extract the IDs we want and returns them.

Note that the algorithm used here is not optimal; the optimal implementation uses a min-heap to hold the list positions.
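For reference, a heap-based version of the K-way merge could look like this (my sketch, not the article's code; it assumes each posting list is sorted and duplicate-free):

```python
import heapq

def intersect_sorted(posting_lists):
    # Min-heap k-way merge over sorted, duplicate-free posting lists.
    # An ID present in every list appears exactly k times in a row
    # in the merged stream, so we just count runs.
    k = len(posting_lists)
    result = []
    current, count = None, 0
    for doc_id in heapq.merge(*posting_lists):
        if doc_id == current:
            count += 1
        else:
            current, count = doc_id, 1
        if count == k:
            result.append(doc_id)
    return result

print(intersect_sorted([['1.txt', '2.txt'], ['2.txt', '5.txt']]))  # ['2.txt']
```

heapq.merge keeps only one cursor per list in a min-heap internally, which is exactly the structure the optimal solution needs.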

The traversal problem is solved. The second question remains: what if we want the search words to appear in order, or want them to be close together in the text?

We would need the inverted index to also keep the position of each word within each article, so that positions can be checked during the merge operation.
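A sketch of what such a positional index might look like (the helper name and document contents are mine, for illustration):

```python
import re

def build_positional_index(docs):
    # Sketch: also record each word's position within the article,
    # so phrase and proximity queries can be answered during the merge
    index = {}
    for doc_id, text in docs.items():
        words = re.sub(r'[^\w ]', ' ', text).lower().split()
        for pos, word in enumerate(words):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

index = build_positional_index({'1.txt': 'I have a dream', '2.txt': 'a dream today'})
print(index['dream'])  # {'1.txt': [3], '2.txt': [1]}
```

During the merge, checking whether two words form a phrase reduces to checking whether their position lists contain adjacent values.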

LRU and multiple inheritance

At this point, your search engine is finally online and getting more and more traffic (QPS). While happy and proud, you find the server a bit "overwhelmed". After some investigation, you discover that a large number of repeated queries account for more than 90% of the traffic, so you think of a big gun: add a cache to the search engine.

So, in this last part, let's talk about caching and multiple inheritance.


import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)
    
    def has(self, key):
        return key in self.cache
    
    def get(self, key):
        return self.cache[key]
    
    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)
    
    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)
        
        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)


########## Output ##########


little
found 2 result(s):
1.txt
2.txt
little
cache hit!
found 2 result(s):
1.txt
2.txt

The code is very simple: LRUCache defines a cache class whose methods can be called by inheriting from it. The LRU cache is a very classic cache (and implementing LRU is a common algorithm interview question at Silicon Valley companies; for simplicity, I use the pylru package directly). It exploits the principle of locality: recently used objects are kept available, while objects that have not been used for a long time are gradually evicted.

Using the cache is therefore also very simple: call has() to check whether the query is in the cache; if it is, call get() to return the result directly; if not, hand the query to the underlying engine for computation, and then stuff the result into the cache.
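As an aside (not in the original), when the cached callable is a pure function, the standard library's functools.lru_cache decorator achieves the same pattern without any inheritance:

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def cached_search(query):
    # Stand-in for the real engine's search(); the body only runs on a miss
    return query.upper()

cached_search('little')             # miss: computed
cached_search('little')             # hit: served from the cache
print(cached_search.cache_info())   # hits=1, misses=1, among other fields
```

The class-based approach in the article is still the better fit when the cache needs to wrap a method with object state, as search() does here.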

We can see that the class BOWInvertedIndexEngineWithCache inherits from two classes. First, pay attention to its constructor (have you thought about the question from the last lesson?). There are two ways to initialize under multiple inheritance; let's look at them separately.

The first method uses the following line of code to directly initialize the first parent class of the class:


super(BOWInvertedIndexEngineWithCache, self).__init__()

However, this method requires that the top-most class of the inheritance chain inherit from object (which every class does in Python 3; the caveat matters only for Python 2's old-style classes).
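To make the "first parent" wording concrete, here is a small toy diamond (the names A through D are mine) showing the method resolution order that super() follows:

```python
class A:
    def __init__(self):
        print('A.__init__')
        super().__init__()

class B(A):
    def __init__(self):
        print('B.__init__')
        super().__init__()

class C(A):
    def __init__(self):
        print('C.__init__')
        super().__init__()

class D(B, C):
    def __init__(self):
        print('D.__init__')
        super().__init__()

D()  # cooperative super() runs each __init__ exactly once: D, B, C, A
print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
```

When every class in the chain calls super().__init__(), each constructor runs exactly once, in MRO order, which is why this style is preferred when all classes cooperate.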

The second method: under multiple inheritance, if there are multiple constructors to call, we must call each of them explicitly in the traditional way, e.g. LRUCache.__init__(self).

Second, note that search() is overridden again by the subclass BOWInvertedIndexEngineWithCache, yet inside it we still need to call the search() of BOWInvertedIndexEngine. What should we do? Please look at the following line of code:


super(BOWInvertedIndexEngineWithCache, self).search(query)

It lets us explicitly call the overridden method of the parent class.

In this way, we have implemented caching concisely without touching the code of BOWInvertedIndexEngine.


Origin blog.csdn.net/qq_41485273/article/details/114104640