Hybrid Search with Multiple Embeddings: A Fun and Furry Search for Cats! (Part 2)

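This is the follow-up to the previous article, "Hybrid Search with Multiple Embeddings: A Fun and Furry Search for Cats! (Part 1)". In this article, we use a local Elasticsearch deployment to run the whole demo. The demo is a simple Python web application that shows the different types of search available in Elastic: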

Lexical search
Vector search with text and image embeddings
Hybrid search that combines lexical and vector search

In this demo, besides using the existing code, we will also explore using the semantic_text field to implement the whole demo.

Preparation

Installation

If you have not yet installed your own Elasticsearch and Kibana, refer to the following articles to install them:

During installation, we can follow the Elastic Stack 8.x installation guide. In this post, I use Elastic Stack 8.15.3 for the demonstration.

While installing Elasticsearch, we need to note down the following information:


We note down the information above. It will be used in the configuration below.

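RRF (Reciprocal Rank Fusion), which merges the rankings from multiple retrieval paths, is a subscription feature, so we must enable the subscription features for it to work.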
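To make the idea behind RRF concrete, here is a minimal sketch in plain Python. It is not Elasticsearch's implementation; it only illustrates the commonly cited formula, in which each document receives 1 / (rank_constant + rank) from every ranked list it appears in. The function name `rrf_fuse` and the sample document IDs are hypothetical, and the rank constant of 60 is assumed as a typical default.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from best to worst match.
    Returns (doc_id, score) pairs sorted from best to worst fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes the most
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by descending score; break ties by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a lexical query and a vector query
lexical = ["cat_a", "cat_b", "cat_c"]
vector = ["cat_b", "cat_d", "cat_a"]

fused = rrf_fuse([lexical, vector])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists (here `cat_b` and `cat_a`) rise to the top, even though the raw scores of the two searches are never compared directly.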

Cloning the code

We download the code with the following command:

git clone https://github.com/jospdeleon/elasticats

Python must be installed. In the root directory of the code, we type the following commands:

$ pwd
/Users/liuxg/python/elasticats
$ python --version
Python 3.11.8
$ python -m venv .venv
$ ls -al
total 264
drwxr-xr-x   16 liuxg  staff    512 Nov  1 17:16 .
drwxr-xr-x@ 141 liuxg  staff   4512 Nov  1 17:02 ..
-rw-r--r--    1 liuxg  staff    329 Nov  1 17:02 .env-template
-rw-r--r--    1 liuxg  staff     34 Nov  1 17:02 .flaskenv
drwxr-xr-x   12 liuxg  staff    384 Nov  1 17:02 .git
-rw-r--r--    1 liuxg  staff   3192 Nov  1 17:02 .gitignore
drwxr-xr-x    6 liuxg  staff    192 Nov  1 17:16 .venv
-rw-r--r--    1 liuxg  staff   3135 Nov  1 17:02 README.md
-rw-r--r--    1 liuxg  staff  70441 Nov  1 17:02 Search flowchart.png
-rw-r--r--    1 liuxg  staff   5466 Nov  1 17:02 app.py
-rw-r--r--    1 liuxg  staff  17450 Nov  1 17:02 data.json
-rw-r--r--    1 liuxg  staff    121 Nov  1 17:02 notes.txt
-rw-r--r--    1 liuxg  staff    667 Nov  1 17:02 requirements.txt
-rw-r--r--    1 liuxg  staff   4561 Nov  1 17:02 search.py
drwxr-xr-x    5 liuxg  staff    160 Nov  1 17:02 static
drwxr-xr-x    5 liuxg  staff    160 Nov  1 17:02 templates

Next, we install the dependencies with the following commands:

$ source .venv/bin/activate
(.venv) $ pip3 install -r requirements.txt

Copying the certificate

We copy the Elasticsearch certificate into the current directory:

$ pwd
/Users/liuxg/python/elasticats
$ cp ~/elastic/elasticsearch-8.15.3/config/certs/http_ca.crt .
$ ls -w
README.md            app.py               http_ca.crt          requirements.txt     static
Search flowchart.png data.json            notes.txt            search.py            templates

The listing above shows that http_ca.crt has been copied into the current directory.

Modifying the files

We copy the .env-template file shown above to .env and modify it accordingly:

(.venv) $ cp .env-template .env

We edit .env with our favorite editor. Its content is as follows:

# Make a copy of this file with the name .env and assign values to variables

# Your Elastic Cloud credentials
export ES_USER="elastic"
export ES_PASSWORD="DgmQkuRWG5RQcodxwGxH"
export ES_ENDPOINT="localhost"
export OPENAI_API_KEY="YOUR_OPEN_AI_KEY"

# The name of the Elasticsearch index, you can change this
ES_INDEX=my-cats

You need to adjust these values to match your own setup.
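The .env file mixes lines that start with `export` and quote their values with a plain `KEY=value` line. As a rough illustration of how such a file maps to settings, the sketch below parses that format with the standard library only. The helper name `parse_env` is hypothetical, the sample values are placeholders, and the application itself may load the file through a different mechanism.

```python
def parse_env(text):
    """Parse simple .env content into a dict.

    Handles blank lines, '#' comments, an optional leading 'export ',
    and values wrapped in single or double quotes.
    """
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a KEY=value line
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder content in the same shape as the .env file above
sample = '''
# Your Elastic Cloud credentials
export ES_USER="elastic"
export ES_ENDPOINT="localhost"

ES_INDEX=my-cats
'''

settings = parse_env(sample)
print(settings)
```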

The original file connects to Elastic Cloud. In our demo we use a local Elasticsearch deployment instead, so we need to modify search.py:

search.py

        # self.es = Elasticsearch(cloud_id=os.environ['ELASTIC_CLOUD_ID'],
        #                         api_key=os.environ['ELASTIC_API_KEY'])
        
        elastic_user=os.getenv('ES_USER')
        elastic_password=os.getenv('ES_PASSWORD')
        elastic_endpoint=os.getenv("ES_ENDPOINT")
        
        url = f"https://{elastic_user}:{elastic_password}@{elastic_endpoint}:9200"
        self.es = Elasticsearch(url, ca_certs = "./http_ca.crt", verify_certs = True)

As shown above, we comment out the Elastic Cloud code and replace it with a connection to our own local deployment.
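One caveat with putting the credentials directly in the URL is that a password containing characters such as `@`, `:` or `/` produces an ambiguous URL. The sketch below shows the general percent-encoding technique using only the standard library; the helper name `build_es_url` and the sample password are hypothetical. The official Python client also accepts a `basic_auth=(user, password)` argument, which avoids URL encoding altogether.

```python
from urllib.parse import quote


def build_es_url(user, password, host, port=9200):
    """Build an https URL with percent-encoded credentials."""
    # safe="" forces reserved characters such as '@', ':' and '/' to be encoded
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"https://{safe_user}:{safe_password}@{host}:{port}"


# Hypothetical password that would break a naively formatted URL
url = build_es_url("elastic", "p@ss:word/1", "localhost")
print(url)
```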

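Because I am using Python 3.11, I also had to change the following two lines. Reusing the same quote character inside an f-string expression is only accepted from Python 3.12 onward, so the original lines raise a SyntaxError on 3.11: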

app.py

Line 113

 print(f"Total results: {results['hits']['total']['value']}")

Line 106

print (f"Search query: {search_params['query']}")

The code in the original repository is as follows:

print(f'Search query: {search_params['query']}')
print(f'Total results: {results['hits']['total']['value']}')
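The difference is easy to reproduce in isolation. The following sketch uses hypothetical stand-ins for `search_params` and `results` and shows the quoting style that works on Python 3.11 and earlier: double quotes for the f-string itself and single quotes for the dictionary keys inside the braces.

```python
# Hypothetical stand-ins for the values used in app.py
search_params = {"query": "fluffy persian"}
results = {"hits": {"total": {"value": 15}}}

# Outer double quotes with inner single quotes are valid on Python 3.11 and earlier
query_line = f"Search query: {search_params['query']}"
total_line = f"Total results: {results['hits']['total']['value']}"

print(query_line)
print(total_line)
```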

Writing data into Elasticsearch

Before running the application, you need to index the documents in data.json. A document in data.json looks like this:

  {
    "cat_id": "70417071",
    "name": "Luke & ( Leia)felv+",
    "url":"https://www.petfinder.com/cat/luke-leiafelv-70417071/va/herndon/fancy-cats-rescue-team-va145/",
    "summary": "Hello, I'm Luke Skywalker, your future feline companion. My tale is a magical one. I was just a regular cat, but one night, under the full moon, I discovered I could speak human language. Startled, I ran away, finding myself here. I'm curious, smart, sweet, and friendly, not to mention a bit goofy and brave. My best friend, Princess Leia, is here too. We're a playful, cuddly, energetic duo who love adventures. I promise to fill your life with purrs, laughter, and endless love. I may not be a Jedi, but I can surely be the hero of your heart.",
    "age": "Adult",
    "gender": "Male",
    "size": "Medium",
    "coat":"Short",
    "breed":"Abyssinian",
    "photo":"images/Abyssinian/70417071.jpeg"
  }
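To give a feel for what indexing such documents involves, here is a minimal sketch that turns documents of this shape into bulk actions in the dictionary format accepted by the `elasticsearch.helpers.bulk` helper. This is an illustration, not the repository's reindex code: the function name `to_bulk_actions` is hypothetical, and using `cat_id` as the document `_id` is an assumption.

```python
import json


def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document (assumes each doc has a 'cat_id')."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["cat_id"],  # assumption: reuse cat_id as the document ID
            "_source": doc,
        }


# A shortened sample in the same shape as data.json
sample_json = """
[
  {"cat_id": "70417071", "name": "Luke & ( Leia)felv+", "breed": "Abyssinian"}
]
"""

docs = json.loads(sample_json)
actions = list(to_bulk_actions(docs, "my-cats"))
print(actions)
```

With a connected client, a list like `actions` could then be passed to `elasticsearch.helpers.bulk(client, actions)`.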

We run the following command inside the .venv environment:

flask reindex

(.venv) $ flask reindex
modules.json: 100%|████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 150kB/s]
config_sentence_transformers.json: 100%|███████████████████████████████████| 116/116 [00:00<00:00, 648kB/s]
README.md: 100%|██████████████████████████████████████████████████████| 1.91k/1.91k [00:00<00:00, 10.2MB/s]
0_CLIPModel/special_tokens_map.json: 100%|████████████████████████████████| 389/389 [00:00<00:00, 1.26MB/s]
0_CLIPModel/tokenizer_config.json: 100%|██████████████████████████████████| 604/604 [00:00<00:00, 1.68MB/s]
0_CLIPModel/preprocessor_config.json: 100%|███████████████████████████████| 316/316 [00:00<00:00, 4.95MB/s]
0_CLIPModel/config.json: 100%|█████████████████████████████████████████| 4.03k/4.03k [00:00<00:00, 106MB/s]
0_CLIPModel/merges.txt: 100%|████████████████████████████████████████████| 525k/525k [00:00<00:00, 616kB/s]
0_CLIPModel/vocab.json: 100%|████████████████████████████████████████████| 961k/961k [00:01<00:00, 928kB/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████| 605M/605M [00:35<00:00, 17.0MB/s]
Connected to Elasticsearch!
Traceback (most recent call last):██████████████████████████████████████| 605M/605M [00:35<00:00, 17.4MB/s]

Once the command completes, 15 documents have been written to Elasticsearch. We can check them in Kibana:

GET my-cats/_mapping

The command above returns:

{
  "my-cats": {
    "mappings": {
      "properties": {
        "age": {
          "type": "keyword"
        },
        "breed": {
          "type": "keyword"
        },
        "cat_id": {
          "type": "keyword"
        },
        "coat": {
          "type": "keyword"
        },
        "gender": {
          "type": "keyword"
        },
        "img_embedding": {
          "type": "dense_vector",
          "dims": 512,
          "index": true,
          "similarity": "cosine",
          "index_options": {
            "type": "int8_hnsw",
            "m": 16,
            "ef_construction": 100
          }
        },
        "name": {
          "type": "text"
        },
        "photo": {
          "type": "keyword"
        },
        "size": {
          "type": "keyword"
        },
        "summary": {
          "type": "text"
        },
        "summary_embedding": {
          "type": "dense_vector",
          "dims": 384,
          "index": true,
          "similarity": "cosine",
          "index_options": {
            "type": "int8_hnsw",
            "m": 16,
            "ef_construction": 100
          }
        },
        "url": {
          "type": "keyword"
        }
      }
    }
  }
}

The mapping above contains two dense_vector fields: img_embedding and summary_embedding.
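Both vector fields are configured with "similarity": "cosine". As a small illustration of what that metric measures, the sketch below computes cosine similarity in plain Python on toy vectors. The second helper reflects the documented convention that Elasticsearch converts cosine similarity into a non-negative score as (1 + cosine) / 2; the function names and the two-dimensional vectors are hypothetical and chosen for readability only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_score(a, b):
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2


# Toy 2-dimensional vectors; the real fields use 512 and 384 dimensions
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))
```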

GET my-cats/_count

GET my-cats/_search

Running the application

We can now run and test the application:

(.venv) $> flask run

(.venv) $ flask run
Connected to Elasticsearch!
{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'WH5NAJ8DRxO39VVTv6caLQ',
 'name': 'liuxgm.local',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-10-09T22:08:00.328917561Z',
             'build_flavor': 'default',
             'build_hash': 'f97532e680b555c3a05e73a74c28afb666923018',
             'build_snapshot': False,
             'build_type': 'tar',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.3'}}
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
 * Restarting with stat
Connected to Elasticsearch!
{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'WH5NAJ8DRxO39VVTv6caLQ',
 'name': 'liuxgm.local',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-10-09T22:08:00.328917561Z',
             'build_flavor': 'default',
             'build_hash': 'f97532e680b555c3a05e73a74c28afb666923018',
             'build_snapshot': False,
             'build_type': 'tar',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.3'}}
 * Debugger is active!
 * Debugger PIN: 682-962-783

The output above shows that the server is running at http://127.0.0.1:5000. We open this address in a browser:

First, without selecting any options, we click the Submit button directly. The search returns 15 results, which are all of our cats.

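Next, you can select any of the filters, combine them with any text in the description field, or upload a similar image of a cat. Note: there is currently a pagination issue. When you run a subsequent search (search after), the results do not start from the first page. As a workaround, use the "Reset" button before each search you want to test.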

In the example above, we selected

Persian: 72378135_2 (Garth) - multiple cats in the pic

The results show that multiple images were matched.
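To sketch how filters, description text and a vector can be combined into a single hybrid request, the following example builds a search body that uses the `retriever` syntax with `rrf` available in recent 8.x releases. It is a hedged illustration, not the query built by the demo's search.py, which may be structured differently. The function name `build_hybrid_body`, the parameter defaults and the toy query vector are assumptions; the field names `summary`, `summary_embedding` and `breed` come from the mapping shown earlier.

```python
def build_hybrid_body(text, query_vector, filters=None, k=10, num_candidates=50):
    """Build a hybrid (lexical + kNN) search body fused with RRF.

    filters: optional dict of keyword field -> exact value, e.g. {"breed": "Persian"}.
    """
    filter_clauses = [
        {"term": {field: value}} for field, value in (filters or {}).items()
    ]

    # Lexical retriever: full-text match on the summary, restricted by the filters
    lexical = {
        "standard": {
            "query": {
                "bool": {
                    "must": [{"match": {"summary": text}}],
                    "filter": filter_clauses,
                }
            }
        }
    }

    # Vector retriever: approximate kNN on the text embedding field
    vector = {
        "knn": {
            "field": "summary_embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    if filter_clauses:
        vector["knn"]["filter"] = filter_clauses

    return {"retriever": {"rrf": {"retrievers": [lexical, vector]}}}


# Toy 3-dimensional vector for readability; the real field expects 384 dimensions
body = build_hybrid_body("playful and cuddly", [0.1, 0.2, 0.3], {"breed": "Persian"})
print(body)
```

Applying the same filter to both retrievers keeps the two result lists comparable before RRF merges them.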