Vector Search with ClickHouse - Part 2


Reviewer: Zhuang Xiaodong (Weizhuang)

This article was first published on the public account [ClickHouseInc]


Introduction

This blog post continues our series on vector search. Building on the previous post, which provided an overview of vector search, its relationship to historical inverted-index-based approaches, the use cases where it currently adds value, and some advanced implementation approaches, this article explores ClickHouse's relationship to vector search in detail through practical examples and answers the question: "When should I use ClickHouse for vector search?"

In our examples, we use a ClickHouse Cloud cluster with 60 CPU cores and 240GB of RAM per node.

When should I use ClickHouse for vector searches?

ClickHouse is a real-time OLAP database that supports full SQL and provides a range of functions to help users write analytical queries. Some of these functions and data structures perform distance operations between vectors, allowing ClickHouse to be used as a vector database.

Thanks to a fully parallelized query pipeline, ClickHouse can process vector search operations very quickly, especially when performing an exact match by linearly scanning all rows, providing processing speeds comparable to those of a dedicated vector database.

High compression levels, tunable through custom compression codecs, make it possible to store and query very large datasets. ClickHouse is not memory-bound, allowing multi-terabyte datasets containing embeddings to be queried.

The ability to calculate the distance between two vectors is just another SQL function, and can be effectively combined with the more traditional SQL filtering and aggregation capabilities. This allows vectors to be stored and queried alongside metadata and even rich text, supporting a wide variety of use cases and applications.

Finally, ClickHouse’s experimental features, such as approximate nearest neighbor (ANN) indexing, support faster approximate vector matching and provide a promising development that is expected to further enhance ClickHouse’s vector matching capabilities.

In short, ClickHouse is an effective vector search platform when one of the following scenarios is met:

1. You want to combine vector matching with metadata filtering and/or aggregation or join capabilities.

2. You need to perform linear distance matching on a very large vector dataset and want to parallelize and distribute this work across many CPU cores without any additional work or configuration.

3. You need to match a vector data set of a size where relying solely on in-memory indexing is not feasible due to cost or hardware availability.

4. You will benefit from full SQL support when querying vectors.

5. You already have an embedding generation pipeline that produces your vectors, and do not need this capability to be natively supported by the storage engine.

6. You already have the relevant data in ClickHouse and don’t want to incur the overhead and cost of learning another tool for millions of vectors.

7. You mainly need fast, parallelized exact matching of your vectors, and do not need a production implementation of ANN.

8. You are an experienced or curious ClickHouse user and trust us to improve our vector matching capabilities and want to be a part of this journey.

While this covers a wide range of use cases, in some situations ClickHouse may be less suitable as a vector storage engine, and you may wish to consider alternatives such as Faiss or a dedicated vector database. ClickHouse may not be the right fit if:

1. Your vector dataset is small and easily fits in memory. While ClickHouse can easily handle vector search over small datasets, it may be more than you actually need.

2. You have no metadata attached to the vectors and need only distance matching and sorting. If there is no need to join vector search results with other metadata, and your dataset is small, then, as mentioned above, ClickHouse may be more than you really need.

3. You have a very high QPS, greater than a few thousand per second. Typically, for these use cases, the dataset will fit in memory and require a few milliseconds of matching time. While ClickHouse can serve these use cases, a simple in-memory index may be sufficient.

4. You need a solution that includes embedding generation capabilities, where the model is integrated at both insertion and query time. Vector databases, such as Weaviate, are specifically designed for this use case and may be better suited for these needs.

With this in mind, let’s explore ClickHouse’s vector capabilities.

Set up an example

LAION dataset

As we discussed in a previous article, vector search operates on embeddings - vectors that represent contextual meaning. Embeddings are generated by passing raw content, such as images or text, through a pre-trained machine learning model.

For this article, we use a pre-generated set of embeddings: the LAION 5 billion test set, which is publicly available for download. We chose this dataset because, at the time of writing, we believe it is the largest publicly available dataset of precomputed embeddings suitable for testing. It consists of embeddings of billions of public internet images and their captions, generated by crawling the public internet. In addition to allowing us to test vector search at scale, it also includes metadata that helps illustrate how ClickHouse's general-purpose analytics capabilities can be combined with vector search.

In the LAION dataset, an embedding was generated for each image and for its associated caption - two embeddings per object. For this article, we focus only on the English subset, which contains 2.2 billion objects. Although each object has two embeddings, one for the image and one for the caption, we store each pair as a single row in ClickHouse, giving us a total of about 2.2 billion rows and 4.4 billion vectors. For each row, we include the metadata as columns, capturing information such as image dimensions and the similarity between the image and caption embeddings. This similarity, a cosine similarity, allows us to identify objects where the caption and image are not conceptually aligned, which can be filtered out at query time.

We wish to acknowledge the original authors' efforts in curating this dataset and generating embeddings for public use. We recommend reading the complete process for generating this dataset, which overcame some challenging data engineering problems, such as downloading and resizing billions of images efficiently, in a reasonable amount of time and at an acceptable cost.

Generating embeddings with the CLIP model

These LAION embeddings were generated using the ViT-L/14 model, trained by LAION with openCLIP, an open-source implementation of OpenAI's CLIP model. This is not a cheap process: for around 400 million images, it took about 30 days and required 592 V100 GPUs.

CLIP (Contrastive Language-Image Pre-training) is a multi-modal model, meaning it is designed to train on multiple related types of data, such as images and associated text. CLIP has shown good results in OCR, geolocation and action recognition. For image encoding, the authors of CLIP used ResNet-50 and the Vision Transformer (ViT); for text encoding, a transformer similar to GPT-2. The resulting embeddings are represented as two separate sets of vectors.

The key result of the training process is that the embeddings for the two data types are comparable - if the vectors for an image and a caption are close, they can be considered conceptually similar. A good model, such as CLIP, will produce embeddings that are close in distance, or with a cosine similarity close to 1, for an image vector and its associated caption vector. This is illustrated in the image below, where T1 is the embedded representation of the caption of the first image and I1 is the encoding of the image itself. During training, we therefore want to maximize the diagonal of this matrix, where our images and text coincide.

As a post-processing step, the authors discarded images whose cosine similarity to their text caption was less than 0.28, thereby filtering out likely poor-quality results where the caption and image were not aligned. Further filtering on image size, caption length, possible illegality, and duplicates reduced the total dataset from over 5 billion to 2.2 billion objects.

picture

Image source: https://openai.com/research/clip

Prepare data for loading

The LAION dataset can be downloaded from multiple sources. After selecting the English subset, we used the version hosted on Hugging Face. This service relies on Git Large File Storage (LFS), which requires a client to be installed in order to download files. Once installed, downloading the data requires a single command. Make sure you have at least 20TB of free disk space.

git lfs install
git clone https://huggingface.co/datasets/laion/laion2b-en-vit-l-14-embeddings

The download includes three folders: two contain the embeddings in npy format (effectively a multidimensional array format) for the images and captions, and the third contains Parquet files with the metadata for each image and caption pair.

ubuntu@ip-172-31-2-70:/data$ ls -l ./laion2b-en-vit-l-14-embeddings
total 456
drwxrwxr-x 2 ubuntu ubuntu  77824 May 16 12:28 img_emb
drwxrwxr-x 2 ubuntu ubuntu 110592 May 16 12:27 metadata
drwxrwxr-x 2 ubuntu ubuntu 270336 May 16 12:28 text_emb

To load this data into ClickHouse, we want to produce a single row for each embedding pair, enriched with the metadata. This requires a process that merges the corresponding embeddings and metadata for each object. Considering that vectors in ClickHouse can be represented as arrays of floats, the result of this process might be a JSON row like the following:

{
 "key": "196060024",
 "url": "https://cdn.shopify.com/s/files/1/1194/1070/products/[email protected]?v=1477414012",
 "caption": "MERCEDES BENZ G65 RIDE-ON TOY CAR WITH PARENTAL REMOTE |  CHERRY",
 "similarity": 0.33110910654067993,
 "width": "220",
 "height": "147",
 "original_width": "220",
 "original_height": "147",
 "status": "success",
 "NSFW": "UNLIKELY",
 "exif": {
   "Image Orientation": "Horizontal (normal)",
   "Image XResolution": "72",
   "Image YResolution": "72",
   "Image ResolutionUnit": "Pixels/Inch",
   "Image YCbCrPositioning": "Centered",
   "Image ExifOffset": "102",
   "EXIF ExifVersion": "0210",
   "EXIF ComponentsConfiguration": "YCbCr",
   "EXIF FlashPixVersion": "0100",
   "EXIF ColorSpace": "Uncalibrated",
   "EXIF ExifImageWidth": "220",
   "EXIF ExifImageLength": "147"
 },
 "text_embedding": [
   0.025299072265625,
   ...
   -0.031829833984375
 ],
 "image_embedding": [
   0.0302276611328125,
   ...
   -0.00667572021484375
 ]
}

The complete code for preprocessing the dataset can be found here. The final 2,313 Parquet files generated by this process occupy approximately 5.9TB of disk space. We combined these into a 6TB Parquet dataset that users can download and use to reproduce the examples.

Storing vectors in ClickHouse

Loading the generated Parquet files into ClickHouse requires a few simple steps.

Schema and loading process

The following shows our table schema, where the embeddings are stored as Array(Float32)  columns.

CREATE TABLE laion
(
  `_file` LowCardinality(String),
  `key` String,
  `url` String,
  `caption` String,
  `similarity` Float64,
  `width` Int64,
  `height` Int64,
  `original_width` Int64,
  `original_height` Int64,
  `status` LowCardinality(String),
  `NSFW` LowCardinality(String),
  `exif` Map(String, String),
  `text_embedding` Array(Float32),
  `image_embedding` Array(Float32),
  `orientation` String DEFAULT exif['Image Orientation'],
  `software` String DEFAULT exif['Image Software'],
  `copyright` String DEFAULT exif['Image Copyright'],
  `image_make` String DEFAULT exif['Image Make'],
  `image_model` String DEFAULT exif['Image Model']
)
ENGINE = MergeTree
ORDER BY (height, width, similarity)

The exif column contains metadata we can later use for filtering and aggregation. We map it as Map(String, String) for flexibility and schema concision. This column contains over 100,000 unique meta tags. Accessing a subkey requires all of the keys to be loaded from the column, which can slow down some queries, so we use the DEFAULT syntax to extract five properties of interest to the root of the schema for later analysis. For users interested in all of the available meta-properties, the following query can be used to identify the available Map keys and their frequencies:

SELECT
  arrayJoin(mapKeys(exif)) AS keys,
  count() AS c
FROM laion
GROUP BY keys
ORDER BY c DESC
LIMIT 10

Our schema also includes a _file column, identifying the original Parquet file that produced each row. This allows us to restart the load for a specific file if it fails during insertion into ClickHouse.

For future use, we load this data into a public S3 bucket. To insert this data into ClickHouse, the user can execute the following query:

INSERT INTO laion SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/laion/*.parquet')

This is a lot of data to load, and an unoptimized load may take several hours. We recommend batching the loading process to avoid interruptions such as network connectivity issues. Glob patterns can be used to target specific subsets, e.g. s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/laion/00*.parquet'). The _file column can be used to reconcile any loading issues by confirming the count in ClickHouse against the count in the original Parquet files.
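As an illustrative sketch (not taken from the original post), the per-file row counts can be compared on both sides using the _file column, which is also available as a virtual column on the s3 table function:

-- Rows loaded into ClickHouse, per source Parquet file
SELECT _file, count() AS rows_loaded
FROM laion
GROUP BY _file
ORDER BY _file;

-- Rows present in the source Parquet files on S3
SELECT _file, count() AS rows_in_source
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/laion/*.parquet')
GROUP BY _file
ORDER BY _file;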

For the following examples, we create tables of various sizes, with the suffix indicating the number of rows; for example, laion_100m contains 100 million rows. These tables are created with the appropriate glob pattern:

INSERT INTO laion_sample (_file, key, url, caption, similarity, width, height, original_width, original_height, status, NSFW, exif, text_embedding, image_embedding) SELECT
    _file,
    key,
    url,
    caption,
    similarity,
    width,
    height,
    original_width,
    original_height,
    status,
    NSFW,
    exif,
    text_embedding,
    image_embedding
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/laion/*.parquet')

Storage performance and compression

ClickHouse's column-oriented structure means column values are sorted and written sequentially, and the clustering of identical and similar values on disk typically yields high compression ratios. ClickHouse also offers several codecs, allowing users to tune the configuration to the properties of their data. For arrays of floating point numbers, high compression is harder to achieve, since the embedding values have no domain-agnostic properties to exploit: the full 32-bit range is used, and for most codecs the relationship between adjacent values in an embedding is effectively random. We therefore recommend the ZSTD codec for compressing embeddings. Below we show the compression ratio of the vector columns in four tables of increasing size: 1m, 10m, 100m and 2b rows.

SELECT
  table,
  name,
  formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
  formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
  round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE (table IN ('laion_100m', 'laion_1m', 'laion_10m', 'laion_2b')) AND (name IN ('text_embedding', 'image_embedding'))
GROUP BY
  table,
  name
ORDER BY table DESC

┌─table──────┬─name────────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
│ laion_1m   │ text_embedding  │ 1.60 GiB      │ 2.50 GiB         │  1.56 │
│ laion_1m   │ image_embedding │ 1.61 GiB      │ 2.50 GiB         │  1.55 │
│ laion_10m  │ text_embedding  │ 18.36 GiB     │ 28.59 GiB        │  1.56 │
│ laion_10m  │ image_embedding │ 18.36 GiB     │ 28.59 GiB        │  1.56 │
│ laion_100m │ text_embedding  │ 181.64 GiB    │ 286.43 GiB       │  1.58 │
│ laion_100m │ image_embedding │ 182.29 GiB    │ 286.43 GiB       │  1.57 │
│ laion_1b   │ image_embedding │ 1.81 TiB      │ 2.81 TiB         │  1.55 │
│ laion_1b   │ text_embedding  │ 1.81 TiB      │ 2.81 TiB         │  1.55 │
└────────────┴─────────────────┴─────────────────┴───────────────────┴───────┘

6 rows in set. Elapsed: 0.006 sec.

Although compression ratios are usually influenced by the choice of primary key, this constant ratio of 1.56 is unlikely to be affected by how the data is sorted. The ZSTD codec's compression level can also be increased from its default of 1 in ClickHouse Cloud; raising it to 3 improves the ratio to 1.71 on a 10 million row sample, an improvement of around 10%.
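For reference, a table with the higher codec level can be declared as sketched below. Only the two embedding columns are shown with an explicit codec; the remaining columns and the ORDER BY follow the earlier schema, and the exact definition used for laion_10m_zstd_3 may differ:

CREATE TABLE laion_10m_zstd_3
(
  `url` String,
  `caption` String,
  `similarity` Float64,
  `width` Int64,
  `height` Int64,
  -- remaining metadata columns as in the laion table above
  `text_embedding` Array(Float32) CODEC(ZSTD(3)),
  `image_embedding` Array(Float32) CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (height, width, similarity)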

SELECT
  table,
  name,
  formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
  formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
  round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE (table IN ('laion_10m_zstd_3')) AND (name IN ('text_embedding', 'image_embedding'))
GROUP BY
  table,
  name
ORDER BY table DESC

┌─table────────────┬─name────────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
│ laion_10m_zstd_3 │ text_embedding  │ 16.68 GiB        │ 28.56 GiB          │  1.71 │
│ laion_10m_zstd_3 │ image_embedding │ 16.72 GiB        │ 28.56 GiB          │  1.71 │
└──────────────────┴─────────────────┴─────────────────┴───────────────────┴───────┘

2 rows in set. Elapsed: 0.026 sec.

Higher ZSTD levels slow down compression and data insertion, although decompression speed should remain fairly constant (around a 20% variance).

Compression of floating point numbers is an area of research, and there are several lossy, quantization-based candidates, such as the SZ algorithm, that could be added to ClickHouse. Another option is to reduce the precision of the floating point values to 16 bits. We discuss this in the "Improving compression" section below.

Search for vectors in ClickHouse

As we introduced in Part 1 of this series, performing a vector search means comparing the input vector to a library of vectors to find the closest match.

The input vector represents a concept of interest - in our case either an encoded image or a caption. The vector library represents the other images, and their captions, that we wish to compare against.

When searching, vectors are compared to determine proximity or distance. Two vectors that are close in distance represent similar concepts. In a set, the two closest vectors are the most conceptually similar.

Choose a distance function

Given the high dimensionality of vectors, there are many ways to compare distances. These different mechanisms are called distance functions.

ClickHouse supports a variety of distance functions - you can choose the one that works best for you based on your use case. In this article, we focus on two that are very commonly used in vector searches:

  • Cosine distance - cosineDistance(vector1, vector2) - This computes the cosine distance between two vectors, i.e. 1 minus their cosine similarity. The cosine similarity is the cosine of the angle between the vectors - the dot product divided by the product of their lengths - and falls between -1 and 1, where 1 means the two embeddings are proportional and therefore conceptually identical. A column name and an input embedding can be passed for a vector search. This function is especially relevant if the vectors are not normalized, and its bounded range makes it useful for filtering.

  • L2 distance - L2Distance(vector1, vector2) - This measures the L2 distance between two points. Effectively, this is the Euclidean distance between the two input vectors, i.e. the length of the line between the points they represent. The shorter the distance, the more conceptually similar the source objects.

Both functions compute a score used to compare vector embeddings. For our pre-trained CLIP model, the L2 distance is the most appropriate distance function, based on the internal scoring used by the official examples.
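As a quick illustration of the two functions (toy 3-dimensional vectors rather than our CLIP embeddings), both can be evaluated directly in SQL:

SELECT
  -- proportional vectors: cosine similarity is 1, so the cosine distance is 0
  cosineDistance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) AS cosine_dist,
  -- Euclidean distance: sqrt(1 + 4 + 9), roughly 3.74
  L2Distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) AS l2_dist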

To see the complete list of available distance and vector normalization functions, look here. We'd love to hear how you leverage these to search your embeddings!

Generate input vector

Now that we have determined the distance function to use, we need to transform the input (the image or caption we want to search for) into a vector embedding.

This requires us to call the CLIP model. This is easy to achieve with a simple Python script. Installation instructions for the dependencies required by this script can be found here. We show this script below:

#!/usr/bin/python3
import argparse
from PIL import Image
import clip
import torch

if __name__ == '__main__':
  parser = argparse.ArgumentParser(
      prog='generate',
      description='Generate CLIP embeddings for images or text')
  group = parser.add_mutually_exclusive_group(required=True)
  group.add_argument('--text', required=False)
  group.add_argument('--image', required=False)
  parser.add_argument('--limit', default=1)
  parser.add_argument('--table', default='laion_1m')
  args = parser.parse_args()
  device = "cuda" if torch.cuda.is_available() else "cpu"
  print(f"using {device}")
  device = torch.device(device)
  model, preprocess = clip.load("ViT-L/14")
  model.to(device)
  images = []
  if args.text:
      inputs = clip.tokenize(args.text)
      with torch.no_grad():
          print(model.encode_text(inputs)[0].tolist())
  elif args.image:
      image = preprocess(Image.open(args.image)).unsqueeze(0).to(device)
      with torch.no_grad():
          print(model.encode_image(image)[0].tolist())

This version of the script accepts either text or an image path as input and writes the embedding to the command line. Note that the script will take advantage of a CUDA-compatible GPU if one is present. This can drastically reduce generation time - when testing on a 2021 Mac M1, generating embeddings for 100 captions took around 6 seconds, compared to 1 second on a p3.2xlarge with 1 GPU.

As an example, let's convert the text "a sleepy ridgeback dog" into an embedding. For brevity, we've truncated the embedding output; the full result can be found here.

python generate.py --text "a sleepy ridgeback dog"

[0.5736801028251648, 0.2516217529773712, ...,  -0.6825592517852783]

We now have a vector embedding representing the text "a sleepy ridgeback dog". This is our search input vector. We can now compare this input vector against our library of vector embeddings to find images, and their captions, that represent conceptually similar things.

Putting it all together

The following query searches for conceptually similar embeddings, ordered by distance. The embeddings are stored in the image_embedding column, and the computed distance is returned as score. We also filter out any rows where the stored image/caption similarity is below 0.2, to reduce noise.

SELECT
  url,
  caption,
  L2Distance(image_embedding, [0.5736801028251648, 0.2516217529773712, ...,  -0.6825592517852783]) AS score
FROM laion_10m WHERE similarity >= 0.2
ORDER BY score ASC
LIMIT 2
FORMAT Vertical

Row 1:
──────
url:   https://thumb9.shutterstock.com/image-photo/stock-photo-front-view-of-a-cute-little-young-thoroughbred-african-rhodesian-ridgeback-hound-dog-puppy-lying-in-450w-62136922.jpg
caption: Front view of a cute little young thoroughbred African Rhodesian Ridgeback hound dog puppy lying in the woods outdoors and staring.
score:   12.262665434714496

Row 2:
──────
url:   https://m.psecn.photoshelter.com/img-get2/I0000_1Vigovbi4o/fit=180x180/fill=/g=G0000x325fvoXUls/I0000_1Vigovbi4o.jpg
caption: SHOT 1/1/08 3:15:27 PM - Images of Tanner a three year-old male Vizsla sleeping in the sun on the couch in his home in Denver, Co. The Hungarian Vizsla, is a dog breed originating in Hungary. Vizslas are known as excellent hunting dogs, and also have a level personality making them suited for families. The Vizsla is a medium-sized hunting dog of distinguished appearance and bearing. Robust but rather lightly built, they are lean dogs, have defined muscles, and are similar to a Weimaraner but smaller in size. The breed standard calls for the tail to be docked to two-thirds of its original length in smooth Vizslas and to three-fourths in Wirehaired Vizslas..(Photo by Marc Piscotty/ (c) 2007)
score:   12.265194306913513

2 rows in set. Elapsed: 1.595 sec. Processed 9.92 million rows, 32.52 GB (6.22 million rows/s., 20.38 GB/s.)

The results show that, within this dataset, our input "a sleepy ridgeback dog" is conceptually closest to a photo of an African Rhodesian Ridgeback, and also conceptually very similar to an image of a sleeping hound.

picture

My dog Kibo

To further demonstrate the usefulness of these models, instead of using text we can start the search from a photo of a sleeping dog. We generate an input vector representing this photo and then search for conceptually similar results.

To do this, we repeat the query above, but using the text_embedding column. The full embedding can be found here.

python generate.py --image images/ridgeback.jpg

[0.17179889976978302, 0.6171532273292542, ...,  -0.21313616633415222]
SELECT
  url,
  caption,
  L2Distance(text_embedding, [0.17179889976978302, ..., -0.21313616633415222]
) AS score
FROM laion_10m WHERE similarity >= 0.2
ORDER BY score ASC
LIMIT 2
FORMAT Vertical

Row 1:
──────
url:   https://i.pinimg.com/236x/ab/85/4c/ab854cca81a3e19ae231c63f57ed6cfe--submissive--year-olds.jpg
caption: Lenny is a 2 to 3 year old male hound cross, about 25 pounds and much too thin. He has either been neglected or on his own for a while. He is very friendly if a little submissive, he ducked his head and tucked his tail a couple of times when I...
score:   17.903361349936052

Row 2:
──────
url:   https://d1n3ar4lqtlydb.cloudfront.net/c/a/4/2246967.jpg
caption: American Pit Bull Terrier/Rhodesian Ridgeback Mix Dog for adoption in San Clemente, California - MARCUS = Quite A Friendly Guy!
score:   17.90681696075351

2 rows in set. Elapsed: 1.516 sec. Processed 9.92 million rows, 32.52 GB (6.54 million rows/s., 21.45 GB/s.)

For convenience, we provide a simple result generator, search.py, which encodes the incoming image or text, executes the query, and renders the results to a local HTML file, which is then automatically opened in the browser. The result for the above query is shown below:

python search.py search --image images/ridgeback.jpg --table laion_10m

picture

In both examples, we matched embeddings across modalities, i.e. the embedding obtained from an image input was matched against the text_embedding column, and vice versa. This is consistent with the original model training described previously and is the intended application. Matching an input embedding against the column of the same type has also been explored, although previous attempts have yielded mixed results.
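For completeness, a same-modality query - matching an image-derived input embedding against the image_embedding column - looks identical apart from the column used. A sketch, with the embedding elided:

SELECT
  url,
  caption,
  L2Distance(image_embedding, [<image input embedding>]) AS score
FROM laion_10m
WHERE similarity >= 0.2
ORDER BY score ASC
LIMIT 2
FORMAT Vertical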

Advantages of SQL

When doing vector searches in practice, we often do more than just search across embeddings. Often, there is additional utility in combining metadata for searching, filtering, or aggregation.

Filter based on metadata

For example, suppose we want to perform a vector search over only non-copyrighted images. Such a query combines a vector search with a filter on the copyright metadata.

As another example, suppose we want to restrict our search to large images - at least 300px by 500px - where the caption/image similarity meets a higher threshold of 0.3. For this example, we search for "great animal migrations". Fortunately, formulating this as a SQL query is simple. Below, we run this query against 100 million images.

SELECT
  url,
  caption,
  L2Distance(image_embedding, [<embedding>]) AS score
FROM laion_100m
WHERE (width >= 300) AND (height >= 500) AND (copyright = '') AND similarity > 0.3
ORDER BY score ASC
LIMIT 10
FORMAT Vertical


Row 1:
──────
url:   https://aentcdn.azureedge.net/graphics/items/sdimages/a/500/3/6/5/4/1744563.jpg
caption: Great Migrations
width:   366
height:  500
score:   16.242750635008512

Row 2:
──────
url:   https://naturefamiliesdotorg.files.wordpress.com/2017/01/on-the-move.jpg?w=418&h=557
caption: on-the-move
width:   384
height:  512
score:   16.26983713529263

10 rows in set. Elapsed: 2.010 sec. Processed 6.82 million rows, 22.52 GB (3.39 million rows/s., 11.20 GB/s.)

This query highlights the benefits of using SQL and metadata to limit vector comparisons to a subset. In this particular example, we queried over 100 million vectors, but due to our metadata, the actual distance matches were reduced to less than 7 million.

For convenience, we have also added the ability to pass additional filters in search.py, allowing us to verify the quality of the above matches:

python search.py search --filter "(width >= 300) AND (height >= 500) AND (copyright = '') AND similarity > 0.3" --text "great animal migrations"

picture

Aggregation using metadata

In addition to filtering, we can also perform aggregation operations on metadata. As a column-oriented database, ClickHouse is ideally suited for this task.

For example, suppose we want to determine the most popular camera models used for safari pictures. We run the following query:

WITH results AS
  (
      SELECT
          image_make,
          image_model,
          L2Distance(image_embedding, [<embedding>]) AS score
      FROM laion_100m
      WHERE (image_make != '') AND (image_model != '')
      ORDER BY score ASC
      LIMIT 1000
  )
SELECT
  image_make,
  image_model,
  count() AS c
FROM results
GROUP BY
  image_make,
  image_model
ORDER BY c DESC
LIMIT 10

┌─image_make────────┬─image_model───────────┬──c─┐
│ Canon           │ Canon EOS 7D        │ 64 │
│ Canon           │ Canon EOS-1D X      │ 51 │
│ Canon           │ Canon EOS 5D Mark III │ 49 │
│ NIKON CORPORATION │ NIKON D700          │ 26 │
│ NIKON CORPORATION │ NIKON D800          │ 24 │
│ Canon           │ Canon EOS 5D Mark II  │ 23 │
│ NIKON CORPORATION │ NIKON D810          │ 23 │
│ NIKON CORPORATION │ NIKON D7000         │ 21 │
│ Canon           │ Canon EOS 40D       │ 18 │
│ Canon           │ Canon EOS 60D       │ 17 │
└───────────────────┴───────────────────────┴────┘

10 rows in set. Elapsed: 23.897 sec. Processed 100.00 million rows, 286.70 GB (4.18 million rows/s., 12.00 GB/s.)

Clearly, if you plan to go on safari, Canon should be your camera of choice. Note that here we only aggregate over the first 1000 results. Unlike cosine distance, which is bounded, Euclidean distance has no upper limit, which makes setting a distance threshold challenging.
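One way around this, sketched below purely as an illustration rather than as part of the original workflow, is to filter on the bounded cosineDistance instead; the 0.8 cutoff is arbitrary and would need tuning for a real dataset:

WITH results AS
  (
      SELECT
          image_make,
          image_model,
          cosineDistance(image_embedding, [<embedding>]) AS score
      FROM laion_100m
      WHERE (image_make != '') AND (image_model != '') AND (score < 0.8)
      ORDER BY score ASC
      LIMIT 1000
  )
SELECT
  image_make,
  image_model,
  count() AS c
FROM results
GROUP BY
  image_make,
  image_model
ORDER BY c DESC
LIMIT 10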

Using inverted indexes

Note: Inverted indexing is an experimental feature in ClickHouse.

ClickHouse's experimental secondary indexing feature can also be very useful when working with vectors.

For example, we might want to restrict our safari images to only those containing lions. To do this, we impose a token restriction, requiring the caption column to contain the string 'lions'.

Without an inverted index, our search might look like the following. Here, we use the embedding of the image below and search against 100 million vectors.

picture

SELECT
  url,
  caption,
  L2Distance(text_embedding, [-0.17659325897693634, …, 0.05511629953980446]) AS score
FROM laion_100m
WHERE hasToken(lower(caption), 'lions')
ORDER BY score ASC
LIMIT 10
FORMAT Vertical

Row 1:
──────
url:   https://static.wixstatic.com/media/c571fa_25ec3694e6e04a39a395d07d63ae58fc~mv2.jpg/v1/fill/w_420,h_280,al_c,q_80,usm_0.66_1.00_0.01/Mont%20Blanc.jpg
caption: Travel on a safari to Tanzania, to the rolling plains of the Serengeti, the wildlife-filled caldera of the Ngorongoro Crater and the lions and baobabs of Tarangire; Tanzania will impress you like few other countries will.  This tailor-made luxury safari will take you to three very different parks in northern Tanzania, each with their own scenery and resident wildlife.   As with all our private tours, this sample itinerary can be completely tailored to create the perfect journey of discovery for you.
score:   18.960329963316692

Row 2:
──────
url:   https://thumbs.dreamstime.com/t/jeepsafari-ngorongoro-tourists-photographers-watching-wild-lions-who-walk-jeeps-79635001.jpg
caption: Jeep safari in Ngorongoro3. Tourists and photographers are watching wild lions, who walk between the jeeps Stock Image
score:   18.988379350742093

10 rows in set. Elapsed: 6.194 sec. Processed 93.82 million rows, 79.00 GB (15.15 million rows/s., 12.75 GB/s.)

To speed up this kind of metadata filtering, we can take advantage of the inverted index and add one on the caption column.

SET allow_experimental_inverted_index=1
ALTER TABLE laion_100m ADD INDEX caption_idx(lower(caption)) TYPE inverted;
ALTER TABLE laion_100m MATERIALIZE INDEX caption_idx;

Repeating our previous query, we can see a significant improvement in query time. The inverted index limits the number of rows considered for the distance comparison to around 33 million, reducing the time from roughly 6 seconds to 3.5 seconds.

SELECT
  url,
  caption,
  L2Distance(text_embedding, [-0.17659325897693634, ..., 0.05511629953980446]) AS score
FROM laion_100m
WHERE hasToken(lower(caption), 'lions')
ORDER BY score ASC
LIMIT 10
FORMAT Vertical

Row 1:
──────
url:   https://static.wixstatic.com/media/c571fa_25ec3694e6e04a39a395d07d63ae58fc~mv2.jpg/v1/fill/w_420,h_280,al_c,q_80,usm_0.66_1.00_0.01/Mont%20Blanc.jpg
caption: Travel on a safari to Tanzania, to the rolling plains of the Serengeti, the wildlife-filled caldera of the Ngorongoro Crater and the lions and baobabs of Tarangire; Tanzania will impress you like few other countries will.  This tailor-made luxury safari will take you to three very different parks in northern Tanzania, each with their own scenery and resident wildlife.   As with all our private tours, this sample itinerary can be completely tailored to create the perfect journey of discovery for you.
score:   18.960329963316692

Row 2:
──────
url:   https://thumbs.dreamstime.com/t/jeepsafari-ngorongoro-tourists-photographers-watching-wild-lions-who-walk-jeeps-79635001.jpg
caption: Jeep safari in Ngorongoro3. Tourists and photographers are watching wild lions, who walk between the jeeps Stock Image
score:   18.988379350742093

10 rows in set. Elapsed: 3.554 sec. Processed 32.96 million rows, 74.11 GB (9.27 million rows/s., 20.85 GB/s.)

The query results are as follows:

python search.py search --image ./images/safari.jpg --table laion_100m --filter "hasToken(lower(caption), 'lions')"

picture

Advanced Features

Approximate nearest neighbor (Annoy)

Note: Annoy indexing in ClickHouse is still highly experimental.

Annoy indexes are designed to improve the efficiency of large-scale nearest-neighbor vector searches. They involve a trade-off between accuracy and computational efficiency.

Specifically, an Annoy index is a data structure for finding approximate nearest neighbors in high-dimensional space. Annoy works by organizing the vectors into a tree structure. It uses random hyperplanes (lines in 2d space, planes in 3d, and so on) to split the high-dimensional space into regions, each containing only a subset of the data points. These partitions are in turn used to build a tree structure (usually binary), where each node represents a hyperplane and the child nodes represent the regions the plane splits the space into. The leaf nodes of the tree contain the actual data points. Balancing and optimization techniques, such as random insertion and the use of heuristics to choose the best hyperplane for partitioning, keep the tree efficient and balanced.

Once an Annoy index has been built, it can be used for searching. Given a query vector, the tree can be traversed by comparing the vector to the hyperplane at each internal node. At each level of the tree, Annoy estimates the distance between the query vector and the regions represented by the child nodes; this distance determines which child to explore further. When a leaf node is reached, the set of points encountered is returned. The result is an approximate result set, with a search time likely to be much faster than a linear scan.

picture

Image segmented by Annoy hyperplane

When creating an Annoy index in ClickHouse, we can specify NumTree and DistanceName. The latter is the distance function used, defaulting to L2Distance, which is appropriate for our LAION dataset. The former is the number of trees the algorithm creates: the more trees, the slower both index creation (CREATE) and queries (SELECT) become, but the better the accuracy (subject to the randomness involved). By default, NumTree is set to 100.

Below, we show the schema of the LAION dataset with Annoy indexes - one for each embedding column. We populate the table with the 100m-row subset.

SET allow_experimental_annoy_index = 1

CREATE TABLE default.laion_100m_annoy
(
   `_file` LowCardinality(String),
   `key` String,
   `url` String,
   `caption` String,
   `similarity` Float64,
   `width` Int64,
   `height` Int64,
   `original_width` Int64,
   `original_height` Int64,
   `status` LowCardinality(String),
   `NSFW` LowCardinality(String),
   `exif` Map(String, String),
   `text_embedding` Array(Float32),
   `image_embedding` Array(Float32),
   `orientation` String DEFAULT exif['Image Orientation'],
   `software` String DEFAULT exif['Image Software'],
   `copyright` String DEFAULT exif['Image Copyright'],
   `image_make` String DEFAULT exif['Image Make'],
   `image_model` String DEFAULT exif['Image Model'],
   INDEX annoy_image image_embedding TYPE annoy(1000) GRANULARITY 1000,
   INDEX annoy_text text_embedding TYPE annoy(1000) GRANULARITY 1000
)
ENGINE = MergeTree
ORDER BY (height, width, similarity)

INSERT INTO laion_100m_annoy SELECT * FROM laion_100m

0 rows in set. Elapsed: 1596.941 sec. Processed 100.00 million rows, 663.68 GB (62.62 thousand rows/s., 415.59 MB/s.)

As shown, the overhead of the Annoy indexes at insert time is significant: the insert above takes about 27 minutes for 100 million rows, compared to around 10 minutes for the table without these indexes. Below, we repeat our earlier query, which took approximately 24 seconds on the un-indexed table (hot).

SELECT
  url,
  caption,
  L2Distance(image_embedding, [embedding]) AS score
FROM laion_100m_annoy
ORDER BY score ASC
LIMIT 10 FORMAT Vertical

Row 1:
──────
url:   https://i.dailymail.co.uk/i/pix/2012/04/26/article-2135380-12C5ADBC000005DC-90_634x213.jpg
caption: Pampered pets: This hammock-style dog bed offers equal levels of pet comfort
score:   12.313203570174357

Row 2:
──────
url:   https://i.pinimg.com/originals/15/c2/11/15c2118a862fcd0c4f9f6c960d2638a0.jpg
caption: rhodesian ridgeback lab mix puppy
score:   12.333195649580162

10 rows in set. Elapsed: 1.456 sec. Processed 115.88 thousand rows, 379.06 MB (79.56 thousand rows/s., 260.27 MB/s.)

The Annoy index brings a significant improvement in query performance - this query takes between 1 and 2 seconds - at the expense of some search quality.

The embedding used here represents our "a sleepy ridgeback dog" text. The image results are shown below.

python search.py search --text "a sleepy ridgeback dog" --table laion_100m_annoy

picture

In ClickHouse, it is important to note that Annoy indexes can be used to speed up queries that either sort with ORDER BY DistanceFunction(Column, vector) or filter with WHERE DistanceFunction(Column, Point) < MaxDistance, but not both in the same query. The query must set a LIMIT to return the top N matches. To return those matches, a priority-queue-based buffer is used to collect the matching vectors; once the buffer is full, collection stops and the buffer is sorted. The size of this buffer is limited by the setting max_limit_for_ann_queries (1,000,000 by default).
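To make the two supported query shapes concrete, here is a minimal sketch against the laion_100m_annoy table above; [<embedding>] is a placeholder and the 15.0 cutoff is purely illustrative:

-- 1. Sort by the distance function, with a LIMIT
SELECT url, caption
FROM laion_100m_annoy
ORDER BY L2Distance(image_embedding, [<embedding>]) ASC
LIMIT 10;

-- 2. Filter on the distance function, with a LIMIT (but not both forms in one query)
SELECT url, caption
FROM laion_100m_annoy
WHERE L2Distance(image_embedding, [<embedding>]) < 15.0
LIMIT 10
SETTINGS max_limit_for_ann_queries = 1000000;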

User-defined functions (UDFs)

ClickHouse's user-defined functions, or UDFs, allow users to extend ClickHouse's behavior by creating lambda expressions that take advantage of SQL constructs and functions. These functions can then be used like any built-in function in the query.

So far, we have relied on generating our vectors outside of ClickHouse, passing the generated embedding from our search.py script at query time. While this works, it would be nicer if we could simply pass the text, an image path, or even a URL directly in the SQL query.

We can use UDFs to accomplish this task. The UDFs defined below are called embedText and embedImage respectively.

SELECT
  url,
  caption,
  L2Distance(image_embedding, embedText('a sleepy ridgeback dog')) AS score
FROM laion_10m
ORDER BY score ASC
LIMIT 10

SELECT
  url,
  caption,
  L2Distance(text_embedding, embedImage('https://dogpictures.com/ridgeback.jpg')) as score
FROM laion_100m
ORDER BY score ASC
LIMIT 10

To define the embedText UDF, we first adapt the generate.py script used earlier to produce embeddings into the following embed_text.py.

Note: This should be saved in ClickHouse's user_scripts folder.

#!/usr/bin/python3
import clip
import torch
import sys

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

if __name__ == '__main__':
  for text in sys.stdin:
      inputs = clip.tokenize(text)
      with torch.no_grad():
          text_features = []
          text_features = model.encode_text(inputs)[0].tolist()
          print(text_features)
          sys.stdout.flush()

This embed_text.py script can then be exposed via the custom function embedText. The following configuration can be placed under the ClickHouse configuration directory (default is /etc/clickhouse-server/) and named embed_text__function.xml.

Note: Users should ensure that the dependencies of this script are installed for clickhouse users.

<functions>
  <function>
      <type>executable</type>
      <name>embedText</name>
      <return_type>Array(Float32)</return_type>
      <argument>
          <type>String</type>
          <name>text</name>
      </argument>
      <format>TabSeparated</format>
      <command>embed_text.py</command>
      <command_read_timeout>1000000</command_read_timeout>
  </function>
</functions>

After the function is registered, we can now use it like we did in the previous example:

SELECT
  url,
  caption,
  L2Distance(image_embedding, embedText('a sleepy ridgeback dog')) AS score
FROM laion_10m
ORDER BY score ASC
LIMIT 10

For our similar embedImage function, we add another UDF based on the following python script embed_image.py.

#!/usr/bin/python3
from io import BytesIO
from PIL import Image
import requests
import clip
import torch
import sys

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

if __name__ == '__main__':
  for url in sys.stdin:
      response = requests.get(url.strip())
      response.raise_for_status()
      image = preprocess(Image.open(BytesIO(response.content))).unsqueeze(0).to(device)
      with torch.no_grad():
          print(model.encode_image(image)[0].tolist())
          sys.stdout.flush()

The corresponding UDF configuration for embedImage, again placed under the ClickHouse configuration directory, is shown below:

<functions>
  <function>
      <type>executable_pool</type>
      <name>embedImage</name>
      <return_type>Array(Float32)</return_type>
      <argument>
      <type>String</type>
      </argument>
    <format>TabSeparated</format>
      <command>embed_image.py</command>
      <command_read_timeout>1000000</command_read_timeout>
  </function>
</functions>

When a UDF is set to the executable_pool type, ClickHouse maintains a pool of pre-loaded Python instances ready to receive input. For our function this is beneficial, since it avoids reloading the model on every invocation after the first execution, allowing subsequent calls to be faster. More details on how to control the pool size and other configuration parameters can be found here.

Now that both UDFs are configured, we can perform the following query:

SELECT embedImage('https://cdn.britannica.com/12/236912-050-B39F82AF/Rhodesian-Ridgeback-dog.jpg')
...
1 row in set. Elapsed: 13.421 sec.

SELECT embedImage('https://cdn.britannica.com/12/236912-050-B39F82AF/Rhodesian-Ridgeback-dog.jpg')
...
1 row in set. Elapsed: 0.317 sec.

SELECT
  url,
  caption,
  L2Distance(image_embedding, embedImage('https://cdn.britannica.com/12/236912-050-B39F82AF/Rhodesian-Ridgeback-dog.jpg')) AS score
FROM laion_10m
ORDER BY score ASC
LIMIT 10

With this in place, we can also use an embed_concept.py script and an embedConcept function to expose the concept math capabilities explored in the "Extra vector fun" section below.

select embedConcept('(berlin - germany) + (uk + bridge)')

SELECT
  url,
  caption,
  L2Distance(image_embedding, embedConcept('(berlin - germany) + (uk + bridge)')) AS score
FROM laion_10m
ORDER BY score ASC
LIMIT 10

Note that the above example does not include error handling and input validation. We leave this as an exercise to the reader. Hopefully these examples provide some inspiration for combining user-defined functions, embedding models, and vector searches!

Improving compression

Improved compression can help reduce overall data size and storage requirements. For example, our earlier schema and the resulting compression statistics were based on storing our vectors as the Array(Float32) type. For some models, however, 32-bit floating point precision is not required, and similar match quality can be obtained by reducing it to 16 bits.

Although ClickHouse does not have a native 16-bit floating point type, we can still reduce our precision to 16 bits while reusing the Float32 type, with each value simply padded with trailing zeros. Those zeros are compressed efficiently by the ZSTD codec (the standard in ClickHouse Cloud), reducing our compressed storage requirements.

To achieve this, we need to encode the 16-bit floating point values correctly. Fortunately, Google's bfloat16 type suits machine learning use cases and only requires truncating the last 16 bits of a 32-bit floating point number, provided the latter uses IEEE-754 encoding.

picture

Source: https://cloud.google.com/tpu/docs/bfloat16

Although bfloat16 is not currently native to ClickHouse, it can be easily replicated with other functions. We do this below for the image_embedding and text_embedding columns.

To do this, we select all rows from the source table (e.g. laion_100m) and insert them into an equivalent table (e.g. laion_100m_bfloat16) using an INSERT INTO SELECT. During the SELECT, we convert the embedding values to their bfloat16 representation.

This bfloat16 conversion uses an arrayMap function, namely arrayMap(x -> reinterpretAsFloat32(bitAnd(reinterpretAsUInt32(x), 4294901760)), image_embedding).

This function iterates over every value x in the vector embedding and applies the transformation reinterpretAsFloat32(bitAnd(reinterpretAsUInt32(x), 4294901760)). It reinterprets the bits of x as a UInt32 using reinterpretAsUInt32 and performs a bitAnd with the value 4294901760, whose binary representation is 11111111111111110000000000000000. This operation therefore zeroes out the trailing 16 bits, performing an effective truncation. The resulting bits are then reinterpreted as a Float32.
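As a quick sanity check (illustrative only, with an arbitrary input value), the effect of this truncation can be observed on a single number:

SELECT
  toFloat32(0.123456789) AS original,
  reinterpretAsFloat32(bitAnd(reinterpretAsUInt32(original), 4294901760)) AS truncated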

We demonstrate this process below:

INSERT INTO default.laion_1m_bfloat16 SELECT
  _file,
  key,
  url,
  caption,
  similarity,
  width,
  height,
  original_width,
  original_height,
  status,
  NSFW,
  exif,
  arrayMap(x -> reinterpretAsFloat32(bitAnd(reinterpretAsUInt32(x), 4294901760)), text_embedding) AS text_embedding,
  arrayMap(x -> reinterpretAsFloat32(bitAnd(reinterpretAsUInt32(x), 4294901760)), image_embedding) AS image_embedding,
  orientation,
  software,
  copyright,
  image_make,
  image_model
FROM laion_1m

picture

As shown below, this effectively reduces the compressed size of the data by over 25% - the zeros compress very well.

SELECT
   table,
   name,
   formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
   formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
   round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE (table IN ('laion_100m', 'laion_100m_bfloat16', 'laion_10m', 'laion_10m_bfloat16')) AND (name IN ('text_embedding', 'image_embedding'))
GROUP BY
   table,
   name
ORDER BY table DESC

┌─table───────────────┬─name────────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
│ laion_10m_bfloat16  │ text_embedding  │ 13.51 GiB       │ 28.46 GiB         │  2.11 │
│ laion_10m_bfloat16  │ image_embedding │ 13.47 GiB       │ 28.46 GiB         │  2.11 │
│ laion_10m           │ text_embedding  │ 18.36 GiB       │ 28.59 GiB         │  1.56 │
│ laion_10m           │ image_embedding │ 18.36 GiB       │ 28.59 GiB         │  1.56 │
│ laion_100m_bfloat16 │ image_embedding │ 134.02 GiB      │ 286.75 GiB        │  2.14 │
│ laion_100m_bfloat16 │ text_embedding  │ 134.82 GiB      │ 286.75 GiB        │  2.13 │
│ laion_100m          │ text_embedding  │ 181.64 GiB      │ 286.43 GiB        │  1.58 │
│ laion_100m          │ image_embedding │ 182.29 GiB      │ 286.43 GiB        │  1.57 │
└─────────────────────┴─────────────────┴─────────────────┴───────────────────┴───────┘

8 rows in set. Elapsed: 0.009 sec.

After reducing the precision to 16 bits, raising the ZSTD compression level has less impact. As shown below, ZSTD(3) only slightly reduces the compressed size of the bfloat16 data.

SELECT
  table,
  name,
  formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
  formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
  round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE (table IN ('laion_100m_bfloat16', 'laion_100m_bfloat16_zstd_3')) AND (name IN ('text_embedding', 'image_embedding'))
GROUP BY
  table,
  name
ORDER BY table DESC

┌─table──────────────────────┬─name────────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
│ laion_100m_bfloat16_zstd_3 │ text_embedding  │ 128.12 GiB     │ 286.85 GiB       │  2.24 │
│ laion_100m_bfloat16_zstd_3 │ image_embedding │ 127.28 GiB      │ 286.85 GiB       │  2.25 │
│ laion_100m_bfloat16       │ image_embedding  │ 133.80 GiB     │ 286.75 GiB       │  2.14 │
│ laion_100m_bfloat16       │ text_embedding   │ 134.59 GiB     │ 286.75 GiB       │  2.13 │
└────────────────────────────┴─────────────────┴─────────────────┴───────────────────┴───────┘

In addition to reducing disk space, there are other potential benefits to increasing compression. We demonstrate these benefits by querying tables containing 10m and 100m rows using embeddings encoded as float32 and bfloat16. These results are based on the same query we used previously.

Table        Encoding   Cold (secs)   Hot (secs)
laion_10m    Float32    12.851        2.406
laion_10m    bfloat16   7.285         1.554
laion_100m   Float32    111.857       24.444
laion_100m   bfloat16   71.362        16.271

We achieve a significant improvement in linear scan speed here, with the bfloat16 variant improving a cold query on the 100m-row dataset from 111 seconds to 71 seconds.

An obvious question is how this reduction in precision affects our ability to represent concepts in the vectors, and whether it lowers search quality. After all, we have reduced the information encoded in the multidimensional space, effectively squeezing our vectors "closer" together. Below, we use the new laion_100m_bfloat16 table and our search.py script to repeat the earlier query for "a sleepy ridgeback dog".

python search.py search --text "a sleepy ridgeback dog" --table laion_100m_bfloat16

picture

Although there is no obvious reduction in search quality for this query, confirming this would require relevance testing over a wider sample of queries. Users should test this precision-reduction technique on their specific model and dataset, as results will vary case by case.

Extra vector fun

After reading an interesting blog post about how to use vector math to move in high-dimensional spaces, we wanted to see if we could apply the same concept to our CLIP-generated embeddings.

For example, suppose we have word embeddings for "Berlin", "Germany", "United Kingdom" and "Bridge". We can perform the following mathematical operation on their respective vectors:

(berlin - germany) + ('united kingdom' + bridge)

If we logically subtract and add the above concepts, we might expect the result to represent a bridge in London.

To test this idea, we enhanced our simple search.py script to support a basic parser. This parser supports the +, -, * and / operations, uses ' to denote multi-word inputs, and is exposed through a concept_math command.

Thanks to the great pyparsing library, building a parser for this grammar is simple. In summary, the above phrase will be parsed into the following syntax tree:

picture

We can then recursively compute the vectors for the text terms (the leaves) of the tree above. Branches are combined using the equivalent vector functions provided in ClickHouse for the specified mathematical operators. This process is depth-first, folding the entire tree into a single query (which should represent the equivalent concept).

Finally, this function matches the image_embedding column using the same process as the standard search. Therefore, the above resolves to the following query:

SELECT url, caption,
L2Distance(image_embedding,
  arrayMap((x,y) -> x+y,
      arrayMap((x,y) -> x-y, [berlin embedding], [germany embedding]),
      arrayMap((x,y) -> x+y, ['united kingdom' embedding], [bridge embedding])
  )
) AS score FROM laion_10m ORDER BY score ASC LIMIT 10

Note that we use arrayMap functions for our point-wise addition and subtraction (support for + and - as point-wise operators is being considered).

We show these results below, matching a sample of 10m rows:

python search.py concept_math --text "(berlin - germany) + ('united kingdom' + bridge)"

picture

So cool! It really works! Note that there is no mention of London Bridge in the text - the first image is from a series of Claude Monet's Waterloo Bridge paintings.

Finally, we thought it might be useful to enhance the grammar parser to support integer constants. Specifically, we wanted to see whether the midpoint between two contrasting concepts yields something interesting. What, for example, might lie artistically between the concepts cubism and surrealism? Mathematically, this can be expressed as (cubism + surrealism) / 2. Performing this search actually produced something interesting:

picture
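For reference, under the same approach this midpoint expression resolves to a query along the following lines - a sketch, with the embeddings elided:

SELECT url, caption,
  L2Distance(image_embedding,
      arrayMap(x -> x / 2.,
          arrayMap((x, y) -> x + y, [cubism embedding], [surrealism embedding])
      )
  ) AS score
FROM laion_10m
ORDER BY score ASC
LIMIT 10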

We leave it to the artists among our readers to comment on the relevance and accuracy here.

This again shows another interesting possibility for combining vectors. No doubt this basic vector math can be useful in other situations. We'd love to hear any examples!

Conclusion

In this blog post, we have shown how a vector dataset containing 2 billion rows can be converted to Parquet and loaded into ClickHouse. We have demonstrated that this data compresses well, that linear search scales with CPU cores, and that the metadata enables full SQL-based analytics alongside vector search. Finally, we showcased some of ClickHouse's newer ANN capabilities and explored how UDFs can provide elegant functions for generating embeddings.
