Pigsty v2.3.1: PGVECTOR with HNSW is here

Pigsty v2.3.1 is now released. In this version, PGVECTOR receives its epic v0.5 update, adding support for the new HNSW index. There is also support for the newly released PostgreSQL 16 RC1. In addition, the official documentation is now available in Chinese, and the existing documentation has been enriched and improved. Finally, there are routine software version updates and bug fixes.

HNSW support in PGVECTOR

PGVECTOR is a practical and powerful PostgreSQL extension that adds complete vector storage and retrieval capabilities to your existing relational PostgreSQL database.

On September 1st, PGVECTOR released a major update, v0.5, introducing a new index type: HNSW. On the ANN Benchmark, its recall and performance are greatly improved over the original IVFFLAT index, and its overall performance is not inferior to dedicated vector databases. In addition, distance calculation is significantly faster than in 0.4.4 (especially the L2 Euclidean distance). Finally, IVFFLAT indexes can now be built in parallel, speeding up this time-consuming operation severalfold.

Trade-offs of vector databases

Vector database indexing algorithms face an "impossible triangle": quality, efficiency, and cost.

Quality: The recall (correctness) of the query results

Efficiency: response time of queries

Cost: memory required for queries / index construction speed

For the typical vector database scenario, semantic search, the most important attribute is undoubtedly recall, which directly affects the end-user experience. Performance is usually not a big problem as long as it is good enough: model encoding typically takes tens to hundreds of milliseconds, so optimizing a query from 10 ms to 1 ms is meaningless for user experience, and throughput can always be scaled out by adding read replicas. Memory usage and build speed are usually the least important: they are not perceived by users, and at current memory prices, a problem that can be solved by throwing money and resources at it is usually not a problem.

The classic brute-force full-table scan offers the best, irreplaceable quality and requires very little memory (it scans sequentially, loading data in turn), but its performance is very poor. The original IVFFLAT index used by PGVector has good performance and moderate memory usage, but mediocre recall. The newly added HNSW index excels in both recall and performance; its biggest disadvantages are slow index construction and high memory usage.

The HNSW algorithm is almost standard equipment in dedicated vector databases because it excels in both quality and performance. Although it has the disadvantage of slow index construction, it also brings several additional benefits:

  • Incremental maintenance: with HNSW indexes, you can create the index on an empty table and add vectors at any time without hurting recall. This differs from IVFFLAT, which must first load the vector data and run K-means to find the best centroids in order to achieve the best recall; after heavy updates, you may need to rebuild an IVFFLAT index to restore its recall.

  • Updates and deletes: pgvector's HNSW implementation supports updates and deletes via standard UPDATE / DELETE statements. Many HNSW implementations in dedicated vector databases do not support this!

This means HNSW indexes require almost no extra maintenance: you can write and update concurrently, maintain the index incrementally, and build new indexes without blocking reads and writes, all without worrying that data changes will degrade index quality. The data is durable, protected by PITR point-in-time recovery, and replicated to standbys seamlessly over the standard WAL infrastructure.
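The workflow above can be sketched in a few SQL statements. This is a minimal illustration of the pgvector 0.5 HNSW API; the table, column, and dimension choices are made up for brevity (real embedding models use hundreds or thousands of dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id        bigserial PRIMARY KEY,
  embedding vector(3)   -- 3 dimensions for brevity only
);

-- An HNSW index can be created on an empty table;
-- m and ef_construction are the tunable build parameters
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
  WITH (m = 16, ef_construction = 64);

-- Insert, update, and delete freely: the index is maintained incrementally
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
UPDATE items SET embedding = '[1,1,1]' WHERE id = 1;
DELETE FROM items WHERE id = 2;

-- Nearest-neighbor search by L2 distance;
-- raising hnsw.ef_search trades speed for recall
SET hnsw.ef_search = 100;
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 10;
```

`vector_l2_ops` with the `<->` operator is L2 distance; inner-product and cosine operator classes are also available.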

PGVector's HNSW implementation performs excellently: according to Jonathan Katz's benchmark results, it is significantly better than pg_embedding (a disk-based implementation). The various vector-extension forks of PostgreSQL may now face intense competitive pressure to survive.

At the same time, the original IVFFLAT index has also received a series of improvements: IVFFLAT indexes can now be built in parallel, with speedups ranging from several times to ten times depending on the degree of parallelism. The distance functions have also been optimized; for example, L2 distance calculation is 36% faster on ARM64.
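A parallel IVFFLAT build uses PostgreSQL's standard parallel index-build machinery. A minimal sketch, where the worker count, memory budget, and `lists` value are illustrative and should be tuned to your hardware and data size:

```sql
-- Allow up to 7 parallel workers (plus the leader) for index builds
SET max_parallel_maintenance_workers = 7;
SET maintenance_work_mem = '2GB';

-- pgvector 0.5 can now build this in parallel
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
  WITH (lists = 100);
```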

PGVECTOR's strategic position

Implementing a vector data type and index takes only a few thousand lines of code, while implementing a fully functional, stable, and reliable database like PostgreSQL takes millions: the complexity is simply not comparable. A qualified vector database must first be a qualified database, which is not easy to build from scratch; taking the wrong route costs a hundredfold effort in wasted work. PGVector chooses to stand on the shoulders of giants rather than reinvent the database wheel from scratch, a wise and pragmatic approach.

For example, you can combine vector capabilities with all sorts of features PostgreSQL already provides: use expression or partial indexes to index only the content of interest, watch index-creation progress in real time, create or rebuild indexes online without blocking reads and writes, and build indexes in parallel with multiple processes.
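For instance, the features above compose directly with an HNSW index. A sketch, assuming a hypothetical `items` table with a `created_at` column; the predicate and index name are illustrative:

```sql
-- Partial index built online: index only the rows of interest,
-- without blocking concurrent reads and writes
CREATE INDEX CONCURRENTLY items_recent_idx ON items
  USING hnsw (embedding vector_l2_ops)
  WHERE created_at > '2023-01-01';

-- From another session, watch the build progress in real time
SELECT phase, blocks_done, blocks_total, tuples_done, tuples_total
  FROM pg_stat_progress_create_index;
```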

Of course, vector fuzzy/semantic search can also be combined with PostgreSQL's native full-text search and inverted indexes to produce better, more interpretable results. You can also keep using standard SQL to filter precisely on metadata and other fields, combining exact search with fuzzy search, and avoid the hassle of shuttling data back and forth among multiple dedicated data components.
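A hybrid query might look like the following sketch. The `docs` table, its columns, and the query vector are hypothetical; the point is that an exact SQL filter, a full-text match, and semantic ranking all run in one statement:

```sql
SELECT id, title
  FROM docs
 WHERE category = 'postgres'                 -- exact metadata filter
   AND to_tsvector('english', body)
       @@ plainto_tsquery('english', 'vector index')  -- full-text match
 ORDER BY embedding <-> '[0.1, 0.2, 0.3]'    -- semantic (vector) ranking
 LIMIT 10;
```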

The only real bright spot of a dedicated vector database is performance, and pgvector's HNSW implementation now challenges that last advantage. I am afraid we will soon see the history of geographic, document, time-series, and distributed databases repeat itself once more.

PostgreSQL 16 RC 1

The first release candidate of PostgreSQL 16, RC1, was released on September 1st! The official release is scheduled for September 14.

Pigsty is probably the first distribution to offer PostgreSQL 16 support, starting from 16 beta1. Although it is not yet officially released, you can already bring up a high-availability PostgreSQL 16 cluster. PostgreSQL 16 has some useful new features: logical decoding and logical replication from standbys, new statistics views for I/O, parallel execution of FULL joins, better freezing performance, a new set of functions compliant with the SQL/JSON standard, and regular expressions in HBA authentication.

Pigsty pays special attention to the observability improvements in PostgreSQL 16. The new pg_stat_io view gives users direct access to important I/O statistics from inside the database, which is of great significance for performance tuning and failure analysis. In the past, users could only see limited statistics at the database / BGWriter level; for anything more detailed, they had to analyze I/O metrics at the operating-system level. Now you can gain deep insight into behaviors such as reads / writes / extends / writebacks / fsyncs / hits / evictions along three dimensions: backend type, relation type, and operation context.
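Querying the new view is straightforward; for example, a sketch that pulls per-backend-type I/O counters (filtering on client backends is just one possible slice):

```sql
-- PostgreSQL 16: I/O statistics by backend type, object, and context
SELECT backend_type, object, context,
       reads, writes, extends, hits, evictions, fsyncs
  FROM pg_stat_io
 WHERE backend_type = 'client backend';
```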

Another very valuable observability improvement: pg_stat_all_tables and pg_stat_all_indexes now record the time of the last sequential / index scan. Although this could already be inferred from scan-count charts in Pigsty's monitoring system, direct official support is definitely better: users can draw conclusions at a glance, for example, whether an index is unused and can be considered for removal. In addition, the new n_tup_newpage_upd counter tells us how many updated rows could not be updated in place on their page and were moved to a new page instead, an important reference for optimizing UPDATE performance and tuning the table fill factor.
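Two illustrative queries over these new columns (the 30-day threshold is an arbitrary example):

```sql
-- Indexes never scanned, or idle for 30+ days: candidates for removal
SELECT schemaname, relname, indexrelname, last_idx_scan
  FROM pg_stat_all_indexes
 WHERE last_idx_scan IS NULL
    OR last_idx_scan < now() - interval '30 days';

-- How many updates had to move the row to a new page,
-- versus staying on the same page (HOT updates)
SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd
  FROM pg_stat_all_tables
 WHERE n_tup_upd > 0
 ORDER BY n_tup_newpage_upd DESC;
```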

In Pigsty v2.3.1, the PostgreSQL 16 RC1 RPM packages are included by default in the EL 8 / EL 9 offline packages. The official PostgreSQL 16 release will be picked up immediately in the next Pigsty release.

Chinese documentation

Since v2.0, Pigsty's official documentation has been English-only, but starting with this version the Chinese documentation is back! Thanks to GPT-4 for providing high-quality Chinese-English translation. The English documentation has also gained plenty of supplementary content.

In addition to the original GitHub Pages documentation at https://vonng.github.io/pigsty/ , there is now a dedicated domain: https://doc.pigsty.cc . To address inconvenient access for users behind the Great Firewall, a mirrored official site is also provided at https://pigsty.cc , where the bilingual Chinese/English documentation has likewise been updated to v2.3.1.

Another change: Pigsty's public demo https://demo.pigsty now uses a formal HTTPS certificate issued by a certificate authority instead of Pigsty's self-signed certificate.

Test environment

In the past, testing Pigsty has always been a headache: five supported major PostgreSQL versions (12, 13, 14, 15, 16), three major operating-system versions (EL7, EL8, EL9), and the permutations across compatible distribution variants (RHEL, CentOS, Alma, Rocky) all need to be tested.

Therefore, v2.3.1 provides a new configuration file, check.yml, which brings up 30 database clusters with different operating systems, major versions, and specifications in one go for testing. This configuration file also demonstrates how to install different major versions, making it a useful reference.

Bug fixes

Pigsty v2.3.1 fixes two bugs in v2.3.0.

The first issue is related to Watchdog: if you want to avoid split-brain in extreme cases, you can set the patroni_watchdog_mode parameter to required. In that case, the Patroni service will now automatically perform modprobe softdog and chown the watchdog device to postgres, ensuring Patroni has permission to use the watchdog.

The second issue concerns downloading packages from upstream: when a download specification contains the wildcard '*', it is now wrapped in single quotes to avoid being expanded against packages already present in the target directory.

Software updates

In addition, some packages in the offline package have been updated to their latest versions. Grafana was upgraded to v10.1, which introduces some interesting new features; Loki / Promtail to 2.8.4; FerretDB, the middleware component providing MongoDB compatibility, to 1.9; the PG log analyzer pgbadger to 1.12.2; and the TimescaleDB extension to 2.11.2.

Also added to Pigsty v2.3.1's default packages is SealOS, a lean binary for quickly deploying Kubernetes clusters. Pigsty will further support Kubernetes, and databases on Kubernetes, in subsequent releases.

Origin: www.oschina.net/news/256789/pigsty-v2-3-1-released