This article is shared from Huawei Cloud Community " GaussDB (DWS) Vectorized Execution Engine Detailed Explanation ", author: yd_212508532.
Preface
- Applicable version: [Baseline function]
Most traditional row execution engines adopt a one-tuple-at-a-time execution model. Under this model, much of the execution time is spent not on processing data but on traversing the execution tree, so effective CPU utilization is low, and the huge number of function calls in OLAP scenarios incurs a large overhead. To solve this problem, GaussDB (DWS) added a vectorization engine. The vectorization engine processes one batch of tuples at a time, which greatly reduces the cost of traversing execution nodes. It also pairs naturally with column storage, making it easy for the underlying scan nodes to load vectorized column data. Column storage plus the vectorized execution engine is one of the golden keys that opens the door to OLAP performance!
About row storage and column storage tables
A row-store table stores tuples in pages row by row. It is mostly used in TP scenarios where data is updated frequently, there are many inserts, updates and deletes, and query results involve most columns of the table.
A column-store table stores data column by column, with each column kept in its own file. It is mostly used in AP scenarios, for the following reasons (a table-creation sketch follows the list):
- When a wide table is queried on only a few columns, only those columns are read, reducing I/O.
- Data within a column is homogeneous, which improves the compression ratio.
- Operating on batches of column data gives a higher CPU cache hit rate.
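As a minimal sketch of the two orientations (the table and column names here are hypothetical, not taken from the article):

```sql
-- Row-store table: suited to TP workloads with frequent updates
-- and queries that read most columns of a row.
CREATE TABLE sales_row (
    id   int,
    x    numeric,
    y    numeric,
    note text
) WITH (ORIENTATION = ROW);

-- Column-store table: suited to AP workloads that scan a few columns
-- of a wide table; each column is stored (and compressed) separately.
CREATE TABLE sales_col (
    id   int,
    x    numeric,
    y    numeric,
    note text
) WITH (ORIENTATION = COLUMN);
```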
Execution framework
The executor is the hub between the optimizer and the storage engine. It takes the execution plan tree generated by the optimizer as input, fetches data from the storage engine, and applies the execution operators specified by the plan to process the data. Execution follows the Pipeline model: the row executor handles one tuple at a time, while the column executor handles one batch at a time. The upper layer drives the lower layer, so data flows up the execution tree through the operators that perform the various processing steps. The figure below shows the top-down control flow and the bottom-up data flow.
The execution process of the executor can be divided into three steps:
- Executor initialization: build the executor's global state (EState) and recursively traverse the plan tree, initializing the execution state (PlanState) of each node.
- Executor run: the row engine and the vectorization engine have separate entry points. Starting from the root of the plan tree, execution recurses down to the leaf nodes to fetch a tuple/batch; after the node operators process it layer by layer, a result tuple/batch is returned, and this repeats until no more tuples/batches are produced.
- Executor cleanup: release the executor's global state and clean up the execution state of each plan node.
Column executor
The problem with the row executor is that most CPU time goes into traversing the plan tree rather than actually processing data, so effective CPU utilization is low. In addition, column-store tables need a matching vectorization engine to truly realize their performance advantage in OLAP scenarios. The basic idea behind the column executor is therefore to process one column of data at a time.
Like the row executor, the vectorized execution engine follows the Pipeline model, but each operator processes and passes on one batch at a time (1,000 rows of data), which improves the CPU cache hit rate and reduces I/O reads. The data-flow structure of the column executor, VectorBatch, is shown in the figure below.
Mixing rows and columns: Adapter operator
Some operations on column-store tables are not supported by the vectorized execution engine, for example string_to_array, listagg, and string_agg.
GaussDB can automatically switch between the row and column engines by inserting adapter operators (RowToVec/VecToRow) into the plan, as sketched below.
If only a row engine were available, column data would usually have to be reassembled into row tuples before the engine could process it row by row; this tuple-deform step hurts the performance of queries on column-store data.
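Below is a hedged sketch of how such a plan can be inspected, using the hypothetical sales_col table from earlier; the exact node names printed by EXPLAIN (e.g. a Row Adapter / VecToRow node) may differ between GaussDB versions:

```sql
-- string_agg is not supported by the vectorized engine, so the planner
-- is expected to add a column-to-row adapter and let the row engine
-- evaluate the aggregate, while the scan itself stays vectorized.
EXPLAIN
SELECT id, string_agg(note, ',')
FROM sales_col
GROUP BY id;
```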
Vectorized execution engine performance
Comparing the row and column engines on the same expression, x*(1-y), the CStore Scan operator of the column engine takes about 85% less time than the Seq Scan operator of the row engine.
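A sketch of how such a comparison could be reproduced on the hypothetical sales_row/sales_col tables above; the 85% figure is the article's measurement, not something these statements guarantee:

```sql
-- Evaluate the same expression on both tables and compare the
-- per-operator times reported by EXPLAIN ANALYZE: the row plan uses
-- Seq Scan, the column plan uses CStore Scan with vectorized batches.
EXPLAIN ANALYZE SELECT sum(x * (1 - y)) FROM sales_row;
EXPLAIN ANALYZE SELECT sum(x * (1 - y)) FROM sales_col;
```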
The characteristics of vector computing are: computing multiple values at a time, fewer function calls and context switches, and full use of the CPU cache and vectorized instructions to improve performance.
Performance advantages of vectorized execution engines:
- One batch at a time: more data is read per call, reducing the number of I/O reads.
- Because each batch holds many records, the CPU cache hit rate increases.
- The number of function calls during Pipeline-mode execution is reduced.
- Paired with column-store tables, it reduces tuple deform, i.e. the overhead of reconstructing row tuples from column data.
Comparison of row and column executor operators
The execution operators of the vectorization engine are similar to those of the row engine and include control, scan, materialization and join operators. They are likewise represented as nodes, inherit from the row execution nodes, and are executed recursively. The main nodes are CStoreScan (sequential scan), CStoreIndexScan (index scan), CStoreIndexHeapScan (fetching tuples via a bitmap), VecMaterial (materialization), VecSort (sorting), VecHashJoin (vectorized hash join), and so on. These execution operators are introduced one by one below.
Scan operators
A scan operator scans the data in a table and, on each call, returns one tuple as input for the node above it; scan operators sit at the leaf nodes of the query plan tree. Besides tables, they can scan function result sets, the VALUES list, and subquery result sets. Common scan operators are listed in the table (example queries follow the table).
Operator (row-store / column-store) | Meaning | Typical scenario |
---|---|---|
SeqScan/ CStoreScan | sequential scan | The most basic scan operator, used to scan physical tables (sequential scan without index assistance) |
IndexScan/CStoreIndexScan | index scan | An index is created on the attributes involved in the selection criteria |
IndexOnlyScan/CStoreIndexOnlyScan | Return tuple directly from index | Index columns completely cover result set columns |
BitmapScan(BitmapIndexScan, BitmapHeapScan) / CStoreIndexHeapScan (CStoreIndexAnd, CStoreIndexOr,CStoreIndexCtidScan) | Use Bitmap to get tuples | BitmapIndexScan uses the index on the attribute to scan and returns the result as a bitmap; BitmapHeapScan obtains the tuple from the bitmap output by BitmapIndexScan |
TidScan | Get tuple by tuple tid | 1. WHERE conditions such as CTID = tid or CTID IN (tid1, tid2, …); 2. UPDATE/DELETE … WHERE CURRENT OF cursor |
SubqueryScan/VecSubqueryScan | subquery scan | Use another query plan tree (subplan) as the scan object to scan tuples |
FunctionScan | function scan | FROM function_name |
ValuesScan | VALUES list scan | Scan the collection of tuples given by a VALUES clause |
ForeignScan/VecForeignScan | External table scan | Query external table |
CteScan/VecCteScan | CTE table scan | Scan subqueries defined with WITH clause in a SELECT query |
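The queries below are a hedged sketch of statements that typically map to the scan operators above (the tables and index are the hypothetical ones from earlier; the operator actually chosen depends on statistics and cost):

```sql
-- Full scan of a column-store table: expected to use CStoreScan
-- (SeqScan on a row-store table).
SELECT count(*) FROM sales_col;

-- With an index on the filter column, a selective predicate is
-- expected to use CStoreIndexScan (IndexScan on a row-store table).
CREATE INDEX idx_sales_col_id ON sales_col (id);
SELECT x, y FROM sales_col WHERE id = 42;

-- Fetching a tuple by its tuple identifier on a row-store table: TidScan.
SELECT * FROM sales_row WHERE ctid = '(0,1)';
```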
Join operators
The join operator corresponds to the join operation in relational algebra. Taking t1 join t2 as an example, the main join types are inner join, left join, right join, full join, semi join and anti join; they are implemented with NestLoop, HashJoin and MergeJoin (example joins follow the table).
Operator (row-store / column-store) | Meaning | Typical scenario |
---|---|---|
NestLoop/VecNestLoop | Nested-loop join; a brute-force join that rescans the inner table for each outer row | Inner Join, Left Outer Join, Semi Join, Anti Join |
MergeJoin/VecMergeJoin | Merge join (ordered input): the inner and outer tables are sorted, the start and end positions are located, and the tuples are joined in one pass; equi-join only | Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Semi Join, Anti Join |
HashJoin/VecHashJoin | Hash join: a hash table is built on the hash value of the join column, so equal values fall into the same hash bucket; equi-join only | Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Semi Join, Anti Join |
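A sketch of joins that typically map to the operators above, on the hypothetical tables from earlier (the planner picks the actual method by cost):

```sql
-- Equi-join between two column-store inputs: a hash join (VecHashJoin)
-- is the common choice for large, unsorted inputs.
SELECT r.id, r.x, s.y
FROM sales_col r
JOIN sales_col s ON r.id = s.id;

-- Non-equi join condition: hash and merge join do not apply, so a
-- nested-loop join (VecNestLoop) is expected.
SELECT r.id, s.id
FROM sales_col r
JOIN sales_col s ON r.x > s.y;
```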
Materialization operators
Materialization operators are nodes that can cache tuples. During execution, many physical operators must first collect all their input tuples before they can operate (aggregate functions, sorting without index assistance, and so on), and they use materialization operators to cache those tuples (example statements follow the table).
Operator (row-store / column-store) | Meaning | Typical scenario |
---|---|---|
Material/VecMaterial | materialize | Cache child node results |
Sort/VecSort | Sort | ORDER BY clause; join, grouping and set operations; cooperates with Unique |
Group/VecGroup | Grouping operation | GROUP BY clause |
Agg/VecAggregation | Execute aggregate function | 1. Aggregation functions such as COUNT/SUM/AVG/MAX/MIN; 2. DISTINCT clause; 3. UNION to remove duplicates; 4. GROUP BY clause |
WindowAgg/VecWindowAgg | window function | WINDOW clause |
Unique/VecUnique | Deduplication (input already sorted by the lower node) | 1. DISTINCT clause; 2. UNION deduplication |
Hash | HashJoin helper node | Construct a hash table and cooperate with HashJoin |
SetOp/VecSetOp | Handling collection operations | INTERSECT/INTERSECT ALL, EXCEPT/EXCEPT ALL |
LockRows | Handling row-level locks | SELECT … FOR SHARE/UPDATE |
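A sketch of statements that typically exercise the materialization operators above (same hypothetical tables):

```sql
-- GROUP BY with aggregate functions: expected to use VecAggregation.
SELECT id, sum(x), avg(y) FROM sales_col GROUP BY id;

-- ORDER BY without a usable index: expected to use VecSort.
SELECT id, x FROM sales_col ORDER BY x DESC;

-- DISTINCT: expected to use VecUnique above sorted input
-- (or a hash-based aggregate, depending on the plan).
SELECT DISTINCT id FROM sales_col;

-- Window function: expected to use VecWindowAgg.
SELECT id, x, rank() OVER (PARTITION BY id ORDER BY x) FROM sales_col;
```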
Control operators
Control operators are nodes used to handle special situations and implement special execution flows (example statements follow the table).
Operator (row-store / column-store) | Meaning | Typical scenario |
---|---|---|
Result/VecResult | Direct computation | 1. The query contains no table scan; 2. An INSERT statement with a single VALUES clause; 3. When Append/MergeAppend is the plan root node (projection push-up) |
ModifyTable | INSERT/UPDATE/DELETE upper node | INSERT/UPDATE/DELETE |
Append/VecAppend | Appends the result sets of multiple subplans | 1. UNION (ALL); 2. Inheritance tables |
MergeAppend | Append with ordered inputs | 1. UNION (ALL); 2. Inheritance tables |
RecursiveUnion | Handling UNION subqueries defined recursively in WITH clause | WITH RECURSIVE … SELECT … statement |
BitmapAnd | Bitmap logical AND operation | BitmapScan for multidimensional index scanning |
BitmapOr | Bitmap logical OR operation | BitmapScan for multidimensional index scanning |
Limit/VecLimit | Handling LIMIT clauses | OFFSET … LIMIT … |
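A sketch of statements that typically map to the control operators above:

```sql
-- UNION ALL of two subplans: expected to use VecAppend on column input.
SELECT id FROM sales_col WHERE x > 0
UNION ALL
SELECT id FROM sales_col WHERE y > 0;

-- OFFSET ... LIMIT ...: expected to add a VecLimit node on top.
SELECT id, x FROM sales_col ORDER BY x LIMIT 10 OFFSET 20;

-- Recursively defined WITH subquery: handled by RecursiveUnion.
WITH RECURSIVE nums(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 10
)
SELECT * FROM nums;
```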
Other operators
Other operators include the Stream operator and operators such as RemoteQuery (a sketch follows the table).
Operator (row-store / column-store) | Meaning | Typical scenario |
---|---|---|
Stream | Multi-node data exchange | Execute a distributed query plan, and there is data exchange between nodes |
Partition Iterator | Partitioned iterator | Partition table scan, iteratively scan each partition |
VecToRow/RowToVec | Column-to-row / row-to-column conversion | Mixed row/column scenarios |
DfsScan / DfsIndexScan | HDFS table (index) scan | HDFS table scan |
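A hedged sketch for the Stream and Partition Iterator operators; the distributed and partitioned tables below are hypothetical, and the exact operator names printed by EXPLAIN vary by deployment:

```sql
-- Hash-distributed table. Joining on a column other than the
-- distribution key forces data exchange between nodes, so a Stream
-- (redistribute/broadcast) operator is expected in the plan.
CREATE TABLE orders_col (
    id      int,
    cust_id int,
    amount  numeric
) WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (id);

EXPLAIN
SELECT o.id, s.x
FROM orders_col o
JOIN sales_col s ON o.cust_id = s.id;

-- Range-partitioned table: scanning several partitions is expected to
-- appear under a Partition Iterator node.
CREATE TABLE logs_col (
    id int,
    ts date
) WITH (ORIENTATION = COLUMN)
PARTITION BY RANGE (ts)
(
    PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
    PARTITION p2024 VALUES LESS THAN ('2025-01-01')
);

EXPLAIN SELECT count(*) FROM logs_col WHERE ts >= '2023-06-01';
```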
The evolution of GaussDB vectorization
After the first-generation vectorization engine, GaussDB evolved higher-performance engines: the Sonic vectorization engine and the Turbo vectorization engine.
To keep improving OLAP execution performance, GaussDB continues to evolve along the path of column storage + vectorized execution + batch computation (a parameter sketch follows the list):
- Stream operator + distributed execution framework: supports data flow between multiple nodes
- SMP: multi-thread parallelism within a node, making full use of idle hardware resources
- LLVM: a new code-generation framework with a JIT (just-in-time) compiler that eliminates tuple-deform bottlenecks
- Sonic vectorization engine: further vectorizes the HashAgg and HashJoin operators and uses type-specific arrays for each column's data
- Turbo vectorization engine: the new generation, which vectorizes most operators and, on top of Sonic, adds NULL optimization, large-integer optimization, Stream optimization, Sort optimization and more to further improve performance
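As a hedged sketch, the session parameters below are commonly associated with the SMP and LLVM features above; treat the parameter names and values as assumptions to check against your GaussDB version's documentation:

```sql
-- SMP: raise the degree of intra-node parallelism for this session
-- (assumed parameter: query_dop).
SET query_dop = 4;

-- LLVM code generation (JIT): enable codegen and lower the cost
-- threshold so that more expressions qualify for compilation
-- (assumed parameters: enable_codegen, codegen_cost_threshold).
SET enable_codegen = on;
SET codegen_cost_threshold = 10000;
```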
Summary
This article introduced the GaussDB vectorized execution engine, covering its framework, principles, the main operators, and the sources of its performance gains.