A Comprehensive Look at Hash Join in PostgreSQL and Greenplum

On October 15, 2019, Yao Yandong, Vice President of Pivotal China R&D Center and sponsor of the Greenplum Chinese community, attended PostgreSQL Conference Europe, held in Italy, and delivered an excellent talk, "How does Hash Join work in PostgreSQL and its derivates". This article is compiled from that talk for readers to study and discuss.

Today I will walk through Hash Join in PostgreSQL and Greenplum in detail. I chose this topic because Hash Join is an important weapon for analytical (OLAP) query processing. First, let's look at how Hash Join is implemented in PostgreSQL.

Before introducing the Hash Join implementation, let's first understand what a JOIN is. According to Wikipedia, a JOIN is a relational-database operator that combines columns from one or more tables.

There are many types of JOIN. The SQL standard defines four: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN, which are easy to understand in terms of set operations. The figure below gives an intuitive illustration of what these four JOIN types do.

Besides these, there are additional JOIN types, for example SEMI JOIN and ANTI JOIN. Neither has its own SQL syntax, but both are commonly used to implement certain SQL features; they will be covered in detail later.

This article uses the following example throughout. It contains two tables, a student table and a score table, each with a number of records.

First, look at the SQL statements corresponding to the JOIN types in the figure. You can view the query plans of the six JOIN types with EXPLAIN. (You need to set enable_mergejoin and enable_hashagg to off, otherwise the optimizer might choose a different query plan.)

The figure below shows the results, giving us an intuitive sense of what each JOIN does.

There are three classical JOIN algorithms: Nested Loop, Merge Join, and Hash Join. Each has pros and cons: Nested Loop usually performs poorly, but it works for any kind of join condition; Merge Join performs very well on pre-sorted data; Hash Join usually performs best on large data volumes, but it can only handle equi-joins and cannot handle conditions such as c1 > c2.

Hash Join is a classical algorithm consisting of two phases. The first is the build phase, which ideally builds a hash table from the smaller of the two tables; this table is usually called the inner table. The second is the probe phase, which scans the other table of the join and probes the hash table for matching rows/tuples; this table is usually called the outer table.

Let's start with inner join. The figure below shows an inner join example; on the left is its query plan, and on the right is a graphical plan tree. The inner table is also commonly called the right table, and the outer table the left table.

First comes the build phase. In this phase, each tuple of the inner table is scanned, its hash value is computed from the join key, and the tuple is placed into the corresponding bucket of the hash table. When the inner-table scan finishes, the hash table is complete.

The second stage is the probe phase: each tuple of the outer table is scanned, its hash value is computed, and the hash table is probed for matching tuples. If a match is found and all query conditions are satisfied, the joined tuple is output. The outer table's tuples are processed one by one in this way.
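The two phases just described can be sketched in a few lines of Python. This is a minimal in-memory inner hash join; the sample tables and function name are illustrative, not PostgreSQL's actual code:

```python
from collections import defaultdict

def hash_inner_join(inner, outer, inner_key, outer_key):
    """Minimal two-phase hash join: build on the inner table, probe with the outer."""
    # Build phase: hash every inner tuple on its join key.
    table = defaultdict(list)
    for row in inner:
        table[row[inner_key]].append(row)
    # Probe phase: for each outer tuple, look up matching inner tuples.
    result = []
    for row in outer:
        for match in table.get(row[outer_key], []):
            result.append((match, row))
    return result

students = [(1, "Alice"), (2, "Bob")]    # (id, name)
scores = [(1, 90), (1, 85), (3, 70)]     # (student_id, score)
print(hash_inner_join(students, scores, 0, 0))
# student_id 3 has no match in the hash table, so only Alice's rows join
```

The inner table is scanned exactly once to build the table, and each outer tuple costs one hash lookup, which is why hash join scales well for large equi-joins.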

The next example is a full outer join, which differs from an inner join in how unmatched tuples are handled. As before, the inner table is scanned to build the hash table, then the outer table is scanned. If the join keys match and all query conditions are satisfied, the joined tuple is output. If the hash table has no matching tuple, the outer tuple is still output, with the inner table's columns in the result filled with NULLs. When the outer-table scan finishes, the hash table is scanned once more to find all inner tuples that never matched; each of these is output as well, with the outer table's columns filled with NULLs.
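Extending the earlier sketch to a full outer join mainly means tracking which inner tuples ever matched, then making a final pass over the hash table. Again a hedged illustration, not PostgreSQL's code:

```python
from collections import defaultdict

def hash_full_outer_join(inner, outer, inner_key, outer_key):
    """Full outer hash join: remember which inner tuples matched at least once."""
    table = defaultdict(list)
    for row in inner:
        table[row[inner_key]].append([row, False])   # [tuple, matched?]
    result = []
    for row in outer:
        entries = table.get(row[outer_key], [])
        if entries:
            for entry in entries:
                result.append((entry[0], row))
                entry[1] = True                      # mark inner tuple as matched
        else:
            result.append((None, row))               # no match: NULL-pad inner side
    # Final pass over the hash table: emit inner tuples that never matched.
    for entries in table.values():
        for row, matched in entries:
            if not matched:
                result.append((row, None))           # NULL-pad outer side
    return result
```

Left and right outer joins fall out of the same mechanics by doing only one of the two NULL-padding steps.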

Here is a somewhat confusing point: the SQL-level JOIN type versus the internal JOIN type used by the implementation. Consider two SQL statements that are both LEFT JOINs, yet internally use Hash Left Join in one case and Hash Right Join in the other. In the first example, a large table is left-joined to a small table, and the internal JOIN type is a left join; in the second, a small table is left-joined to a large table, and the internal JOIN type is a right join. The reason is that the optimizer prefers to make the smaller table the inner table and build the hash table on it.

Now let's look at semi join. Semi join is usually used to implement EXISTS. It is similar to inner join, except that a semi join only cares whether a match exists, not how many tuples match.

An anti join outputs a tuple only when it has no match; it is used to implement NOT EXISTS.
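Because semi and anti join only care about the existence of a match, a sketch can get away with a set of join keys instead of a full hash table (the real executor still builds a hash table, so treat this as a simplification):

```python
def hash_semi_join(inner, outer, inner_key, outer_key):
    """Semi join (EXISTS): emit each outer tuple at most once if any match exists."""
    keys = {row[inner_key] for row in inner}
    return [row for row in outer if row[outer_key] in keys]

def hash_anti_join(inner, outer, inner_key, outer_key):
    """Anti join (NOT EXISTS): emit outer tuples that have no match at all."""
    keys = {row[inner_key] for row in inner}
    return [row for row in outer if row[outer_key] not in keys]

students = [(1, "Alice"), (2, "Bob")]
scores = [(1, 90), (1, 85), (3, 70)]
print(hash_semi_join(scores, students, 0, 0))   # students that have a score
print(hash_anti_join(scores, students, 0, 0))   # students with no score
```

Note that even though Alice has two scores, the semi join emits her row only once; that is exactly the difference from an inner join.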

The implementation so far looks elegant and intuitive, but it ignores one problem: what if the inner table does not fit in memory? The classical solution is divide and conquer. Grace hash join is the classic algorithm for this problem: it partitions both the inner table and the outer table by join key into multiple partitions, saves each partition to disk, and then applies the in-memory hash join algorithm described above to each pair of partitions. Each partition is called a batch. The basic idea is still to compute a hash value from the join key, but from that hash value both a bucketno and a batchno are computed:

  • bucketno = hashvalue MOD nbuckets
  • batchno = (hashvalue DIV nbuckets) MOD nbatch
  • nbuckets is the number of buckets and nbatch the number of batches; both are powers of 2, so bucketno and batchno can be computed with bit operations
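Since nbuckets and nbatch are powers of 2, MOD reduces to a bit mask and DIV to a right shift. A small Python check of this equivalence (the function name is illustrative):

```python
def bucket_and_batch(hashvalue, nbuckets, nbatch):
    """Compute bucketno and batchno with bit operations.

    nbuckets and nbatch are powers of 2, so MOD becomes a mask
    and DIV becomes a right shift.
    """
    bucketno = hashvalue & (nbuckets - 1)                           # hashvalue MOD nbuckets
    batchno = (hashvalue >> (nbuckets.bit_length() - 1)) & (nbatch - 1)
    return bucketno, batchno

# Cross-check against the plain MOD/DIV formulas from the text.
for h in (0, 3435, 1 << 20, 123456789):
    assert bucket_and_batch(h, 1024, 4) == (h % 1024, (h // 1024) % 4)
```

Using the high bits of the hash for the batch and the low bits for the bucket keeps the two partitionings independent, which matters later when the batch count is doubled.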

Hybrid hash join is an optimization on top of grace hash join: the first batch is not written to disk, avoiding the disk IO for that batch.

Hybrid hash join first partitions the inner table into batches, computing each tuple's batchno with the algorithm above. If a tuple belongs to batch 0, it is added to the in-memory hash table; otherwise it is written to the disk file for its batch. Batch 0 is never written to a disk file.

The outer table is then partitioned into batches the same way. If an outer tuple belongs to batch 0, the hash join algorithm described earlier runs immediately: the hash table is probed for inner tuples matching the outer tuple, and if one exists and all conditions are satisfied, a match is found and output; otherwise processing continues with the next tuple. If the outer tuple does not belong to batch 0, it is written to the disk file for its batch.

When the outer-table scan ends, batch 0 is fully processed. Processing continues with batch 1: the inner table's batch 1 data is loaded from its temporary file into memory and a hash table is built, then the outer table's batch 1 temporary data is scanned and the probe operation described above is performed. When batch 1 completes, processing continues with batch 2, and so on until all batches are done.
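The whole hybrid scheme can be simulated in memory, with dictionaries of lists standing in for the per-batch temporary disk files (a sketch under that assumption, not the real executor):

```python
from collections import defaultdict

def hybrid_hash_join(inner, outer, key, nbatch):
    """Hybrid hash join: batch 0 stays in memory; other batches are 'spilled'."""
    batchno = lambda row: hash(row[key]) % nbatch
    # Partition the inner table: batch 0 goes straight into the hash table.
    hashtable = defaultdict(list)
    inner_spill = defaultdict(list)          # stands in for per-batch temp files
    for row in inner:
        if batchno(row) == 0:
            hashtable[row[key]].append(row)
        else:
            inner_spill[batchno(row)].append(row)
    # Probe with the outer table: batch 0 probes immediately, the rest spill.
    result = []
    outer_spill = defaultdict(list)
    for row in outer:
        if batchno(row) == 0:
            result += [(m, row) for m in hashtable.get(row[key], [])]
        else:
            outer_spill[batchno(row)].append(row)
    # Process the remaining batches one at a time.
    for b in range(1, nbatch):
        hashtable = defaultdict(list)
        for row in inner_spill[b]:
            hashtable[row[key]].append(row)
        for row in outer_spill[b]:
            result += [(m, row) for m in hashtable.get(row[key], [])]
    return result
```

Because both tables are partitioned with the same function, matching tuples always land in the same batch, so each batch can be joined independently.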

The chart below describes how to decide whether multiple batches are needed: if the size of the inner table plus the bucket overhead is less than work_mem, a single batch is used; otherwise multiple batches are required.

Algorithm inputs:

  • plan_rows: estimated number of rows of the inner table
  • plan_width: estimated average row width of the inner table
  • NTUP_PER_BUCKET: number of tuples per bucket; this value was 10 in old versions and is 1 in newer versions, on the assumption that hash collisions are rare and a bucket holds one tuple on average
  • work_mem: the memory quota allocated for the hash join
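The sizing decision can be sketched from those inputs. The per-tuple overhead constant below is illustrative, not PostgreSQL's exact bookkeeping figure, and the real code handles more corner cases:

```python
def choose_nbatch(plan_rows, plan_width, work_mem_kb,
                  ntup_per_bucket=1, tuple_overhead=48):
    """Simplified batch-count decision (tuple_overhead is an assumed,
    illustrative per-tuple bookkeeping cost)."""
    inner_bytes = plan_rows * (plan_width + tuple_overhead)
    bucket_bytes = (plan_rows // ntup_per_bucket) * 8   # one pointer per bucket
    work_mem_bytes = work_mem_kb * 1024
    if inner_bytes + bucket_bytes <= work_mem_bytes:
        return 1                                        # a single batch fits
    nbatch = 2
    while inner_bytes / nbatch > work_mem_bytes:
        nbatch *= 2                                     # keep nbatch a power of 2
    return nbatch

print(choose_nbatch(1000, 52, 4096))        # small inner table: 1 batch
print(choose_nbatch(1_000_000, 52, 1024))   # large inner table: many batches
```

The batch count is rounded up to a power of 2 so that batchno can be computed with bit operations, as described above.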

So what if batch 0 is still too large, and memory cannot hold it?

The approach is to double the number of batches, from n to 2n, then rescan the tuples in batch 0 and recompute which of the 2n batches each belongs to. If a tuple still belongs to batch 0, it stays in memory; otherwise it is removed from memory and written to the file of its new batch.

Tuples already in batch files are not moved at this point; they are dealt with when their batch is processed.

Because the number of batches has changed, some tuples in a batch file may no longer belong to that batch. The hybrid hash join algorithm (the modulo operation) guarantees that after the batch count doubles, a tuple's batch number can only stay the same or move to a later batch, never an earlier one.
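This forward-only property follows from modular arithmetic: for quotient q = hashvalue DIV nbuckets, q MOD 2n is either q MOD n or q MOD n + n, never smaller. A quick check with illustrative parameters:

```python
def batchno(hashvalue, nbuckets, nbatch):
    return (hashvalue // nbuckets) % nbatch

# After doubling nbatch from n to 2n, a tuple either stays in its batch
# or moves to batch (old + n) -- it never moves to an earlier batch.
nbuckets, n = 1024, 8
for h in range(0, 200000, 777):
    old = batchno(h, nbuckets, n)
    new = batchno(h, nbuckets, 2 * n)
    assert new in (old, old + n)
print("tuples only move to higher-numbered batches")
```

This is what makes the lazy strategy above safe: a batch file processed later may contain tuples destined for even later batches, but never tuples that should already have been processed.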

When processing batch i, if that batch contains too many inner tuples and takes up too much space, memory may again be insufficient.

This causes the number of batches to keep doubling, as shown below.

The batchno that tuples in the current batch belong to changes as well. As a concrete example, assume nbatch = 10; after doubling twice, nbatch = 40. The tuples already in batch 3 satisfy hashvalue % 10 = 3, so their hash values may be 3, 13, 23, 33, 43, 53, ... When nbatch grows to 40, hashvalue % 40 for these tuples may be 3, 13, 23, or 33.

PostgreSQL's hash join implementation adds several optimizations on top of the classic hybrid hash join; an important one is the optimization for skewed data. Much real-world data is not uniformly distributed. Take pets, for example: assuming everyone on the planet has a pet, cats and dogs will make up the majority.

The core idea of the skew optimization is to avoid disk IO as much as possible, by processing the most common values (MCV, Most Common Value) of the outer table during the batch 0 stage. The MCVs of the outer table are used, rather than those of the inner table, because the optimizer usually chooses the smaller table as the inner table; the outer table is therefore typically larger and more likely to be skewed.

First, the skew hash table is prepared, in three steps:

  • Determine the skew hash table size. By default PostgreSQL allocates 2% of the hash-table memory budget to the skew hash table, and computes how many MCV tuples it can hold.
  • Obtain the outer table's MCV statistics from the pg_statistic syscache; for each MCV, compute its hash value and place it in the corresponding skew bucket. No inner-table tuple has been processed yet, so each bucket points to NULL. Hash collisions are resolved by linear probing: if the current slot is occupied, the next one is taken. The skew bucket count is sized to keep the skew hash table sparse enough that a free slot can always be found.
  • Fill the skew hash table: while scanning the inner table to build the main hash table, if the current tuple belongs in the skew hash table (its slot is not empty), it is added to the skew hash table instead of the main hash table.

Then the outer table is scanned: if a tuple is an MCV tuple, it is processed with the skew hash table; otherwise it is processed by the hybrid hash join algorithm described earlier. If the skew optimization lets the batch 0 stage handle the 50% of outer tuples that are MCVs, it saves roughly 50% of the disk IO.
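A compact sketch of the idea, with a plain set standing in for the MCV statistics from pg_statistic and an in-memory dict standing in for the hybrid hash join path (both are assumptions of this illustration):

```python
from collections import defaultdict

def skew_hash_join(inner, outer, key, mcv_keys):
    """Skew-optimization sketch: inner tuples whose join key is an MCV of
    the outer table go into a small separate skew table and are joined
    during batch 0, so the most frequent outer values avoid disk IO."""
    skew = {k: [] for k in mcv_keys}          # slots prepared from MCV stats
    main = defaultdict(list)                  # stands in for the hybrid path
    for row in inner:
        target = skew if row[key] in skew else main
        target[row[key]].append(row)
    result = []
    for row in outer:
        k = row[key]
        if k in skew:                          # MCV tuple: in-memory skew table
            result += [(m, row) for m in skew[k]]
        else:                                  # otherwise: normal hybrid path
            result += [(m, row) for m in main.get(k, [])]
    return result
```

The more skewed the outer table, the larger the fraction of probes served entirely from the small in-memory skew table.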

Parallel join is not covered here. The main reason is that although PostgreSQL's hash join implementation looks elegant, its parallel hash join introduced roughly as much code again to handle parallelism. Greenplum, by contrast, provides a very elegant way to handle parallel hash join with almost no code changes.

Next, let's look at Hash Join in Greenplum.

First, a brief introduction to Greenplum. A Greenplum cluster is essentially many PostgreSQL nodes; however, it is not merely many PostgreSQL nodes glued together: it provides users with a single transparent logical database that satisfies ACID.

The Greenplum team has done a great deal of work on distributed data storage, distributed query optimization, distributed execution, memory and transaction management, concurrency control, cluster management, and other areas, providing users with a high-performance, linearly scalable, full-featured logical database.

This is a typical Greenplum topology. Greenplum has a master node and segment nodes; each node has its own storage, memory, and compute unit, and the nodes communicate over the network. This architecture is also called a shared-nothing MPP architecture. The whole stack, from disks and segments to the network and the master, is highly available at every level.

Greenplum has two key concepts:

  • Distribution policy: controls which segment each tuple is stored on; currently hash distribution, random distribution, replicated tables, and custom distribution are supported
  • Motion: transfers data between segments; there are three kinds: gather, redistribute, and broadcast

In our example, the distribution key of both tables is the student id, so the example join can be executed locally on each segment, with a Gather Motion collecting the results on the master. The execution of this query is very similar to plain PostgreSQL.

Now suppose the student table distributes its data across segments by student id, while the score table distributes by its own id column. A student's score information may then be spread over different nodes, so the previous query plan would produce incorrect results. To solve this, the query plan introduces a Broadcast Motion: the outer child of the Hashjoin node becomes a broadcast motion node, so every segment sees the full outer table, guaranteeing a correct hash join result.

This image shows the execution flow of that SQL. For more information on Slices, Gangs, etc., refer to the Greenplum team's book "Greenplum: From Big Data Strategy to Implementation".

Greenplum's hash join implementation is similar to PostgreSQL's, with several enhancements:

  • The temporary batch files can be compressed. zstd strikes a good balance between compression/decompression speed and compression ratio, so the zstd compression algorithm is used.
  • A Left Anti Semi Join type was added to optimize NOT IN scenarios.

As we can see, at the executor level both PostgreSQL (including its single-node parallelism) and Greenplum implement hash join based on the hybrid hash join algorithm, with only minor differences in executor-level details; the main changes are at the optimizer level. The same is true of other parallel databases such as CitusDB.

This article deliberately avoids code-level details, instead explaining the logic of the Hash Join implementation at the conceptual level. If you are interested, you can refer to the code: once you understand the processing logic, the code is relatively easy to read. The corresponding files are nodeHash.c and nodeHashjoin.c.

The main code logic is in ExecHashJoin(), which implements a state machine with six main states; the state transitions are roughly as shown in the figure, for reference only.


Origin blog.csdn.net/gp_community/article/details/104778618