Several commonly used implementations of Join in Spark SQL

1. Introduction

Join is a common operation in SQL statements. A good table design distributes data across different tables so that it conforms to a normal form, reducing redundancy and the risk of update anomalies. The way to re-establish relationships between those tables is the Join operation.

As an SQL engine for big data, Spark SQL naturally optimizes the Join operation. This article looks at the three common Join implementations in Spark SQL.

2. Common implementations of Join in Spark SQL

2.1 Broadcast Join

As we all know, in common database models (such as the star schema or the snowflake schema), tables are generally divided into two types: fact tables and dimension tables. Dimension tables are relatively fixed and change little, such as contacts or item types, and generally hold a limited amount of data. Fact tables generally record transactions, such as sales records, and usually grow over time.

Because a Join matches records with the same key value in two tables, the most direct way for Spark SQL to join two tables is to first repartition both tables by the key and then, within each partition, join the records that share the same key value. This inevitably involves a shuffle, and shuffle is an expensive operation in Spark, so we should design Spark applications to avoid large shuffles as much as possible.
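As a point of reference, here is a minimal sketch (Scala, DataFrame API) of an ordinary equi-join; the table and column names are made up for illustration. For sufficiently large inputs, the physical plan would typically show an Exchange (shuffle) on the join key feeding the join operator.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-demo").getOrCreate()
import spark.implicits._

// Illustrative data: a "fact" table of orders and a "dimension" table of item types.
val orders    = Seq((1, "A", 100.0), (2, "B", 50.0)).toDF("order_id", "item_type", "amount")
val itemTypes = Seq(("A", "book"), ("B", "pen")).toDF("item_type", "type_name")

val joined = orders.join(itemTypes, Seq("item_type"))

// Print the physical plan; for large inputs you would normally see both sides
// shuffled (Exchange) on item_type before the join operator.
joined.explain()
```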

When a dimension table is joined with a fact table, we can avoid the shuffle by distributing the entire (small) dimension table to every node for the fact table to use. Each executor then holds a full copy of the dimension table, trading some space for the elimination of the expensive shuffle. This is called Broadcast Join in Spark SQL, as shown in the following figure:

Broadcast Join in Apache Spark SQL
Table B is the smaller table; the black shading indicates that it is broadcast to every executor node. Each partition of Table A fetches Table B's data through the block manager, looks up the matching records in Table B by the Join Key of each record, and combines them according to the Join Type. The process is straightforward and is not described in further detail.
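To make the idea concrete, here is a conceptual sketch of a hand-rolled map-side join using a broadcast variable over the RDD API, reusing the `spark` session from the sketch above. The data and names are illustrative and this is not Spark SQL's internal code, but it captures the "ship the small table everywhere, look it up locally" principle.

```scala
// Small dimension table, collected on the driver and broadcast once to every executor.
val dimTable: Map[String, String] = Map("A" -> "book", "B" -> "pen")
val dimBroadcast = spark.sparkContext.broadcast(dimTable)

// Larger "fact" data, partitioned across the cluster.
val factRDD = spark.sparkContext.parallelize(Seq(("A", 100.0), ("B", 50.0), ("A", 30.0)))

// Each partition joins against its local copy of the broadcast table -- no shuffle involved.
val joined = factRDD.map { case (itemType, amount) =>
  (itemType, dimBroadcast.value.get(itemType), amount)
}
joined.collect().foreach(println)
```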

The conditions for Broadcast Join are as follows:

The table to be broadcast must be smaller than the value configured by spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (or the join must carry a broadcast hint).
The base table cannot be broadcast; for example, in a left outer join, only the right-hand table can be broadcast.

Broadcast looks like an ideal solution, but does it have drawbacks? It does, and they are obvious: it is only suitable for broadcasting small tables, otherwise the redundant transmission of the data costs far more than a shuffle would; in addition, the table to be broadcast must first be collected to the driver, so frequent broadcasts also put the driver's memory to the test.
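With the DataFrame API, a broadcast join can be requested either by keeping the smaller table under the threshold or with an explicit hint. A minimal sketch, assuming `factDF` and `dimDF` already exist (the names are illustrative):

```scala
import org.apache.spark.sql.functions.broadcast

// The threshold is given in bytes (10 MB by default); setting it to -1 disables
// automatic broadcast joins altogether.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Explicitly hint that the dimension table should be broadcast. In a left outer join,
// only the right-hand (non-base) table may be broadcast.
val result = factDF.join(broadcast(dimDF), Seq("item_type"), "left_outer")

// The physical plan should show BroadcastExchange / BroadcastHashJoin instead of a shuffle.
result.explain()
```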

2.2 Shuffle Hash Join

When the table on one side is relatively small, we broadcast it to avoid a shuffle and improve performance. However, because the broadcast table is first collected to the driver and then redundantly distributed to every executor, broadcasting a larger table puts considerable pressure on both the driver and the executors.

Spark, however, is a distributed computing engine, so a large dataset can be split into n smaller partitions and processed in parallel. Applying this idea to joins gives Shuffle Hash Join. Using the fact that records with the same key end up in the same partition, Spark SQL divides and conquers the join of larger tables: it first repartitions both tables into n partitions and then performs a Hash Join between the corresponding partitions of the two tables. To a certain extent this relieves the driver of the pressure of broadcasting one side and reduces the memory each executor needs to hold a full copy of the broadcast table. The principle is as follows:

Shuffle Hash Join in Apache Spark SQL
Shuffle Hash Join is divided into two steps:

1. Repartition the two tables by their join keys, i.e., shuffle, so that records with the same join key value are assigned to the same corresponding partitions.
2. Join the data in the corresponding partitions: the smaller table's partition is built into a hash table, which is then probed with the join key value of each record in the larger table's partition.

The conditions for Shuffle Hash Join are as follows:

The average size of a partition does not exceed the value configured by spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB.
The base table cannot be used as the build side; for example, in a left outer join, only the right-hand table can be used to build the hash table.
One side must be significantly smaller than the other, and the smaller side is used to build the hash table ("significantly smaller" means at least 3 times smaller, an empirical value).
We can see that, for tables of a certain size, Spark SQL trades time for space by repartitioning both tables and hashing the smaller table's partitions to complete the join. While keeping the complexity reasonable, this minimizes the memory pressure on the driver and executors and improves the stability of the computation.
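How the planner is steered toward a shuffle hash join depends on the Spark version; the following is a hedged sketch of the commonly used knobs, assuming `bigDF` and `smallDF` already exist (the names are illustrative):

```scala
// Turn off automatic broadcasting so the smaller side is not broadcast instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Spark prefers sort merge join by default; relaxing that preference lets the planner
// choose shuffle hash join when the per-partition size conditions above are met.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

val joined = bigDF.join(smallDF, Seq("join_key"))
joined.explain()  // the plan may show ShuffledHashJoin rather than SortMergeJoin

// On Spark 3.0 and later, a join hint can also be used:
// bigDF.join(smallDF.hint("SHUFFLE_HASH"), Seq("join_key"))
```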

2.3 Sort Merge Join

The two implementations described above suit tables up to a certain size; when both tables are very large, either approach puts great pressure on memory. This is because both ultimately perform a hash join, which requires loading one side's data entirely into memory and matching records whose join keys hash to the same value.

When both tables are very large, Spark SQL uses a different scheme to join them, namely Sort Merge Join. This implementation does not need to load all of one side's data before performing a hash join, but it does need to sort the data before joining, as shown in the following figure:

Sort Merge Join in Apache Spark SQL
As you can see, the two tables are first shuffled by their join keys to ensure that records with the same join key value are assigned to the corresponding partitions. After repartitioning, the data in each partition is sorted, and then the records in the corresponding sorted partitions are joined, as shown in the following figure:

Sort Merge Join in Apache Spark SQL

It can be seen that, regardless of partition size, Sort Merge Join does not need to load all of one side's data into memory; because both sides are sorted, records can be consumed and discarded as the merge proceeds, which greatly improves the stability of SQL joins over large data volumes.
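For completeness, a minimal sketch of observing this in practice; `largeA` and `largeB` are assumed to exist, and the column name is illustrative. With broadcasting disabled and the default preference for sort merge join, the plan typically shows the shuffle, sort, and merge steps described above.

```scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val joined = largeA.join(largeB, Seq("join_key"))

// The physical plan usually reads Exchange (shuffle on join_key) -> Sort -> SortMergeJoin,
// i.e. repartition both sides by the key, sort each partition, then merge matching records.
joined.explain()
```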

3. Summary
This article introduced the three Join implementations in Spark SQL. There is actually nothing new about them: traditional databases use the same techniques, and Spark SQL simply gives them a distributed implementation.

This article only introduces these implementations from a broad, theoretical perspective. Details such as how each join type iterates over the data, what happens when there are no join keys, and what specific requirements each implementation places on the join keys are not covered here; if you are interested, have a look at the source code.

This article is reproduced from: http://blog.csdn.NET/asongoficeandfire/article/details/53574034
