Hive optimization
To sum up
- P.S.: This summary is just to help my overall understanding; if anything is unreasonable, guidance is welcome
- For a given piece of SQL, we can analyze it as follows
SQL execution order
- (7) SELECT
- (8) DISTINCT <select_list>
- (1) FROM <left_table>
- (3) <join_type> JOIN <right_table>
- (2) ON <join_condition>
- (4) WHERE <where_condition>
- (5) GROUP BY <group_by_list>
- (6) HAVING <having_condition>
- (9) ORDER BY <order_by_condition>
- (10) LIMIT <limit_number>
- FROM -> ON -> JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> UNION -> ORDER BY -> LIMIT
What these keywords determine
select 1 from 2 where 3 group by 4 having 5 order by 6 limit 7
- 1 - Determines which columns appear in the result: either existing columns or columns generated by functions (column filtering)
- 2 - Determines the data source to read from
- 3 - Determines which rows to filter
- 4 - Determines the conditions to group by
- 5 - Determines which rows to filter after grouping
- 6 - Determines the conditions to sort by
- 7 - Limits the output
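The execution order above can be read off an annotated query; a minimal sketch (the `orders` and `users` tables are made up for illustration):

```sql
SELECT   u.city, COUNT(*) AS cnt      -- (7) SELECT; (8) DISTINCT would apply here
FROM     orders o                     -- (1) FROM
JOIN     users  u                     -- (3) JOIN
  ON     o.user_id = u.id             -- (2) ON
WHERE    o.dt = '2021-01-01'          -- (4) WHERE
GROUP BY u.city                       -- (5) GROUP BY
HAVING   COUNT(*) > 10                -- (6) HAVING
ORDER BY cnt DESC                     -- (9) ORDER BY
LIMIT    100;                         -- (10) LIMIT
```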
How do we optimize each of these stages?
- Hive converts a query into one or more stages. These stages can be a MapReduce stage, a sampling stage, a merge stage, or a limit stage.
- We know that Hive's underlying engine is MapReduce. For Hive, optimization falls into two directions, SQL optimization and parameter optimization, and most of it essentially targets MapReduce itself: for example, reducing the amount of data and thereby the data transferred, or setting a reasonable number of map and reduce tasks so that they execute in parallel.
First, for the input stage of MR
- Reduce the amount of data
- If it is a partitioned table, specify the partition when querying, i.e. use the partition column in the WHERE clause as much as possible, to reduce the data volume
- If the logic uses temporary tables, do proper column pruning (i.e. keep only the columns we actually need), to reduce the data volume
- If there is a join, filter first and then join, to reduce the data volume
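The three input-stage points can be combined in one query; a sketch, assuming a table `orders` partitioned by `dt` (all names hypothetical):

```sql
SELECT o.user_id, u.city               -- column pruning: select only needed columns
FROM (
  SELECT user_id, amount
  FROM orders
  WHERE dt = '2021-01-01'              -- partition column in WHERE: only one partition is scanned
    AND amount > 0                     -- filter BEFORE the join, not after it
) o
JOIN users u
  ON o.user_id = u.id;
```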
Second, the map stage of MR
- Each small file starts one map. If there are many small files, the time spent starting and initializing a map task can far exceed the time spent on its actual logic. So we can merge small files to reduce the number of maps: configure the combine-input parameters to merge small files before execution, and the merge parameters to merge small files when the MR job ends. The minimum/maximum split size can also be configured (refer to https://blog.csdn.net/qq_46893497/article/details/113864209 )
- Conversely, a file of about 128 MB starts only one map, but its logic may be very complex. In that case we should reasonably split the file across multiple maps to complete the task, by configuring the number of map tasks
- Map-side speculative execution can also be enabled
- Mapjoin can also be enabled
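A sketch of the map-stage parameters referred to above (the names are the standard Hive/MapReduce ones; the values are only illustrative):

```sql
-- Merge small input files before execution (combine input format)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Control min/max split size (bytes), which drives the number of maps
set mapreduce.input.fileinputformat.split.maxsize=256000000;
set mapreduce.input.fileinputformat.split.minsize=1;
-- Map-side speculative execution
set mapreduce.map.speculative=true;
-- Let Hive convert eligible joins into mapjoins automatically
set hive.auto.convert.join=true;
```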
Third, the shuffle stage of MR
- Configure compression parameters to reduce network output
- Always specify the join conditions when associating tables, to reduce the generation of Cartesian products
- Bucketed tables can also be designed
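The shuffle-stage compression mentioned above can be configured roughly as follows (standard parameter names; Snappy is just one common codec choice):

```sql
-- Compress intermediate (map output / shuffle) data to cut network transfer
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```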
Fourth, the reduce stage of MR
- Adjust the number of reducers appropriately
- By setting the number of bytes processed per reducer, or the number of reduce tasks, avoid producing too many small files
- Replace distinct with group by first
- Distribute the data as evenly as possible across the reducers
- For group-by skew, a parameter can start two MR jobs: the first randomizes the distribution and partially aggregates, the second performs the final aggregation
- For too many identical keys in the ON clause, enable skewjoin; a prefix can also be added: turn null keys into a string plus a random number so they are assigned to different reducers; null does not match in the join, so the result is unaffected
- Reduce-side speculative execution can also be enabled
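A sketch of the reduce-stage knobs and the null-key trick above (parameter names are standard; the `logs`/`users` tables are made up):

```sql
-- Reducer sizing (values illustrative)
set hive.exec.reducers.bytes.per.reducer=256000000;
set mapreduce.job.reduces=-1;          -- -1 lets Hive derive the count from the line above
set mapreduce.reduce.speculative=true; -- reduce-side speculative execution

-- Spread null join keys across reducers: nulls become distinct random strings,
-- which never match on the right side, so the join result is unchanged
SELECT a.*, b.name
FROM logs a
LEFT JOIN users b
  ON CASE WHEN a.user_id IS NULL
          THEN concat('null_', rand())
          ELSE a.user_id END = b.user_id;
```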
Fifth, the output stage of MR
- Configure the merge parameters to merge small files when the MR job ends
- Configure compression parameters to reduce storage space
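The output-stage merge and compression parameters can be set roughly like this (standard names; sizes illustrative):

```sql
-- Merge small files produced at the end of map-only and map-reduce jobs
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;
-- Compress the final output to save storage
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
```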
Sixth, from an overall perspective
- Enable MR speculative execution
- Enable MR JVM reuse
- Configure compression
- Enable parallel execution
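A sketch of the whole-job settings above (JVM reuse is a classic MapReduce-runtime parameter and may be ignored on newer runtimes; values illustrative):

```sql
-- Reuse each task JVM for up to 10 tasks
set mapreduce.job.jvm.numtasks=10;
-- Run independent stages of one query in parallel
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
-- Speculative execution on both sides
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
```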
********************* The following directions can be studied in depth *********************
Configuration optimization
- Optimization of SQL LIMIT
- Switching the execution engine to Spark for faster queries
Compression configuration optimization
- Map-side output: reduces network transmission; consider whether the codec is splittable
- Shuffle process: reduces network transmission; prefer a fast codec
- Reduce-side output: reduces storage space; prefer a high compression ratio
Zipper table use
Bucketing
- Partitions are further split into buckets to obtain higher query efficiency
- You can't load data into a bucketed table directly: first turn on the bucketing parameter, then create a temporary table, and insert from it
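The load-via-temporary-table flow above, sketched with hypothetical table names (`hive.enforce.bucketing` is needed on older Hive versions; it is always on in Hive 2+):

```sql
set hive.enforce.bucketing=true;

-- Bucketed target table
CREATE TABLE user_bucketed (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

-- Load via a plain staging table; a direct LOAD DATA would not bucket the rows
INSERT OVERWRITE TABLE user_bucketed
SELECT id, name FROM user_staging;
```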
Parallel operation
- By default, Hive compiles only one piece of HiveQL at a time, holding a lock
- Parallelism can be turned on so that independent stages run simultaneously, with a configurable maximum degree of parallelism
Indexes
- Row group index
- Suited to non-equality (range) conditions
- Bloom filter index
- Suited to equality conditions
Handling of small files
Data skew
- Essence: distribute the data evenly across the reducers
Parameter adjustment
- set hive.map.aggr=true; map-side pre-aggregation, equivalent to a combiner
- set hive.groupby.skewindata=true; starts two MR jobs: the first randomly distributes the map results and aggregates locally; the second performs the final aggregation
- set hive.optimize.skewjoin=true; enables skew join handling at runtime
- set hive.optimize.skewjoin.compiletime=true; enables skew join handling at compile time
- set hive.optimize.union.remove=true; enables the union optimization (avoids a second read and write)
- set hive.skewjoin.key=100000; threshold for judging that a key is skewed; the default value is 100000
SQL adjustment
- join
- Use a uniformly distributed table as the driving table, and do proper column pruning and filtering
- Small table with concentrated keys: load the small table into memory first (map join)
- Both tables large and bucketed, but with too many special values: turn null keys into a string plus a random number distributed across different reducers; null does not match in the join, so the result is unaffected
- group by: the grouping dimension is too small and some values are too frequent
- count distinct: too many special values
- Handle the null values separately, then combine with a union
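One common rewrite for the count distinct case above, sketched on a hypothetical `logs` table: exclude the special values (here, nulls) first, replace count(distinct) with group by, and count the special values in a separate branch that is unioned back in:

```sql
SELECT SUM(cnt) AS total
FROM (
  -- normal keys: group by avoids funneling every row into one reducer
  SELECT COUNT(*) AS cnt
  FROM (
    SELECT user_id FROM logs WHERE user_id IS NOT NULL GROUP BY user_id
  ) t
  UNION ALL
  -- null keys handled separately: count as one distinct value if any exist
  SELECT IF(COUNT(*) > 0, 1, 0) AS cnt
  FROM logs
  WHERE user_id IS NULL
) s;
```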
Optimizer
SQL optimization
- Column pruning
- Constant folding
- Predicate pushdown
Small table joins
- Enable mapjoin optimization; raise the mapjoin small-table size threshold to reduce shuffle (the default threshold is only a few tens of MB and can be adjusted larger)
- set hive.auto.convert.join=true;
- set hive.auto.convert.join.noconditionaltask.size=512000000;
- The principle of mapjoin is roughly as follows:
- When joining two tables, normally the mappers sort on the join key and generate intermediate files, and the reducers take these files as input to perform the join. When one table is large and the other is small, this is inefficient: thousands of mappers read different blocks of the big table, and each of them also has to read the small table from HDFS into local memory, which can become a performance bottleneck.
- The mapjoin optimization is that before the MapReduce job starts, a local task loads the small table into memory in the form of a hashtable, serializes it to disk, and compresses the hashtable into a tar file. The file is then distributed via the Hadoop Distributed Cache to each mapper; the mapper deserializes the file locally, loads it into memory, and performs the join.
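A mapjoin can be requested either automatically or with the classic hint; a sketch with made-up table names (`dim_city` plays the small table):

```sql
-- Automatic conversion (preferred on modern Hive)
set hive.auto.convert.join=true;

-- Explicit hint form (older style): the optimizer loads dim_city into memory
SELECT /*+ MAPJOIN(d) */ f.order_id, d.city_name
FROM fact_orders f
JOIN dim_city d
  ON f.city_id = d.city_id;
```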
Large table joined with a medium-sized table:
- Enable bucket mapjoin; when the number of buckets in one table is an integer multiple of the number in the other, the two tables can be hashed and joined bucket by bucket
- set hive.optimize.bucketmapjoin = true;
- The number of buckets in one table is an integer multiple of the number of buckets in the other
- bucket column == join column
- Must be used in a mapjoin scenario
- If the tables are not bucketed, an ordinary join is performed
Large table joined with a large table
- Enable SMB (sort merge bucket) join, based on sorted bucketed tables
- set hive.optimize.bucketmapjoin = true;
- set hive.auto.convert.sortmerge.join=true;
- set hive.optimize.bucketmapjoin.sortedmerge = true;
- set hive.auto.convert.sortmerge.join.noconditionaltask=true;
- The number of buckets in the small table = the number of buckets in the large table
- bucket column == join column == sort column
- Must be used in a bucket mapjoin scenario
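The SMB requirements above (same bucket count, bucket column == join column == sort column) look like this in DDL; a sketch with hypothetical tables:

```sql
-- Both tables bucketed AND sorted on the join key, with the same bucket count
CREATE TABLE big_a (
  id BIGINT, v STRING
)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE big_b (
  id BIGINT, w STRING
)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS
STORED AS ORC;

-- With the SMB settings enabled, this join can merge the sorted buckets
-- on the map side instead of shuffling both large tables
SELECT a.id, a.v, b.w
FROM big_a a JOIN big_b b ON a.id = b.id;
```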