Hive: combing through and optimizing from the bottom up

Summary

  • PS: this summary is just to help my own overall understanding; if anything is unreasonable, guidance is welcome
  • Given a piece of SQL, we can analyze how it actually executes

SQL execution order

  • (7) SELECT
  • (8) DISTINCT <select_list>
  • (1) FROM <left_table>
  • (3) <join_type> JOIN <right_table>
  • (2) ON <join_condition>
  • (4) WHERE <where_condition>
  • (5) GROUP BY <group_by_list>
  • (6) HAVING <having_condition>
  • (9) ORDER BY <order_by_condition>
  • (10) LIMIT <limit_number>
  • FROM -> ON -> JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> UNION -> ORDER BY -> LIMIT

What these keywords determine

  • select 1 from 2 where 3 group by 4 having 5 order by 6 limit 7
    • 1 - determines which columns appear in the result, either existing columns or columns generated by functions: column filtering
    • 2 - determines the data source to read from
    • 3 - decides which rows are kept: row filtering
    • 4 - decides which conditions to group by
    • 5 - decides which groups are kept after grouping
    • 6 - decides which conditions to sort by
    • 7 - limits the output (see the annotated query after this list)
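
A minimal annotated sketch of the above, using a hypothetical orders table; the comments give each clause's logical execution step:

    SELECT   dept, COUNT(*) AS cnt   -- (7) project columns
    FROM     orders                  -- (1) read the data source
    WHERE    amount > 0              -- (4) filter rows
    GROUP BY dept                    -- (5) group
    HAVING   COUNT(*) > 10           -- (6) filter groups
    ORDER BY cnt DESC                -- (9) sort
    LIMIT    100;                    -- (10) limit the output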

Correspondingly, how do we optimize?

  • Hive converts a query into one or more stages. Such stages can be a MapReduce stage, a sampling stage, a merge stage, or a limit stage (the EXPLAIN sketch below shows this).
  • We know that the bottom layer of Hive is MapReduce. For Hive, optimization can be split into two directions, SQL optimization and parameter optimization, and most of it essentially targets MapReduce: for example, reducing the amount of data, and thereby the amount of data transferred; or choosing a reasonable number of map and reduce tasks so that work executes in parallel.
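
To inspect the stages Hive plans for a query, EXPLAIN can be used; a sketch with a hypothetical query (output abridged):

    EXPLAIN
    SELECT dept, COUNT(*) FROM orders GROUP BY dept;
    -- The plan lists the stage graph, roughly:
    --   STAGE DEPENDENCIES: Stage-1 is a root stage; Stage-0 depends on Stage-1
    --   Stage-1: Map Reduce (TableScan -> GroupBy on the map side, GroupBy on the reduce side)
    --   Stage-0: Fetch Operator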

First, the input stage of MR

  • Reduce the amount of data
  • If it is a partitioned table, specify the partition when querying, i.e. use the partition field in the WHERE clause whenever possible, to reduce the amount of data
  • If the logic involves a temporary table, do proper column pruning (i.e. keep only the columns we need) to reduce the amount of data
  • If there is a join, filter first and join afterwards, to reduce the amount of data (a sketch of all three points follows this list)
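
A sketch combining the three points, assuming a hypothetical logs table partitioned by dt and a users dimension table:

    SELECT a.uid, a.url, b.city
    FROM (
      SELECT uid, url                -- column pruning: only the columns we need
      FROM logs
      WHERE dt = '2021-01-01'        -- partition field in WHERE: partition pruning
        AND url IS NOT NULL          -- filter before the join
    ) a
    JOIN users b ON a.uid = b.uid;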

Second, the map stage of MR

  • Each small file starts one map. If there are many small files, the time spent starting and initializing map tasks may far exceed the actual processing time. That is to say, we can merge small files to reduce the number of maps: a combine parameter merges small files before execution, and a merge parameter merges small files when the MR job ends; at the same time, the minimum/maximum split size can be configured (refer to https://blog.csdn.net/qq_46893497/article/details/113864209 )
  • Of course, there are also files of about 128 MB that start only one map yet carry very complex logic. In that case we should reasonably split the file so that multiple maps complete the task, by configuring the number of MR tasks.
  • Speculative execution can also be enabled on the map side
  • Map join can also be enabled (a parameter sketch follows this list)
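
A sketch of the relevant settings; the names are standard Hive/Hadoop parameters, but the values are illustrative and defaults vary by version:

    -- Combine small files into larger splits before the maps start.
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    set mapreduce.input.fileinputformat.split.maxsize=256000000;  -- max split size
    set mapreduce.input.fileinputformat.split.minsize=1;          -- min split size
    -- Merge small files when the job ends.
    set hive.merge.mapfiles=true;      -- after map-only jobs
    set hive.merge.mapredfiles=true;   -- after map-reduce jobs
    -- Map-side speculative execution and map join.
    set mapreduce.map.speculative=true;
    set hive.auto.convert.join=true;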

Third, the shuffle stage of MR

  • Configure compression parameters to reduce network output (see the sketch below)
  • Specify join conditions whenever possible, to avoid producing Cartesian products
  • Bucketed tables can also be designed
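
A sketch of the intermediate-compression settings (standard parameter names; Snappy chosen here for speed):

    -- Compress intermediate (shuffle) data to cut network traffic.
    set hive.exec.compress.intermediate=true;
    set mapreduce.map.output.compress=true;
    set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;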

Fourth, the reduce stage of MR

  • Adjust an appropriate number of reducers
    • Set the number of bytes each reducer processes, or the number of tasks, to avoid too many small files
    • Prefer group by over distinct
  • Distribute the data as evenly as possible across the reducers
    • Start two MR jobs via the skewjoin parameter: the first distributes randomly, the second aggregates
    • For keys that appear too often in the ON condition, add a random prefix
    • Turn a null key into a string plus a random number so rows are assigned to different reducers; null never joins, so the result is unaffected
    • Speculative execution can also be enabled on the reduce side (a sketch follows this list)
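
A sketch of the reducer settings plus the group-by-instead-of-distinct rewrite (standard parameter names; values illustrative):

    -- Control the number of reducers.
    set hive.exec.reducers.bytes.per.reducer=256000000;  -- bytes handled per reducer
    set mapreduce.job.reduces=-1;                        -- -1 lets Hive decide
    -- Reduce-side speculative execution.
    set mapreduce.reduce.speculative=true;
    -- Prefer GROUP BY over DISTINCT so the aggregation spreads across reducers:
    SELECT COUNT(*) FROM (SELECT uid FROM logs GROUP BY uid) t;
    -- instead of: SELECT COUNT(DISTINCT uid) FROM logs;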

Fifth, the output stage of MR

  • A merge parameter merges small files when the MR job ends
  • Configure compression parameters to reduce storage space (see the sketch below)
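
A sketch of the output-stage settings (standard parameter names; sizes illustrative):

    -- Merge small output files at the end of the job.
    set hive.merge.mapredfiles=true;
    set hive.merge.size.per.task=256000000;      -- target file size after merging
    set hive.merge.smallfiles.avgsize=16000000;  -- merge when average output is below this
    -- Compress the final output to save storage.
    set hive.exec.compress.output=true;
    set mapreduce.output.fileoutputformat.compress=true;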

Sixth, the job as a whole

  • Enable MR speculative execution
  • Enable MR JVM reuse
  • Configure compression
  • Enable parallel execution (see the sketch below)
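
A sketch of the job-wide settings (JVM reuse is an MRv1-style knob and may not apply on YARN; verify on your version):

    set mapreduce.map.speculative=true;         -- speculative execution
    set mapreduce.reduce.speculative=true;
    set mapreduce.job.jvm.numtasks=10;          -- JVM reuse: tasks per JVM
    set hive.exec.compress.intermediate=true;   -- compression
    set hive.exec.parallel=true;                -- parallel stage execution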

********** The following directions can be studied in depth **********

Configuration optimization

  • Optimization of SQL LIMIT
  • Switching the execution engine to Spark for faster queries

Compression configuration optimization

  • Map-side output: compression reduces network transmission; the codec should be splittable
  • Shuffle stage: compression reduces network transmission; the codec should be fast
  • Reduce-side output: compression reduces storage space; the codec should have a high compression ratio

Zipper table use

Bucketing

  • Partitions are further split into buckets to obtain higher query efficiency
  • You cannot load data into a bucketed table directly: first enable the bucketing parameter, then create a temporary table, and then insert from it (see the sketch below)
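
A hypothetical sketch of that flow (the enforce flag is needed on older Hive versions; Hive 2+ always enforces bucketing):

    set hive.enforce.bucketing=true;
    CREATE TABLE user_bucketed (uid BIGINT, name STRING)
    CLUSTERED BY (uid) INTO 8 BUCKETS;
    -- LOAD DATA cannot bucket the rows; insert from a plain staging table instead.
    INSERT OVERWRITE TABLE user_bucketed
    SELECT uid, name FROM user_staging;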

Parallel operation

  • By default, Hive compiles and executes only one segment of a HiveSQL statement at a time, holding a lock
  • Parallel execution can be enabled so that independent stages compile and run simultaneously, maximizing parallelism (see the sketch below)
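
A sketch of the parallelism settings (the thread count is illustrative):

    set hive.exec.parallel=true;             -- run independent stages concurrently
    set hive.exec.parallel.thread.number=8;  -- max stages running at once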

Index

  • Row group index
    • suited to non-equi joins and range filters
  • Bloom filter index
    • suited to equi joins and equality filters (see the ORC sketch below)
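
A sketch of how both indexes are declared on an ORC table (the property names are standard ORC TBLPROPERTIES; the table is hypothetical):

    CREATE TABLE orders_orc (id BIGINT, dept STRING, amount DOUBLE)
    STORED AS ORC
    TBLPROPERTIES (
      'orc.create.index' = 'true',        -- row group (min/max) index
      'orc.bloom.filter.columns' = 'id'   -- bloom filter index on the lookup column
    );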

Handling of small files

Data skew

  • Essence: distribute the data evenly across the reducers

Parameter adjustment

  • Map-side pre-aggregation, equivalent to a combiner
    • hive.map.aggr=true;
  • Start two MR jobs: the first distributes the map results randomly and aggregates partially; the second performs the final aggregation
    • hive.groupby.skewindata=true;
  • Enable skew join (runtime/compile time), enable the union optimization (to avoid a second read and write), and set the threshold for judging key skew
    • set hive.optimize.skewjoin=true;
    • set hive.optimize.skewjoin.compiletime=true;
    • set hive.optimize.union.remove=true;
    • set hive.skewjoin.key=100000; The default value is 100000.

SQL adjustment

  • join
    • Use a uniformly distributed table as the driving table, and do proper column pruning and filtering
    • Small table with concentrated keys
      • Load the small table into memory ahead of time (map join)
    • Large bucketed table, but with too many special values
      • Turn a null key into a string plus a random number so rows are assigned to different reducers; null never joins, so the result is unaffected
  • group by: the dimension is too small and some values occur too often
  • count distinct: too many special values
    • Handle null values separately, then combine with a union (see the sketch after this list)
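
A sketch of the two null-handling patterns, with hypothetical logs/users tables (the CASE-in-ON form is the classic Hive null-skew rewrite):

    -- Null join keys: replace with a random string so rows spread across reducers;
    -- the synthetic keys never match, so the join result is unchanged.
    SELECT a.*
    FROM logs a
    LEFT JOIN users b
      ON CASE WHEN a.uid IS NULL THEN CONCAT('null_', RAND()) ELSE a.uid END = b.uid;

    -- COUNT DISTINCT with many nulls: count non-nulls, union in the null bucket.
    SELECT SUM(cnt) FROM (
      SELECT COUNT(DISTINCT uid) AS cnt FROM logs WHERE uid IS NOT NULL
      UNION ALL
      SELECT IF(COUNT(1) > 0, 1, 0) AS cnt FROM logs WHERE uid IS NULL
    ) t;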

Optimizer

SQL optimization

  • Column pruning
  • Constant folding
  • Predicate pushdown

Small table

  • Enable mapjoin optimization and increase the mapjoin size threshold to reduce shuffle; the default is 20 MB and can be adjusted upwards
    • set hive.auto.convert.join=true;
    • set hive.auto.convert.join.noconditionaltask.size=512000000
  • The principle of mapjoin is roughly as follows:
    • When joining two tables, ordinarily several mappers sort the tables on the join key and generate temporary files, and the reducers take these files as input to perform the join. When one table is large and the other small, this is inefficient: thousands of mappers read different splits of the big table, and each of them must also read the small table from HDFS into local memory, which may become a performance bottleneck.
  • The mapjoin optimization works like this: before the MapReduce task starts, a local task loads the small table into memory as a hashtable and serializes it to disk, compressing the in-memory hashtable into a tar file. The file is then distributed to each mapper via the Hadoop Distributed Cache; each mapper deserializes it locally, loads it into memory, and performs the join (see the sketch below).
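
A minimal sketch; the automatic conversion usually suffices, and the hint is the older explicit spelling (table names hypothetical):

    set hive.auto.convert.join=true;
    SELECT /*+ MAPJOIN(s) */ b.uid, s.city
    FROM big_table b JOIN small_table s ON b.uid = s.uid;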

Large table joined with a medium table

  • Enable bucket map join: when the number of buckets in one table is an integer multiple of the other's, both tables can be hashed on the join key and then joined
    • set hive.optimize.bucketmapjoin = true;
    • The number of buckets in one table is an integer multiple of the number of buckets in another table
    • bucket column == join column
    • Must be used in map join scenarios
    • If the tables are not bucketed, an ordinary join is performed.

Large table joined with a large table

  • Enable SMB join (sort-merge-bucket join), based on sorted bucketed tables
    • set hive.optimize.bucketmapjoin = true;
    • set hive.auto.convert.sortmerge.join=true;
    • set hive.optimize.bucketmapjoin.sortedmerge = true;
    • set hive.auto.convert.sortmerge.join.noconditionaltask=true;
  • The number of buckets in the small table equals the number of buckets in the large table
  • bucket column == join column == sort column
  • Must be used in a bucket mapjoin scenario (see the sketch below)
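
A sketch of the table layout SMB join expects, with hypothetical tables (both bucketed and sorted on the join column, same bucket count):

    CREATE TABLE orders_a (uid BIGINT, amount DOUBLE)
    CLUSTERED BY (uid) SORTED BY (uid) INTO 16 BUCKETS;
    CREATE TABLE orders_b (uid BIGINT, city STRING)
    CLUSTERED BY (uid) SORTED BY (uid) INTO 16 BUCKETS;
    -- With the flags above set, this runs as a sort-merge-bucket map join.
    SELECT a.uid, a.amount, b.city
    FROM orders_a a JOIN orders_b b ON a.uid = b.uid;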

Origin blog.csdn.net/qq_46893497/article/details/114047447