Hive: indexes (plus optimization)

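1. Indexes

Hive has only limited indexing support. It has no notion of keys as in ordinary relational databases, but you can still build an index on some columns to speed up certain operations. The index data for a table is stored in a separate table. Indexing can be seen as one of Hive's optimization features: it reduces the amount of input data MapReduce has to read, because the index table records, for each indexed value, where it occurs (file and offsets). Lookups can therefore be much faster, especially on large data sets.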

1. Create an index. We have an existing table, zxz_5.
Syntax:
create index zxz_5_index
on table zxz_5 (nid)
as 'bitmap'
with deferred rebuild

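Note: as is followed by the index handler. The bitmap handler is generally used for columns with a small number of distinct values;
    with deferred rebuild must be included.
2. The index table we created can be seen with show tables;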

By default it is shown as default__zxz_5_zxz_5_index__: the database name, the base table name, and the index name we specified, all joined together.

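3. Run desc on this index table. It contains the indexed column plus a bucket name column and an offset column. If the base table is partitioned, the index is partitioned as well rather than global.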

4. The index must be rebuilt before it holds any index data:

Plain rebuild: alter index zxz_5_index on zxz_5 rebuild;

Per-partition rebuild: alter index zxz_5_index on zxz_5 partition (year="2018") rebuild;
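Once the index has been rebuilt, a filter on the indexed column can make use of it. The following is a minimal sketch, assuming the zxz_5 table and index from above; the literal 100 is a made-up value for nid. hive.optimize.index.filter (off by default) is the setting that lets the optimizer consider indexes automatically. Whether the planner actually uses the index depends on the Hive version and the table size, so check the plan with explain. Index support was removed in Hive 3.0, so this applies to earlier versions only.

```sql
-- Let the optimizer consider indexes automatically (default false).
SET hive.optimize.index.filter=true;

-- Hypothetical lookup on the indexed column nid; 100 is an assumed value.
EXPLAIN SELECT * FROM zxz_5 WHERE nid = 100;
SELECT * FROM zxz_5 WHERE nid = 100;
```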

5. Show information about the index:

show formatted index on zxz_5;

6. Drop the index:

drop index zxz_5_index on zxz_5;   Note: dropping the base table automatically drops the index table as well.

2. Hive optimization:

MapReduce
-------------------
   Map : map -> partition -> sortAndSpill() --> Combiner
   hive.exec.compress.output=false                       //whether to compress the final output files, default false
   hive.exec.compress.intermediate=false               //whether to compress intermediate files, default false
   hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec   //compression codec, empty by default
   hive.intermediate.compression.type                   //compression type
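The settings above can be applied per session. A minimal sketch, assuming the Snappy native libraries are installed on the cluster; BLOCK is one of the SequenceFile compression types (NONE, RECORD, BLOCK) and is chosen here only as an example.

```sql
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;

-- Optionally compress the final output files as well.
SET hive.exec.compress.output=true;
```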

Hive tuning
------------------
   1. Use explain to inspect the execution plan of a query
       $beeline>explain [extended] select sum(id) from customers ;

limit optimization
--------------------
1.
<property>
<name>hive.limit.optimize.enable</name>
<value>false</value>
<description>Whether to enable to optimization to trying a smaller subset of data for simple LIMIT first.</description>
</property>

<property>
<name>hive.limit.row.max.size</name>
<value>100000</value>
<description>When trying a smaller subset of data for simple LIMIT, how much size we need to guarantee each row to have at least.</description>
</property>
<property>
<name>hive.limit.optimize.limit.file</name>
<value>10</value>
<description>When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.</description>
</property>
<property>
<name>hive.limit.optimize.fetch.max</name>
<value>50000</value>
<description>
Maximum number of rows allowed for a smaller subset of data for simple LIMIT, if it is a fetch query.
Insert queries are not restricted by this limit.
</description>
</property>
<property>
<name>hive.limit.pushdown.memory.usage</name>
<value>0.1</value>
<description>
Expects value between 0.0f and 1.0f.
The fraction of available memory to be used for buffering rows in Reducesink operator for limit pushdown optimization.
</description>
</property>
<property>
<name>hive.limit.query.max.table.partition</name>
<value>-1</value>
<description>
This controls how many partitions can be scanned for each partitioned table.
The default value "-1" means no limit.
</description>
</property>
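The same properties can be set for a single session instead of in hive-site.xml. A minimal sketch, using the customers table from the explain example above. With the optimization enabled, Hive samples a subset of the input for a simple LIMIT query, so the query returns sooner, but the rows returned come from the sampled files only.

```sql
-- Enable sampling for simple LIMIT queries (default false).
SET hive.limit.optimize.enable=true;
-- Sample at most 10 files, assuming each row needs at most 100000 bytes.
SET hive.limit.optimize.limit.file=10;
SET hive.limit.row.max.size=100000;

SELECT * FROM customers LIMIT 10;
```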

hadoop
-------------------
   1.local
       nothing!
       No separate daemon processes need to be started.
       All Java programs run inside a single JVM.

   2.pseudo-distributed

   3.fully distributed


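Local mode:
------------------------
   hive.exec.mode.local.auto=true                       //let Hive decide automatically whether to run a job in local mode
   hive.exec.mode.local.auto.inputbytes.max=134217728   //local mode only if the total input size is below this value (128 MB)
   hive.exec.mode.local.auto.input.files.max=4           //local mode only if the number of input files is at most this value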
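A short session sketch of the settings above, assuming customers is a small table that stays under both thresholds. With automatic local mode on, such a query runs inside the client's JVM instead of being submitted to the cluster.

```sql
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.input.files.max=4;

-- Small input: expected to run as a local job rather than on the cluster.
SELECT count(*) FROM customers;
```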

JVM reuse
------------------------
   [not recommended]
   SET mapred.job.reuse.jvm.num.tasks=5;               //used with MapReduce 1 only; does not apply to YARN.
   com.it18zhang.myhadoop273_1211.join.reduce.App

   [yarn]
   //mapred-site.xml
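   mapreduce.job.ubertask.enable=false                   //run a series of tasks sequentially in a single JVM, default false
   mapreduce.job.ubertask.maxmaps=9                   //maximum number of maps, default 9; can only be lowered.
   mapreduce.job.ubertask.maxreduces=1                   //only 1 reduce is currently supported.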
   mapreduce.job.ubertask.maxbytes=128m               //

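Parallel execution
-------------------------
   explain shows the execution plan; stages that do not depend on
   one another can be executed in parallel.
   hive.exec.parallel=true               //enable parallel execution of MR jobs, default false
   hive.exec.parallel.thread.number=8   //maximum number of jobs executed in parallel, default 8.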
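A query whose branches are independent is the typical case that benefits. A minimal sketch: customers is the table used earlier, while orders is a hypothetical second table. The two aggregations do not depend on each other, so their jobs can be launched at the same time.

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- The two count(*) branches are independent stages.
SELECT u.src, u.cnt
FROM (
  SELECT 'customers' AS src, count(*) AS cnt FROM customers
  UNION ALL
  SELECT 'orders' AS src, count(*) AS cnt FROM orders
) u;
```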

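Map-side join
------------------------
   SET hive.auto.convert.join=true;                   //let Hive convert a common join into a map join automatically
   SET hive.mapjoin.smalltable.filesize=600000000;       //a table whose file size is <= this value can serve as the small table of a map join.
   SET hive.auto.convert.join.noconditionaltask=true;   //no need for a /*+ mapjoin(customers) */ hint in the select.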
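A minimal sketch of both forms of map join. The orders table and the columns cid and name are assumptions for illustration; only customers and its id column appear earlier. In the first query Hive decides on its own; the second uses the explicit hint, which newer versions may ignore unless hive.ignore.mapjoin.hint is set to false.

```sql
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=600000000;

-- Automatic conversion: customers is loaded into memory if it is small enough.
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.cid = c.id;

-- Explicit hint form, naming the small table by its alias.
SELECT /*+ MAPJOIN(c) */ o.id, c.name
FROM orders o
JOIN customers c ON o.cid = c.id;
```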

Bucket map join
-------------------------
   SET hive.auto.convert.join=true; --default false       //
   SET hive.optimize.bucketmapjoin=true; --default false   //
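A bucket map join needs both tables bucketed on the join key, with bucket counts that are multiples of each other. A minimal sketch under those assumptions; the table names customers_b and orders_b and their columns are hypothetical. Whether the planner actually picks a bucket map join depends on the Hive version and on how the data was loaded, so verify with explain.

```sql
-- Both tables are bucketed on the join key; 8 is a multiple of 4.
CREATE TABLE customers_b (id int, name string)
CLUSTERED BY (id) INTO 4 BUCKETS;

CREATE TABLE orders_b (id int, cid int, price double)
CLUSTERED BY (cid) INTO 8 BUCKETS;

SET hive.auto.convert.join=true;
SET hive.optimize.bucketmapjoin=true;

-- Only the matching buckets of customers_b need to be loaded by each mapper.
EXPLAIN
SELECT o.id, c.name
FROM orders_b o
JOIN customers_b c ON o.cid = c.id;
```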

SkewJoin
-------------------------
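   Skew join.
   SET hive.optimize.skewjoin=true;       //enable skew join optimization
   SET hive.skewjoin.key=100000;           //a join key with more rows than this value is treated as skewed and its rows are processed separately.
   SET hive.groupby.skewindata=true;       //apply skew optimization to group by, default false.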
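A minimal sketch of where these settings apply. The orders table and the columns cid and name are hypothetical; the assumption is that a few cid values account for most of the rows.

```sql
-- Join with a skewed key: rows of heavy keys are handled separately.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.cid = c.id;

-- Aggregation with a skewed key: work is spread before the final group by.
SET hive.groupby.skewindata=true;
SELECT cid, count(*) AS cnt
FROM orders
GROUP BY cid;
```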

analyze
-----------------------
   Collects statistics at the table, partition, and column level; they are passed as input to the
   CBO (cost-based optimizer), which then picks the lowest-cost query plan.
   analyze table customers compute statistics ;
   desc extended customers ;
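Table-level statistics do not include column statistics; those are collected separately. A minimal sketch, assuming a Hive version that supports for columns without an explicit column list and has the hive.cbo.enable setting.

```sql
-- Table-level statistics (row count, size, number of files).
ANALYZE TABLE customers COMPUTE STATISTICS;
-- Column-level statistics for all columns.
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;

-- Make sure the cost-based optimizer is on.
SET hive.cbo.enable=true;

-- Inspect the collected statistics for the table and for the id column.
DESCRIBE FORMATTED customers;
DESCRIBE FORMATTED customers id;
```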

beeline
---------------------------
   beeline -u jdbc:hive2://           //runs in embedded (local) mode, without starting a hiveserver2 server.

create table tt (id int,hobbies array<String>,addr struct<province:string,city:string,street:string>,scores map<string,int> ) row format delimited fields terminated by ' ' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile ;

insert into tt select 1,array('1','2','3'),named_struct('province','hebei','city','baoding','street','renmin'),map('a',100,'b',200);

create table stru(id int,a struct<p1:string,p2:string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':' STORED AS TEXTFILE;

create table map(id int ,a map<string,int>) row format delimited fields terminated by ' ' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile ;
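The complex-typed columns of tt can be read with index, dot, and key access. A minimal sketch against the tt table defined above; with the delimiters declared there, a matching text line would look like the one in the first comment.

```sql
-- Sample text line for tt (fields: ' ', collection items: ',', map keys: ':'):
-- 1 1,2,3 hebei,baoding,renmin a:100,b:200

SELECT id,
       hobbies[0]    AS first_hobby,   -- array element by index
       size(hobbies) AS hobby_count,   -- number of array elements
       addr.province AS province,      -- struct field by name
       scores['a']   AS score_a        -- map value by key
FROM tt;

-- One output row per array element.
SELECT id, hobby
FROM tt LATERAL VIEW explode(hobbies) h AS hobby;
```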
