How to do a map-side join.

In Hive, set the table's rawDataSize to 1:
14: jdbc:hive2://ark3:9994> alter table app_uuid_info_test  set tblproperties(rawDataSize=1)
14: jdbc:hive2://ark3:9994> ;
The statistics of an HBase-backed table are not updated automatically, so the value has to be set by hand.
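
Why setting rawDataSize helps: as far as I can tell from the Spark 2.x source, the size estimate for a Hive metastore table is read from the table properties, preferring totalSize, then rawDataSize, and otherwise falling back to a very large default. The sketch below is a simplified model of that fallback, not Spark's actual code. The names `SizeEstimate`, `estimateSizeInBytes` and `defaultSize` are hypothetical, the default of Long.MaxValue is an assumption, and the optional HDFS fallback is left out.

```scala
object SizeEstimate {
  // Read a table property as a strictly positive Long; anything else counts as "not set".
  private def positiveLong(props: Map[String, String], key: String): Option[Long] =
    props.get(key)
      .flatMap(v => scala.util.Try(v.trim.toLong).toOption)
      .filter(_ > 0L)

  // Assumed order: totalSize first, then rawDataSize, then a large default
  // that keeps the table above any realistic broadcast threshold.
  def estimateSizeInBytes(props: Map[String, String],
                          defaultSize: Long = Long.MaxValue): BigInt =
    BigInt(
      positiveLong(props, "totalSize")
        .orElse(positiveLong(props, "rawDataSize"))
        .getOrElse(defaultSize)
    )
}

object SizeEstimateDemo {
  def main(args: Array[String]): Unit = {
    // No statistics at all: the estimate is the large default, so no broadcast.
    println(SizeEstimate.estimateSizeInBytes(Map.empty))
    // After the ALTER TABLE above: the estimate is 1 byte.
    println(SizeEstimate.estimateSizeInBytes(Map("rawDataSize" -> "1")))
  }
}
```

Under this model a table with no usable statistics can never fall below the broadcast threshold, which would explain why the value is set manually here.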


The broadcast threshold (in bytes):
set spark.sql.autoBroadcastJoinThreshold=100000000;

The join itself: left outer join app_uuid_info_test b

select   a.tmp_id,a.uuid,a.eguan_id,a.device_id,b.row_key,b.value.int_20 from tmp.ods_app_hour a left  outer join app_uuid_info_test b on a.tmp_id = b.row_key where a.app_id='6069' and a.day = '20180303'  and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7' limit 100;


At execution time this runs as a map join.
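
The same check can also be done from the DataFrame API. The following is a minimal sketch, assuming Spark 2.x with Hive support and the spark-sql dependency on the classpath; the application name is hypothetical, and the table and column names are the ones used in the query above. It marks the right side with `broadcast(...)` and prints the plans with `explain(true)`; the tmp_id filter and the limit from the SQL are left out for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-check") // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()

    // Same threshold as the SET statement above, in bytes.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100000000L)

    val a = spark.table("tmp.ods_app_hour")
      .where("app_id = '6069' and day = '20180303'")
    val b = spark.table("app_uuid_info_test")

    // broadcast(b) adds an explicit hint on the right side of the join.
    val joined = a.join(broadcast(b), a("tmp_id") === b("row_key"), "left_outer")

    // Prints the parsed, analyzed, optimized and physical plans.
    joined.explain(true)

    spark.stop()
  }
}
```

If the join is planned as a broadcast join, the physical plan should contain a BroadcastHashJoin node with a BroadcastExchange under it.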


The logic that decides whether to use a broadcast join lives in the class org.apache.spark.sql.execution.SparkStrategies:


object CanBroadcast {
  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
    case BroadcastHint(p) => Some(p)
    case p if sqlContext.conf.autoBroadcastJoinThreshold > 0 &&
        p.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold => Some(p)
    case _ => None
  }
}
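
The check above boils down to two rules: an explicit broadcast hint always qualifies, and otherwise the estimated size must not exceed a positive threshold. Here is a self-contained sketch of that decision with plain Scala types. `TableStats` and `BroadcastDecision` are hypothetical stand-ins, not Spark classes, and the size given for the large table is made up for illustration.

```scala
// Hypothetical stand-in for a logical plan's statistics plus its hint flag.
final case class TableStats(name: String,
                            sizeInBytes: BigInt,
                            hasBroadcastHint: Boolean = false)

object BroadcastDecision {
  // Same shape as CanBroadcast: a hint wins; otherwise the estimate must be
  // at most the threshold, and a non-positive threshold disables the rule.
  def canBroadcast(stats: TableStats, threshold: Long): Boolean =
    stats.hasBroadcastHint ||
      (threshold > 0 && stats.sizeInBytes <= BigInt(threshold))
}

object BroadcastDecisionDemo {
  def main(args: Array[String]): Unit = {
    val threshold = 100000000L // value of spark.sql.autoBroadcastJoinThreshold used above

    // rawDataSize was set to 1, so the estimate is 1 byte.
    val dim = TableStats("app_uuid_info_test", BigInt(1))
    // Illustrative size only; the real size of this table is not given in the text.
    val fact = TableStats("tmp.ods_app_hour", BigInt(500000000000L))

    println(BroadcastDecision.canBroadcast(dim, threshold))  // true
    println(BroadcastDecision.canBroadcast(fact, threshold)) // false
    println(BroadcastDecision.canBroadcast(dim, -1L))        // false: rule disabled
    println(BroadcastDecision.canBroadcast(fact.copy(hasBroadcastHint = true), -1L)) // true: hint wins
  }
}
```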




== Parsed Logical Plan ==
'GlobalLimit 100
+- 'LocalLimit 100
   +- 'Project ['a.tmp_id, 'a.uuid, 'a.eguan_id, 'a.device_id, 'b.row_key, 'b.value.int_20]
      +- 'Filter ((('a.app_id = 6069) && ('a.day = 20180303)) && ('a.tmp_id = 1a16d393e7042213384f994394b763d37121d7))
         +- 'Join LeftOuter, ('a.tmp_id = 'b.row_key)
            :- 'UnresolvedRelation `tmp`.`ods_app_hour`, a
            +- 'UnresolvedRelation `app_uuid_info_test`, b

== Analyzed Logical Plan ==
tmp_id: string, uuid: bigint, eguan_id: string, device_id: string, row_key: string, int_20: string
GlobalLimit 100
+- LocalLimit 100
   +- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
      +- Filter (((app_id#22014 = 6069) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
         +- Join LeftOuter, (tmp_id#22017 = row_key#22195)
            :- SubqueryAlias a
            :  +- SubqueryAlias ods_app_hour
            :     +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
            +- SubqueryAlias b
               +- MetastoreRelation default, app_uuid_info_test

== Optimized Logical Plan ==
GlobalLimit 100
+- LocalLimit 100
   +- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
      +- Join LeftOuter, (tmp_id#22017 = row_key#22195)
         :- Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
         :  +- Filter (((((isnotnull(app_id#22014) && isnotnull(day#22015)) && isnotnull(tmp_id#22017)) && (app_id#22014 = 6069)) && (day#22015 = 20180303)) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
         :     +- Relation[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,device_mac#22021,device_imsi#22022,device_aid#22023,device_snr#22024,device_udid#22025,debug_state#22026,hjk_state#22027,sir_state#22028,ij_state#22029,sdk_api_level#22030,standard_brand_id#22031,standard_model_id#22032,os_name_id#22033,os_version_id#22034,standard_smo_id#22035,app_key#22036,app_channel#22037,sdk_version#22038,app_version#22039,imeis_code#22040,... 157 more fields] parquet
         +- MetastoreRelation default, app_uuid_info_test

== Physical Plan ==
CollectLimit 100
+- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020, row_key#22195, value#22196[int_20] AS int_20#22197]
   +- *BroadcastHashJoin [tmp_id#22017], [row_key#22195], LeftOuter, BuildRight
      :- *Project [tmp_id#22017, uuid#22018L, eguan_id#22019, device_id#22020]
      :  +- *Filter (isnotnull(tmp_id#22017) && (tmp_id#22017 = 1a16d393e7042213384f994394b763d37121d7))
      :     +- *FileScan parquet tmp.ods_app_hour[tmp_id#22017,uuid#22018L,eguan_id#22019,device_id#22020,app_id#22014,day#22015,hour#22016] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#22014), isnotnull(day#22015), (app_id#22014 = 6069), (day#22015 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
         +- HiveTableScan [row_key#22195, value#22196], MetastoreRelation default, app_uuid_info_test
2018-03-04 19:45:47,554 INFO  [dispatcher-event-loop-10] storage.BlockManagerInfo: Removed broadcast_173_piece0 on 192.168.220.171:38696 in memory (size: 33.7 KB, free: 334.5 MB)


Dimension-table form, with the small table filtered in a subquery before the join:

select a.tmp_id,a.uuid,a.eguan_id,a.device_id,b.row_key,b.int_20 from tmp.ods_app_hour a left outer join (select row_key,value.int_20 as int_20 from    app_uuid_info_test  where row_key in (01,02) )     b on a.tmp_id = b.row_key where a.app_id='6069' and a.day = '20180303' and a.tmp_id = '1a16d393e7042213384f994394b763d37121d7' limit 100



== Physical Plan ==
CollectLimit 100
+- *BroadcastHashJoin [tmp_id#23309], [row_key#23487], LeftOuter, BuildRight
   :- *LocalLimit 100
   :  +- *Project [tmp_id#23309, uuid#23310L, eguan_id#23311, device_id#23312]
   :     +- *Filter (isnotnull(tmp_id#23309) && (tmp_id#23309 = 1a16d393e7042213384f994394b763d37121d7))
   :        +- *FileScan parquet tmp.ods_app_hour[tmp_id#23309,uuid#23310L,eguan_id#23311,device_id#23312,app_id#23306,day#23307,hour#23308] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://mycluster/user/hive/warehouse/tmp.db/ods_app_hour/app_id=6069/day=..., PartitionCount: 24, PartitionFilters: [isnotnull(app_id#23306), isnotnull(day#23307), (app_id#23306 = 6069), (day#23307 = 20180303)], PushedFilters: [IsNotNull(tmp_id), EqualTo(tmp_id,1a16d393e7042213384f994394b763d37121d7)], ReadSchema: struct<tmp_id:string,uuid:bigint,eguan_id:string,device_id:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- *Project [row_key#23487, value#23488[int_20] AS int_20#22932]
         +- *Filter row_key#23487 IN (1,2)
            +- HiveTableScan [row_key#23487, value#23488], MetastoreRelation default, app_uuid_info_test


Reposted from lingzhi007.iteye.com/blog/2412153