Integrating Spark SQL with Hive

Step 1: Create a Hive configuration file in Spark's conf directory.

In /usr/local/spark/conf, create a file named hive-site.xml with the following contents:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.17.108:9083</value>
  </property>
</configuration>


Note: do not modify the hive-site.xml in your local Hive installation. I made that mistake myself.
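If you prefer, the config file above can be generated programmatically instead of written by hand. A minimal Python sketch (the metastore URI thrift://192.168.17.108:9083 is the one from this article; substitute your own host and port):

```python
import xml.etree.ElementTree as ET

def make_hive_site(metastore_uri):
    """Build a minimal hive-site.xml that points Spark at a remote Hive metastore."""
    conf = ET.Element("configuration")
    prop = ET.SubElement(conf, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri

    return ET.tostring(conf, encoding="unicode")

xml_text = make_hive_site("thrift://192.168.17.108:9083")
print(xml_text)
# Save the output as /usr/local/spark/conf/hive-site.xml on the Spark machine.
```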

Step 2: Start the Hive metastore service.

In /usr/local/hive/bin, run:

./hive --service metastore

Or run it in the background:

nohup hive --service metastore > /dev/null 2>&1 &


Notes:
1. Because the environment variables are configured, the command can be run from any directory.
2. Before this step, start MySQL and Hive first to avoid unexpected errors.
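Before moving on, it can help to confirm that the metastore is actually listening. A small sketch of a TCP reachability check (the host and port are the ones from this article's setup; adjust for your cluster):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In this article's setup, the metastore should be listening on 192.168.17.108:9083:
# print(port_open("192.168.17.108", 9083))
```

If this returns False, recheck that `hive --service metastore` is still running before attempting step 3.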
Step 3: In /usr/local/spark/bin, run spark-sql.

You can now verify the setup by running Hive queries.

First, run the query in Hive, as shown below:

hive> select count(1) from student;
Query ID = root_20190308094705_cc7c65ac-a870-4e54-aba7-635982001eee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1552009015279_0001, Tracking URL = http://hadoop:8088/proxy/application_1552009015279_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1552009015279_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-08 09:47:33,589 Stage-1 map = 0%,  reduce = 0%
2019-03-08 09:47:53,266 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.49 sec
2019-03-08 09:48:05,512 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.88 sec
MapReduce Total cumulative CPU time: 6 seconds 880 msec
Ended Job = job_1552009015279_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.88 sec   HDFS Read: 7123 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 880 msec
OK
10
Time taken: 61.877 seconds, Fetched: 1 row(s)

As shown, Hive took 61.877 seconds to return the count (the table holds 10 rows).
 

Now run the same query in spark-sql, as shown below:

spark-sql> select count(1) from student;
19/03/08 10:01:00 INFO ParseDriver: Parsing command: select count(1) from student
19/03/08 10:01:00 INFO ParseDriver: Parse Completed
19/03/08 10:01:01 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 521.4 KB, free 524.5 KB)
19/03/08 10:01:01 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 41.5 KB, free 566.0 KB)
19/03/08 10:01:01 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40343 (size: 41.5 KB, free: 517.4 MB)
19/03/08 10:01:01 INFO SparkContext: Created broadcast 1 from processCmd at CliDriver.java:376
19/03/08 10:01:02 INFO FileInputFormat: Total input paths to process : 1
19/03/08 10:01:02 INFO SparkContext: Starting job: processCmd at CliDriver.java:376
19/03/08 10:01:02 INFO DAGScheduler: Registering RDD 6 (processCmd at CliDriver.java:376)
19/03/08 10:01:02 INFO DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 1 output partitions
19/03/08 10:01:02 INFO DAGScheduler: Final stage: ResultStage 2 (processCmd at CliDriver.java:376)
19/03/08 10:01:02 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
19/03/08 10:01:02 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
19/03/08 10:01:02 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
19/03/08 10:01:02 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 13.6 KB, free 579.5 KB)
19/03/08 10:01:02 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.8 KB, free 586.4 KB)
19/03/08 10:01:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:40343 (size: 6.8 KB, free: 517.4 MB)
19/03/08 10:01:02 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
19/03/08 10:01:02 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376)
19/03/08 10:01:02 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
19/03/08 10:01:02 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,ANY, 2151 bytes)
19/03/08 10:01:02 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
19/03/08 10:01:02 INFO HadoopRDD: Input split: hdfs://hadoop:9000/usr/hive/warehouse/zxc.db/student/stu.txt:0+239
19/03/08 10:01:02 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
19/03/08 10:01:02 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
19/03/08 10:01:02 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
19/03/08 10:01:02 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
19/03/08 10:01:02 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
19/03/08 10:01:03 INFO GenerateMutableProjection: Code generated in 554.72438 ms
19/03/08 10:01:03 INFO GenerateUnsafeProjection: Code generated in 33.316274 ms
19/03/08 10:01:04 INFO GenerateMutableProjection: Code generated in 30.84288 ms
19/03/08 10:01:04 INFO GenerateUnsafeRowJoiner: Code generated in 34.216625 ms
19/03/08 10:01:04 INFO GenerateUnsafeProjection: Code generated in 33.597192 ms
19/03/08 10:01:04 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2424 bytes result sent to driver
19/03/08 10:01:04 INFO DAGScheduler: ShuffleMapStage 1 (processCmd at CliDriver.java:376) finished in 1.629 s
19/03/08 10:01:04 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1630 ms on localhost (1/1)
19/03/08 10:01:04 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/03/08 10:01:04 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@24c88e33
19/03/08 10:01:04 INFO StatsReportListener: task runtime:(count: 1, mean: 1630.000000, stdev: 0.000000, max: 1630.000000, min: 1630.000000)
19/03/08 10:01:04 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:04 INFO StatsReportListener: 	1.6 s	1.6 s	1.6 s	1.6 s	1.6 s	1.6 s	1.6 s	1.6 s	1.6 s
19/03/08 10:01:04 INFO DAGScheduler: looking for newly runnable stages
19/03/08 10:01:04 INFO DAGScheduler: running: Set()
19/03/08 10:01:04 INFO StatsReportListener: shuffle bytes written:(count: 1, mean: 42.000000, stdev: 0.000000, max: 42.000000, min: 42.000000)
19/03/08 10:01:04 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:04 INFO StatsReportListener: 	42.0 B	42.0 B	42.0 B	42.0 B	42.0 B	42.0 B	42.0 B	42.0 B	42.0 B
19/03/08 10:01:04 INFO DAGScheduler: waiting: Set(ResultStage 2)
19/03/08 10:01:04 INFO StatsReportListener: task result size:(count: 1, mean: 2424.000000, stdev: 0.000000, max: 2424.000000, min: 2424.000000)
19/03/08 10:01:04 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:04 INFO StatsReportListener: 	2.4 KB	2.4 KB	2.4 KB	2.4 KB	2.4 KB	2.4 KB	2.4 KB	2.4 KB	2.4 KB
19/03/08 10:01:04 INFO DAGScheduler: failed: Set()
19/03/08 10:01:04 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[9] at processCmd at CliDriver.java:376), which has no missing parents
19/03/08 10:01:04 INFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 91.533742, stdev: 0.000000, max: 91.533742, min: 91.533742)
19/03/08 10:01:04 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:04 INFO StatsReportListener: 	92 %	92 %	92 %	92 %	92 %	92 %	92 %	92 %	92 %
19/03/08 10:01:04 INFO StatsReportListener: other time pct: (count: 1, mean: 8.466258, stdev: 0.000000, max: 8.466258, min: 8.466258)
19/03/08 10:01:04 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:04 INFO StatsReportListener: 	 8 %	 8 %	 8 %	 8 %	 8 %	 8 %	 8 %	 8 %	 8 %
19/03/08 10:01:04 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.7 KB, free 599.0 KB)
19/03/08 10:01:04 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.3 KB, free 605.4 KB)
19/03/08 10:01:04 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40343 (size: 6.3 KB, free: 517.4 MB)
19/03/08 10:01:04 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
19/03/08 10:01:04 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[9] at processCmd at CliDriver.java:376)
19/03/08 10:01:04 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
19/03/08 10:01:04 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,NODE_LOCAL, 1999 bytes)
19/03/08 10:01:04 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
19/03/08 10:01:04 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
19/03/08 10:01:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms
19/03/08 10:01:04 INFO GenerateMutableProjection: Code generated in 23.664575 ms
19/03/08 10:01:05 INFO GenerateMutableProjection: Code generated in 69.765155 ms
19/03/08 10:01:05 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1664 bytes result sent to driver
19/03/08 10:01:05 INFO DAGScheduler: ResultStage 2 (processCmd at CliDriver.java:376) finished in 1.066 s
19/03/08 10:01:05 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@17cada35
19/03/08 10:01:05 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 2.869384 s
19/03/08 10:01:05 INFO StatsReportListener: task runtime:(count: 1, mean: 1070.000000, stdev: 0.000000, max: 1070.000000, min: 1070.000000)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	1.1 s	1.1 s	1.1 s	1.1 s	1.1 s	1.1 s	1.1 s	1.1 s	1.1 s
19/03/08 10:01:05 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 1070 ms on localhost (1/1)
19/03/08 10:01:05 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
19/03/08 10:01:05 INFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	0.0 ms	0.0 ms	0.0 ms	0.0 ms	0.0 ms	0.0 ms	0.0 ms	0.0 ms	0.0 ms
19/03/08 10:01:05 INFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	0.0 B	0.0 B	0.0 B	0.0 B	0.0 B	0.0 B	0.0 B	0.0 B	0.0 B
19/03/08 10:01:05 INFO StatsReportListener: task result size:(count: 1, mean: 1664.000000, stdev: 0.000000, max: 1664.000000, min: 1664.000000)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	1664.0 B	1664.0 B	1664.0 B	1664.0 B	1664.0 B	1664.0 B	1664.0 B	1664.0 B	1664.0 B
19/03/08 10:01:05 INFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 94.392523, stdev: 0.000000, max: 94.392523, min: 94.392523)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	94 %	94 %	94 %	94 %	94 %	94 %	94 %	94 %	94 %
10
Time taken: 4.686 seconds, Fetched 1 row(s)
19/03/08 10:01:05 INFO CliDriver: Time taken: 4.686 seconds, Fetched 1 row(s)
spark-sql> 19/03/08 10:01:05 INFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	 0 %	 0 %	 0 %	 0 %	 0 %	 0 %	 0 %	 0 %	 0 %
19/03/08 10:01:05 INFO StatsReportListener: other time pct: (count: 1, mean: 5.607477, stdev: 0.000000, max: 5.607477, min: 5.607477)
19/03/08 10:01:05 INFO StatsReportListener: 	0%	5%	10%	25%	50%	75%	90%	95%	100%
19/03/08 10:01:05 INFO StatsReportListener: 	 6 %	 6 %	 6 %	 6 %	 6 %	 6 %	 6 %	 6 %	 6 %

         > ;
spark-sql> 

As shown, spark-sql returned the same count in 4.686 seconds, while the identical statement took 61.877 seconds in Hive. Running HiveQL through spark-sql is therefore far faster than running it through Hive's own MapReduce engine.
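The speedup is easy to quantify from the two timings above:

```python
hive_seconds = 61.877   # total time reported by the Hive run above
spark_seconds = 4.686   # total time reported by the spark-sql run above

speedup = hive_seconds / spark_seconds
print(f"spark-sql was about {speedup:.1f}x faster")  # about 13.2x
```

Keep in mind that on a tiny 10-row table, much of Hive's 61.877 seconds is MapReduce job-launch overhead rather than actual computation, so the exact ratio will vary with data size.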

Notes:

1. After spark-sql finishes a query, the prompt can appear to hang, which makes it look as if the query is still running. The fix is to press Enter and then type a semicolon ";" to get back to the spark-sql prompt.


Reposted from blog.csdn.net/zhaoxiangchong/article/details/81661750