Connecting PySpark to HBase

For a long time I wondered how to connect Spark to HBase. Connecting Spark to Hive is easy, but I prefer going straight to HBase: Hive executes SQL as MapReduce jobs, which is inherently slow, so I have never been keen on using it. I have always chased performance, even though I don't actually know how a Hive external table over HBase would perform.

To get Spark talking to HBase (assuming both are already installed and working):

1.  Add the HBase jars to Spark:

   mkdir $SPARK_HOME/jars/hbase
   cp -rf $HBASE_HOME/lib/hbase* $SPARK_HOME/jars/hbase
   cp $HBASE_HOME/lib/guava-12.0.1.jar $SPARK_HOME/jars/hbase
   cp $HBASE_HOME/lib/htrace-core-3.2.0-incubating.jar $SPARK_HOME/jars/hbase
   cp $HBASE_HOME/lib/protobuf-java-2.5.0.jar $SPARK_HOME/jars/hbase

   Then download the spark-examples jar, which provides the Python converter classes used in the code below:
   https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001

   cp $HBASE_HOME/lib/spark-examples_2.11-1.6.0-typesafe-001.jar $SPARK_HOME/jars/hbase

2.  Configure the Spark environment variables:

Add the following to spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/home/kfk/hadoop-2.6.0-cdh5.9.3/bin/hadoop classpath):$(/home/kfk/hbase-1.2.0-cdh5.9.3/bin/hbase classpath):/home/kfk/spark-2.4.0-bin-hadoop2.6/jars/hbase/*

3.  Copy hbase-site.xml into $SPARK_HOME/conf (I have not verified whether this step is strictly necessary).
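
After these three steps, restart pyspark. As a quick sanity check (my own addition, not part of the original post), you can try to resolve one of the HBase classes through the driver JVM; a class-not-found error here means the jars from step 1 or the classpath from step 2 did not take effect:

# Run inside a pyspark shell, where sc is the live SparkContext.
# If this raises a Py4J error wrapping ClassNotFoundException, the
# HBase jars are not on the driver classpath.
sc._jvm.java.lang.Class.forName("org.apache.hadoop.hbase.mapreduce.TableInputFormat")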

With that, Spark is configured. Next is the PySpark code for reading data from HBase:

host = 'master'
table = 'job'
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyC = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valC = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
# Read the HBase table as an RDD of (row key, result string) pairs.
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyC,
    valueConverter=valC,
    conf=conf)
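
Each element of hbase_rdd is a (row key, string) pair whose value holds one JSON document per cell, joined by newlines. A minimal sketch of unpacking it (my own, following the stock hbase_inputformat.py example; the exact JSON field names depend on the spark-examples version, so print one record first to confirm):

import json

# Split each result string into per-cell JSON documents and parse them.
cells = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)
for key, cell in cells.take(5):
    print(key, cell)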

And the PySpark code for writing data to HBase:

host = 'master'
table = 'job'
keyC = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valC = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {"hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
rawData = ['0010,col,name,wujqi', '0010,col,age,18']
# data format: (rowkey, [rowkey, family, qualifier, value])
# note: use split(',')[0] for the row key; x[0] would take only the first character
sc.parallelize(rawData)\
    .map(lambda x: (x.split(',')[0], x.split(',')))\
    .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyC, valueConverter=valC)
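
If your rows arrive as structured records rather than pre-joined CSV strings, a small helper can flatten them into the (rowkey, [rowkey, family, qualifier, value]) lists that StringListToPutConverter expects. This is my own sketch (the helper name and sample data are made up), reusing conf, keyC, and valC from above:

# Hypothetical helper: expand one (rowkey, {qualifier: value}) record into
# per-cell lists for StringListToPutConverter.
def to_cells(rowkey, family, columns):
    return [(rowkey, [rowkey, family, qual, str(val)])
            for qual, val in columns.items()]

rows = [("0011", {"name": "zhangsan", "age": 20})]
sc.parallelize(rows)\
    .flatMap(lambda kv: to_cells(kv[0], "col", kv[1]))\
    .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyC, valueConverter=valC)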

If you want to connect to HBase through the HBase Thrift service instead, see:
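
As one example of that route, happybase is a widely used Python client for the HBase Thrift gateway. A minimal sketch (my own, assuming the Thrift server has been started with "hbase thrift start" and listens on master at the default port 9090):

import happybase

# Connect through the Thrift gateway rather than through Spark.
connection = happybase.Connection('master', port=9090)
job = connection.table('job')

# Write one cell and scan it back; column names are b'family:qualifier'.
job.put(b'0010', {b'col:name': b'wujqi'})
for key, data in job.scan(row_prefix=b'0010'):
    print(key, data)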

Reprinted from blog.csdn.net/weixin_39594447/article/details/86696182