1. Kudu installation (it is recommended to do the entire installation as root)
In the /etc/apt/sources.list.d directory, first back up and remove ambari-hdp1.list and any other HDP-related repository files, then create a new file named cloudera.list with the following content:
# Packages for Cloudera's Distribution for Hadoop, Version 5, on Ubuntu 16.04 amd64
deb [arch=amd64] http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu xenial-kudu5 contrib
deb-src http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu xenial-kudu5 contrib
Run the following four commands:
#>cd /opt
#>wget https://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu/archive.key -O archive.key
#>sudo apt-key add archive.key
#>apt-get update
Install online as root:
apt-get install kudu # Base Kudu files
apt-get install kudu-master # Service scripts for managing kudu-master
apt-get install kudu-tserver # Service scripts for managing kudu-tserver
apt-get install libkuduclient0 # Kudu C++ client shared library
apt-get install libkuduclient-dev # Kudu C++ client SDK
1.1 Start the services
sudo service kudu-master start
sudo service kudu-tserver start
1.2 Check the web UI
Open http://localhost:8051/ (the kudu-master web UI) in a browser.
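Optionally, connectivity to the master can also be checked from code. Below is a minimal sketch, assuming the org.apache.kudu:kudu-client Java artifact (e.g. version 1.4.0, matching the Kudu installed above) is on the classpath; localhost:7051 is the default master RPC address and may need to be replaced with your kudu-master host.

import org.apache.kudu.client.KuduClient

object KuduSmokeTest extends App {
  // "localhost:7051" is the default kudu-master RPC address; adjust for your host
  val client = new KuduClient.KuduClientBuilder("localhost:7051").build()
  try {
    // listing the tables is a cheap call that confirms the master is reachable
    println("Kudu master reachable, tables: " + client.getTablesList.getTablesList)
  } finally {
    client.close()
  }
}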
The Kudu installed above is version 1.4.
References: https://blog.csdn.net/weijiasheng/article/details/104796332
https://docs.cloudera.com/documentation/kudu/5-12-x/topics/kudu_installation.html
2. Impala installation (it installs, but does not run; since this cost too much time, it is recommended to experiment on CentOS instead)
Add the following to cloudera.list:
deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala2 contrib
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala2 contrib
apt-get install bigtop-utils
apt-get install impala
apt-get install impala-server
apt-get install impala-state-store
apt-get install impala-catalog
apt-get install python-setuptools
apt-get install impala-shell
# cd /usr/local/hadoop-2.0.0-cdh4.1.0/etc/hadoop/
# cp core-site.xml hdfs-site.xml /etc/impala/conf
# cd /etc/impala/conf
# vi hdfs-site.xml
Add the following properties:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.use.legacy.blockreader.local</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>750</value>
</property>
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>impala</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
vim /etc/default/impala
Add the following content:
IMPALA_CATALOG_SERVICE_HOST=127.0.0.1
IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} "
IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} "
ENABLE_CORE_DUMPS=false
# LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
#
MYSQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java-8.0.13.jar
IMPALA_BIN=/usr/lib/impala/sbin
IMPALA_HOME=/usr/lib/impala
# HIVE_HOME=/usr/lib/hive
# HBASE_HOME=/usr/lib/hbase
IMPALA_CONF_DIR=/etc/impala/conf
HADOOP_CONF_DIR=/etc/impala/conf
#
HIVE_CONF_DIR=/etc/impala/conf
Delete the installed setuptools:
cd /usr/lib/python2.7/dist-packages/
rm -rf setuptools*
service impala-state-store restart --kudu_master_hosts=master:7051
service impala-catalog restart --kudu_master_hosts=master:7051
service impala-server restart --kudu_master_hosts=master:7051
Two days have already been spent on this and every attempt failed; Impala fails to bind port 21000. The root cause will be investigated later when time permits.
It is recommended to use CentOS and install impala-kudu instead; see: https://www.51dev.com/javascript/14324
3. Install Spark. For installing Spark 2.4.5, see the blog post I wrote below; 2.3.4 is similar.
https://blog.csdn.net/penker_zhao/article/details/102568564
Download from: https://archive.apache.org/dist/spark/spark-2.3.4/
4. Using Scala and Spark-HBase to access the HBase database
Scala 2.11.12
Spark 2.3.4
HBase 2.0.2
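If in doubt, the Spark-side versions can be confirmed from spark-shell. A minimal sketch (the HBase line assumes the HBase client jars are already on the spark-shell classpath):

// paste into spark-shell to confirm the environment
println("Spark " + org.apache.spark.SPARK_VERSION)            // expected: 2.3.4
println("Scala " + scala.util.Properties.versionNumberString) // expected: 2.11.12
// requires the HBase client jars on the classpath
println("HBase " + org.apache.hadoop.hbase.util.VersionInfo.getVersion)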
4.1 Enter the client with ./hbase shell and create the table
hbase(main):019:0>create 'Student','info'
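Alternatively, the same table can be created programmatically with the HBase 2.x client API. A minimal sketch, assuming the ZooKeeper quorum 192.168.51.32:2181 used later in this section:

import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

object CreateStudentTable extends App {
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "192.168.51.32")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin
  try {
    val table = TableName.valueOf("Student")
    if (!admin.tableExists(table)) {
      // one column family "info", matching the hbase shell command above
      admin.createTable(
        TableDescriptorBuilder.newBuilder(table)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
          .build())
    }
  } finally {
    admin.close()
    connection.close()
  }
}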
4.2 Create a Scala Maven project in IDEA
Install the Scala plugin
Configure the Scala SDK
For details, see:
https://blog.csdn.net/hr786250678/article/details/86229959
https://www.cnblogs.com/chuhongyun/p/11400884.html
https://www.cnblogs.com/wangjianwei/articles/9722234.html
The code is as follows.
pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>scalahbasetest</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <hadoop.version>2.7.5</hadoop.version>
    <spark.version>2.3.4</spark.version>
    <scala.version>2.11.12</scala.version>
    <junit.version>4.12</junit.version>
    <netty.version>4.1.42.Final</netty.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- Spark core dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase.connectors.spark</groupId>
      <artifactId>hbase-spark</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <!--
      java.lang.NoClassDefFoundError: org/apache/spark/streaming/dstream/DStream
      -->
      <!-- <scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-all</artifactId>
      <version>${netty.version}</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <!-- Plugin for compiling Scala -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.4.6</version>
      </plugin>
      <!-- Plugin for compiling Java -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>${maven.compiler.source}</source>
          <target>${maven.compiler.target}</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <includes>
                <include>**/*.scala</include>
              </includes>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
HBaseBulkPutExample.scala
package com.example

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Writes a few rows into the HBase table "Student" via hbaseBulkPut.
 *
 * @author 王天赐
 * @create 2019-11-29 9:28
 */
object HBaseBulkPutExample extends App {
  val tableName = "Student"
  val sparkConf = new SparkConf()
    .setAppName("HBaseBulkPutExample " + tableName)
    // note: a master set here in code overrides the --master flag of spark-submit
    .setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  try {
    // each element is Array(rowkey, qualifier, value)
    val rdd = sc.parallelize(Array(
      Array(Bytes.toBytes("B1001"), Bytes.toBytes("name"), Bytes.toBytes("张飞")),
      Array(Bytes.toBytes("B1002"), Bytes.toBytes("name"), Bytes.toBytes("李白")),
      Array(Bytes.toBytes("B1003"), Bytes.toBytes("name"), Bytes.toBytes("韩信"))))
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "192.168.51.32")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val hbaseContext = new HBaseContext(sc, conf)
    // write each record into column family "info"
    rdd.hbaseBulkPut(hbaseContext, TableName.valueOf(tableName),
      record => {
        val put = new Put(record(0))
        put.addColumn(Bytes.toBytes("info"), record(1), record(2))
        put
      }
    )
  } finally {
    sc.stop()
  }
}
HBaseBulkGetExampleByRDD.scala
package com.example

import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{Cell, CellUtil, HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Reads data from HBase for the row keys held in an RDD.
 * Note: the implicit conversions in org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
 * must be imported for hbaseBulkGet to be available on the RDD.
 *
 * @author 王天赐
 * @create 2019-11-29 19:35
 */
object HBaseBulkGetExampleByRDD extends App {
  // 1. Create SparkConf and SparkContext in local mode
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("HBase")
  val sc = new SparkContext(conf)
  // Reduce log output to WARN
  sc.setLogLevel("WARN")
  try {
    // 2. Create the HBaseConfiguration and set the connection parameters
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "192.168.51.32")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    // 3. Create the HBaseContext
    val hc = new HBaseContext(sc, hbaseConf)
    // 4. Wrap the row keys and qualifiers to fetch in an RDD
    val rowKeyAndQualifier = sc.parallelize(Array(
      Array(Bytes.toBytes("B1001"), Bytes.toBytes("name")),
      Array(Bytes.toBytes("B1002"), Bytes.toBytes("name")),
      Array(Bytes.toBytes("B1003"), Bytes.toBytes("name"))
    ))
    // 5. Fetch the requested row keys and qualifiers (the third argument is the batch size)
    val result = rowKeyAndQualifier.hbaseBulkGet(hc, TableName.valueOf("Student"), 2,
      (info) => {
        val rowkey = info(0)
        val qualifier = info(1)
        val get = new Get(rowkey)
        // restrict the Get to the requested column in family "info"
        get.addColumn(Bytes.toBytes("info"), qualifier)
        get
      }
    )
    // 6. Iterate over the results
    result.foreach(data => {
      // data is a tuple; the Result is the second element
      val result: Result = data._2
      val cells: Array[Cell] = result.rawCells()
      for (cell <- cells) {
        val rowKey = Bytes.toString(CellUtil.cloneRow(cell))
        val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
        val value = Bytes.toString(CellUtil.cloneValue(cell))
        println("[ " + rowKey + " , " + qualifier + " , " + value + " ]")
      }
    })
  } finally {
    sc.stop()
  }
}
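To read the whole table rather than specific row keys, the connector's HBaseContext also exposes a scan-based RDD. A minimal sketch, assuming the HBaseContext.hbaseRDD(TableName, Scan) method of the hbase-spark 1.0.0 connector used above:

package com.example

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScanExample extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("HBaseScan"))
  try {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "192.168.51.32")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    val hc = new HBaseContext(sc, hbaseConf)
    // full scan of Student, printing "rowkey -> value" for every cell
    hc.hbaseRDD(TableName.valueOf("Student"), new Scan()).foreach { case (_, result) =>
      result.rawCells().foreach { cell =>
        println(Bytes.toString(CellUtil.cloneRow(cell)) + " -> " +
          Bytes.toString(CellUtil.cloneValue(cell)))
      }
    }
  } finally {
    sc.stop()
  }
}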
Package with Maven: mvn clean install -DskipTests
Upload scalahbasetest-1.0-SNAPSHOT.jar to the Spark installation directory, e.g. /opt/spark***/jars.
Also copy the HBase jars into the Spark installation directory (e.g. /opt/spark***/jars).
4.3 Run the two Scala programs with spark-submit
To insert the data:
./spark-submit --class com.example.HBaseBulkPutExample --master spark://master:7077 /opt/spark-2.3.4-bin-hadoop2.7/jars/scalahbasetest-1.0-SNAPSHOT.jar
To fetch the data:
./spark-submit --class com.example.HBaseBulkGetExampleByRDD --master spark://master:7077 /opt/spark-2.3.4-bin-hadoop2.7/jars/scalahbasetest-1.0-SNAPSHOT.jar
4.4 How to convert Scala to Java
The Scala sources can first be compiled to .class files and then converted to Java with a decompiler; I have not formally tried this yet.