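Our big data platform follows a fixed sequence of steps. An ETL tool extracts data from the relational databases into HBase. Phoenix secondary indexes and SQL join queries then produce the training and validation sets and hand them to Spark. Spark ML's machine learning library runs the analysis, for example linear regression and decision trees. The results are written to temporary tables in Phoenix, and Zeppelin displays the data. That is the overall design of the platform.
The steps are covered one by one below: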
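1. Set up a single-node Phoenix and HBase with Docker (a cluster is recommended for production; see https://www.cnblogs.com/chinas/p/5910854.html)
Download the project from https://gitee.com/astra_zhao/hbase-phoenix-docker and follow its README.md. For the final step, start the container with the following command: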
docker run -it -p 8765:8765 -p 2181:2181 iteblog/hbase-phoenix-docker
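The two published ports are the usual entry points for Phoenix clients: 2181 is the ZooKeeper client port used by the thick JDBC driver, and 8765 is the default port of the Phoenix Query Server used by the thin driver. As a minimal sketch, assuming the container is reachable on localhost, the JDBC URL for each path could be built like this. The class PhoenixUrls and its methods are hypothetical helpers, not part of Phoenix.

```java
/** Hypothetical helper (not part of Phoenix): builds JDBC URLs for the two published ports. */
final class PhoenixUrls {
    private PhoenixUrls() {}

    /** Thick-client URL: the driver locates HBase through ZooKeeper (port 2181 above). */
    static String thick(String zkHost, int zkPort) {
        return "jdbc:phoenix:" + zkHost + ":" + zkPort;
    }

    /** Thin-client URL: the driver talks HTTP to the Phoenix Query Server (port 8765 above). */
    static String thin(String host, int port) {
        return "jdbc:phoenix:thin:url=http://" + host + ":" + port + ";serialization=PROTOBUF";
    }

    public static void main(String[] args) {
        // Assumption: the container runs on the local machine.
        System.out.println(thick("localhost", 2181));
        System.out.println(thin("localhost", 8765));
    }
}
```

Either URL can be passed to DriverManager.getConnection once the matching Phoenix client jar is on the classpath. The thin URL only works if the image actually runs a Query Server on 8765.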
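2. Set up a multi-node Spark environment with Docker (Docker can also be used in production, but docker-compose must be configured carefully, because the storage files need real-time backup)
Download https://gitee.com/astra_zhao/docker-spark, then run docker-compose up -d to install and start it. Once it is running, the exposed ports are the ones declared in docker-compose.yml.
Note that docker-compose.yml must include the following settings: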
master:
  image: gettyimages/spark
  command: bin/spark-class org.apache.spark.deploy.master.Master -h master
  hostname: master
  environment:
    MASTER: spark://master:7077
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: localhost
  extra_hosts:
    - "<hostname>:192.168.63.9"
    - "<phoenix-container-id>:172.17.0.2"
Adding extra_hosts lets the container resolve the host machine and lets the containers reach each other. Without it, startup fails with an error.
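Docker turns each extra_hosts entry into a line in the container's /etc/hosts file, which is why the Spark container can then resolve the host machine and the Phoenix container by name. The sketch below only illustrates that mapping from "hostname:ip" to a hosts-file line. ExtraHosts is a hypothetical helper and the hostname "myhost" is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: shows how "hostname:ip" entries map to /etc/hosts lines. */
final class ExtraHosts {
    private ExtraHosts() {}

    /** Converts one "hostname:ip" entry into the "ip<TAB>hostname" form used by /etc/hosts. */
    static String toHostsLine(String entry) {
        int sep = entry.indexOf(':'); // hostname is before the first colon, the IP after it
        if (sep <= 0 || sep == entry.length() - 1) {
            throw new IllegalArgumentException("expected hostname:ip, got: " + entry);
        }
        return entry.substring(sep + 1) + "\t" + entry.substring(0, sep);
    }

    /** Converts a whole extra_hosts list, preserving order. */
    static List<String> toHostsLines(List<String> entries) {
        List<String> lines = new ArrayList<>();
        for (String entry : entries) {
            lines.add(toHostsLine(entry));
        }
        return lines;
    }

    public static void main(String[] args) {
        // "myhost" is a placeholder hostname; the IP is the one used in the compose file above.
        System.out.println(toHostsLine("myhost:192.168.63.9"));
    }
}
```

Running cat /etc/hosts inside the container is a quick way to confirm that the entries arrived.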
3. Use Phoenix join operations and optimizations
See this article: https://www.cnblogs.com/sh425/p/7274283.html
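As a rough illustration of what such a join looks like from Java: Phoenix executes joins as hash joins by default, and the USE_SORT_MERGE_JOIN hint asks for a sort-merge join instead, which can help when neither side fits in memory. The sketch below is a hypothetical example. The tables ORDERS and CUSTOMERS and their columns are assumptions and do not come from this platform.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch of a Phoenix join query; table and column names are assumptions. */
final class PhoenixJoinSketch {
    private PhoenixJoinSketch() {}

    /** Builds the join query, optionally adding the sort-merge join hint. */
    static String joinQuery(boolean sortMerge) {
        String hint = sortMerge ? "/*+ USE_SORT_MERGE_JOIN */ " : "";
        return "SELECT " + hint + "o.ORDER_ID, c.NAME "
                + "FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID";
    }

    /** Runs the join on an open Phoenix JDBC connection and counts the returned rows. */
    static int countRows(Connection conn, boolean sortMerge) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(joinQuery(sortMerge))) {
            int n = 0;
            while (rs.next()) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        System.out.println(joinQuery(true));
    }
}
```

Running EXPLAIN on both variants in sqlline shows which join strategy Phoenix chose.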
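4. Build the Java example
4.1 Create the Maven project (setting up the Spring Boot project is left to the reader)
The Maven configuration below supports two packaging methods. mvn clean package -Dmaven.test.skip=true copies the third-party jars into the lib directory under target.
mvn clean package assembly:single builds a single standalone jar that bundles the dependencies. The first method is recommended.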
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<spark.version>2.4.0</spark.version>
<scala.binary.version>2.11</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<!--phoenix core-->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>5.0.0-HBase-2.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>5.0.0-HBase-2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.0.6</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>2.0.6</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>2.0.6</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.10</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>2.0.6</version>
</dependency>
<dependency>
<groupId>org.apache.htrace</groupId>
<artifactId>htrace-core</artifactId>
<version>3.2.0-incubating</version>
</dependency>
<dependency>
<groupId>io.dropwizard.metrics</groupId>
<artifactId>metrics-core</artifactId>
<version>3.2.6</version>
</dependency>
</dependencies>
<build>
<resources>
<!-- include .properties files in the build output -->
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*.properties</include>
</includes>
<filtering>true</filtering>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<classpathPrefix>lib/</classpathPrefix>
<mainClass>tech.zhaoxin.App</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.5.5</version>
<configuration>
<archive>
<manifest>
<mainClass>tech.zhaoxin.App</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
Create the Java class
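package com.astra; // must match the --class value passed to spark-submit

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class PhoenixSparkRead {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("spark://192.168.61.102:7077")
                .setAppName("phoenix-test");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(jsc);

        System.out.println("Starting step 1");
        // Load the Phoenix table "iteblog" through the phoenix-spark connector
        Dataset<Row> df = sqlContext
                .read()
                .format("org.apache.phoenix.spark")
                .option("table", "iteblog")
                .option("zkUrl", "192.168.61.102:2181")
                .load();
        df.createOrReplaceTempView("iteblog");

        System.out.println("Starting step 2");
        // Query the temporary view with the same SQLContext that registered it
        df = sqlContext.sql("SELECT * FROM iteblog");

        System.out.println("Starting step 3");
        // Collect the rows to the driver; only suitable for small tables
        List<Row> rows = df.collectAsList();
        System.out.println(rows);

        jsc.stop();
        System.out.println("Done");
    }
}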
5. Configuration and deployment
5.1 Log in to the Linux server, place spark-2.4.1-bin-hadoop2.7.tgz in the /opt directory, and extract it:
tar -xvzf spark-2.4.1-bin-hadoop2.7.tgz
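5.2 Copy the jars from the lib directory produced by the mvn package step above into /opt/jars/lib
5.3 Copy all of the required jars (shown in a screenshot in the original post) into the spark-2.4.1-bin-hadoop2.7/jars directory
5.4 Copy the same files into the Spark Docker container, for example:
docker cp /opt/phoenix/ 8ead:/usr/spark-2.4.1/jars/ (the /opt/phoenix directory contains only the jars from step 5.3)
Then enter the container and copy the jars from /usr/spark-2.4.1/jars/phoenix up into the parent directory.
Repeat this in both containers.
5.5 Finally, on the host, run the following command from the /opt/spark-2.4.1-bin-hadoop2.7/bin directory: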
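./spark-submit --class com.astra.PhoenixSparkRead --master spark://192.168.61.102:7077 --driver-memory 4g --jars $(echo /opt/jars/lib/*.jar | tr ' ' ',') /opt/jars/spark-zeppelin-learn-1.0-SNAPSHOT.jar
(spark-submit options must come before the application jar, because anything after the jar is passed to the application as arguments. --jars takes a comma-separated list, so the glob is joined with commas.)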
5.6 The job then prints the rows read from the table.
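When the submit step is scripted from Java, two spark-submit rules are easy to get wrong: every option has to appear before the application jar, and --jars expects a single comma-separated value. The following is a minimal sketch of assembling the argument list under those rules. SparkSubmitCommand is a hypothetical helper, and the jar names a.jar and b.jar in the usage are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: assembles a spark-submit argument list in the order spark-submit expects. */
final class SparkSubmitCommand {
    private SparkSubmitCommand() {}

    /** --jars takes one comma-separated value, not a space-separated list. */
    static String joinJars(List<String> jarPaths) {
        return String.join(",", jarPaths);
    }

    /** Options first, application jar last; anything after the jar would go to main(). */
    static List<String> build(String mainClass, String master, String driverMemory,
                              List<String> extraJars, String appJar) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--driver-memory");
        cmd.add(driverMemory);
        if (!extraJars.isEmpty()) {
            cmd.add("--jars");
            cmd.add(joinJars(extraJars));
        }
        cmd.add(appJar);
        return cmd;
    }

    public static void main(String[] args) {
        // a.jar and b.jar are placeholder names for the jars under /opt/jars/lib.
        List<String> cmd = build("com.astra.PhoenixSparkRead", "spark://192.168.61.102:7077", "4g",
                Arrays.asList("/opt/jars/lib/a.jar", "/opt/jars/lib/b.jar"),
                "/opt/jars/spark-zeppelin-learn-1.0-SNAPSHOT.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder, which avoids shell quoting and glob expansion problems.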