A simple example of integrating Doris with Spark for reading and writing
0. Preface
- Doris version: Doris-1.1.5
- Spark version: Spark-3.0.0
- IDEA version: IntelliJ IDEA 2019.2.3
- Scala version: Scala-2.12.11
1. Introduction to Spark Doris Connector
- Introduction
The Spark Doris Connector supports reading data stored in Doris through Spark, and also supports writing data to Doris through Spark.
Code base address: https://github.com/apache/incubator-doris-spark-connector
- Version compatibility
Connector | Spark | Doris | Java | Scala |
---|---|---|---|---|
2.3.4-2.11.xx | 2.x | 0.12+ | 8 | 2.11 |
3.1.2-2.12.xx | 3.x | 0.12+ | 8 | 2.12 |
3.2.0-2.12.xx | 3.2.x | 0.12+ | 8 | 2.12 |
- Maven dependency
<dependency>
<groupId>org.apache.doris</groupId>
<!-- use this version for Spark 3.x -->
<artifactId>spark-doris-connector-3.1_2.12</artifactId>
<!-- use this version for Spark 2.x -->
<!--artifactId>spark-doris-connector-2.3_2.11</artifactId-->
<version>1.1.0</version>
</dependency>
Note: do not use version 1.0.1 of the Spark Doris Connector from the official website here; the resulting error is demonstrated below.
2. Basic example
2.1 Prepare tables and data in advance
Start the FE and BE processes of Doris first.
-- Create table1
CREATE TABLE table1 (
siteid INT DEFAULT '10',
citycode SMALLINT,
username VARCHAR(32) DEFAULT '',
pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");
-- Insert data
insert into table1 values (1,1,'jim',2),
(2,1,'grace',2),
(3,2,'tom',2),
(4,3,'bush',3),
(5,3,'helen',3);
2.2 Create a new project
- Create a new Maven project named doris-module
- Prepare the Spark environment in pom.xml:
<properties>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.0.0</spark.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<!-- Spark dependencies -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<scope>provided</scope>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<scope>provided</scope>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<scope>provided</scope>
<version>${spark.version}</version>
</dependency>
<!-- Scala library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.11</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.49</version>
</dependency>
<!--spark-doris-connector-->
<dependency>
<groupId>org.apache.doris</groupId>
<artifactId>spark-doris-connector-3.1_2.12</artifactId>
<!--<artifactId>spark-doris-connector-2.3_2.11</artifactId>-->
<version>1.1.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Plugin required to compile Scala -->
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.1</version>
<executions>
<execution>
<id>compile-scala</id>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile-scala</id>
<goals>
<goal>add-source</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<!-- Bind to Maven's compile phase -->
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Assembly packaging plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<archive>
<manifest>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
<!--<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<!– Compile everything against JDK 1.8 –>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>-->
</plugins>
</build>
2.3 Reading and writing with SQL
2.3.1 Code
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SQLDemo {
  def main(args: Array[String]): Unit = {
    // TODO: comment this out if packaging and submitting to a cluster (running locally here)
    val sparkConf = new SparkConf().setAppName("SQLDemo").setMaster("local[2]")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    sparkSession.sql(
      """
        |CREATE TEMPORARY VIEW spark_doris
        |USING doris
        |OPTIONS(
        |  "table.identifier"="test_db.table1",
        |  "fenodes"="node01:8030",
        |  "user"="test",
        |  "password"="test"
        |);
        |""".stripMargin)

    // Read data
    sparkSession.sql("select * from spark_doris").show()

    // Write data
    // sparkSession.sql("insert into spark_doris values(99,99,'haha',5)")
  }
}
- Reading data
Run the program: the rows of table1 are printed to the console.
To verify correctness, connect to the FE with a MySQL client and query table1; the query result matches the program's output.
- Writing data
Uncomment the insert statement and run the program, then query table1 again: the new row is present, showing that writing data to Doris through Spark succeeded.
2.3.2 Related Errors
- At first, choosing Add Framework Support... from the project's right-click menu did not offer Scala. Fix: select the Scala-related libraries in the project settings and delete them, then right-click the project and choose Add Framework Support... again to add the Scala environment.
- java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at cn.whybigdata.doris.spark.SQLDemo$.main(SQLDemo.scala:7)
at cn.whybigdata.doris.spark.SQLDemo.main(SQLDemo.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
Process finished with exit code 1
This occurs because spark-core_x.xx, spark-sql_x.xx, and spark-hive_x.xx are all declared with provided scope in pom.xml. Solutions:
- Open Edit Run/Debug Configurations, select the Application to be executed, and check Include dependencies with "Provided" scope.
- The above only applies to the current .scala program. To make it apply to all Applications in the project, edit the configuration template instead: select Templates, then Application, and check Include dependencies with "Provided" scope.
- A more direct way is to comment out the provided scope of the spark-core_x.xx, spark-sql_x.xx, and spark-hive_x.xx dependencies, but this is not recommended: in most cases the job is packaged and submitted to a cluster rather than run locally, the cluster generally already has a Spark environment, and dependencies with provided scope are not loaded when executing on the cluster.
- [Bug] spark doris connector read table error: Doris FE's response cannot map to schema.
This is a bug in spark-doris-connector 1.0.1 itself; it is fixed in version 1.1.0.
2.4 Reading and writing with DataFrame (batch)
2.4.1 Code
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    // TODO: comment this out if packaging and submitting to a cluster
    val sparkConf = new SparkConf().setAppName("DataFrameDemo").setMaster("local[2]")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    // TODO: write data
    // import sparkSession.implicits._
    // val mockDataDF = List(
    //   (11, 23, "haha", 8),
    //   (11, 3, "hehe", 9),
    //   (11, 3, "heihei", 10)
    // ).toDF("siteid", "citycode", "username", "pv")
    // mockDataDF.show(5)
    //
    // mockDataDF.write.format("doris")
    //   .option("doris.table.identifier", "test_db.table1")
    //   .option("doris.fenodes", "node01:8030")
    //   .option("user", "test")
    //   .option("password", "test")
    //   // Specify the fields to write
    //   // .option("doris.write.fields", "user")
    //   .save()

    // TODO: read data
    val dorisSparkDF = sparkSession.read.format("doris")
      .option("doris.table.identifier", "test_db.table1")
      .option("doris.fenodes", "node01:8030")
      .option("user", "test")
      .option("password", "test")
      .load()
    dorisSparkDF.show()
  }
}
2.4.2 Writing data
Uncomment the write section and run it, then query table1 in Doris to verify that the three mock rows were written.
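When only a subset of columns should be written, doris.write.fields (see section 3.1) restricts the written fields. Below is a minimal, untested sketch assuming the same sparkSession and table1 as above; the sample row is made up, and the expectation that the omitted pv column falls back to its DEFAULT '0' is an assumption.
import sparkSession.implicits._

// A made-up row that omits the pv column entirely.
val partialDF = List((12, 5, "lily")).toDF("siteid", "citycode", "username")

partialDF.write.format("doris")
  .option("doris.table.identifier", "test_db.table1")
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  // Write only these three fields; pv is assumed to take its column default
  // and then be SUM-aggregated by the AGGREGATE KEY model.
  .option("doris.write.fields", "siteid,citycode,username")
  .save()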
2.4.3 Reading data
Run the read section: the contents of table1 are printed to the console.
2.5 RDD demo
The RDD API currently only supports reading data from Doris.
- Code
import org.apache.spark.{SparkConf, SparkContext}

object RDDDemo {
  def main(args: Array[String]): Unit = {
    // TODO: comment this out if packaging and submitting to a cluster
    val sparkConf = new SparkConf().setAppName("RDDDemo").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // TODO: read data
    import org.apache.doris.spark._
    val dorisSparkRDD = sc.dorisRDD(
      tableIdentifier = Some("test_db.table1"),
      cfg = Some(Map(
        "doris.fenodes" -> "node01:8030",
        "doris.request.auth.user" -> "test",
        "doris.request.auth.password" -> "test"
      ))
    )
    dorisSparkRDD.collect().foreach(println)
  }
}
Run the program: each row of table1 is printed as a list of field values.
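Although the RDD API is read-only, the RDD-specific options described later in section 3.3 can prune columns and push a filter down to Doris. A small, untested sketch under the same setup as the demo above:
// Read only two columns and let Doris filter rows at the source.
val filteredRDD = sc.dorisRDD(
  tableIdentifier = Some("test_db.table1"),
  cfg = Some(Map(
    "doris.fenodes" -> "node01:8030",
    "doris.request.auth.user" -> "test",
    "doris.request.auth.password" -> "test",
    "doris.read.field" -> "siteid,username", // columns to read
    "doris.filter.query" -> "citycode = 3"   // passed through to Doris
  ))
)
filteredRDD.collect().foreach(println)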
2.6 Other ways to write data
Data can also be written to Doris from Spark using Structured Streaming.
- Official example
// stream sink (Structured Streaming)
val kafkaSource = spark.readStream
  .option("kafka.bootstrap.servers", "$YOUR_KAFKA_SERVERS")
  .option("startingOffsets", "latest")
  .option("subscribe", "$YOUR_KAFKA_TOPICS")
  .format("kafka")
  .load()

kafkaSource.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("doris")
  .option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION")
  .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
  .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT")
  .option("user", "$YOUR_DORIS_USERNAME")
  .option("password", "$YOUR_DORIS_PASSWORD")
  // Other options
  // Specify the fields to write
  .option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE")
  .start()
  .awaitTermination()
3. Description of configuration items
3.1 General configuration items
Key | Default Value | Comment |
---|---|---|
doris.fenodes | – | Doris FE http address, supports multiple addresses, separated by commas |
doris.table.identifier | – | Doris table name, such as: db1.tbl1 |
doris.request.retries | 3 | Number of retries to send requests to Doris |
doris.request.connect.timeout.ms | 30000 | Connection timeout for sending requests to Doris |
doris.request.read.timeout.ms | 30000 | Read timeout for sending requests to Doris |
doris.request.query.timeout.s | 3600 | Timeout for querying Doris; the default is 1 hour; -1 means no timeout limit |
doris.request.tablet.size | Integer.MAX_VALUE | The number of Doris Tablets corresponding to an RDD Partition. The smaller this value is set, the more Partitions will be generated. This improves the parallelism on the Spark side, but at the same time puts more pressure on Doris. |
doris.batch.size | 1024 | The maximum number of rows to read from BE at a time. Increasing this value reduces the number of connections between Spark and Doris, thereby reducing the extra overhead caused by network latency. |
doris.exec.mem.limit | 2147483648 | Memory limit for a single query. The default is 2GB, the unit is byte |
doris.deserialize.arrow.async | false | Whether to support asynchronous conversion of Arrow format to RowBatch required for spark-doris-connector iteration |
doris.deserialize.queue.size | 64 | Internal processing queue size for the asynchronous conversion of the Arrow format; takes effect when doris.deserialize.arrow.async is true |
doris.write.fields | – | Specify the fields or the order of fields written to the Doris table, and separate multiple columns with commas. When writing by default, all fields are written in the order of the Doris table fields. |
sink.batch.size | 10000 | The maximum number of rows that can be written to BE at a time |
sink.max-retries | 1 | Number of retries after a failed write to BE |
sink.properties.* | – | Import parameters for Stream Load. For example: 'sink.properties.column_separator' = ', ' |
doris.sink.task.partition.size | – | The number of partitions corresponding to the Doris write task. After filtering and other operations on a Spark RDD, the final number of partitions to write may be large while each partition holds few records, which increases the write frequency and wastes computing resources. The smaller this value, the lower the Doris write frequency and the lower the Doris merge pressure. Used together with doris.sink.task.use.repartition. |
doris.sink.task.use.repartition | false | Whether to use the repartition method to control the number of Partitions written by Doris. The default value is false, which is controlled by coalesce (note: if there is no Spark action operator before writing, the parallelism of the entire calculation may be reduced). If set to true, the repartition method will be used (note: the last Partition number can be set, but additional shuffle overhead will be added). |
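All of these options are passed through the same .option(...) calls (or the cfg Map for RDDs) used in the demos above. A minimal, untested sketch tuning a read and a write; the chosen values are illustrative only, and test_db.table1_copy is a hypothetical target table:
// Tuned read: one tablet per Spark partition (more parallelism),
// larger fetch batches (fewer BE round trips).
val tunedDF = sparkSession.read.format("doris")
  .option("doris.table.identifier", "test_db.table1")
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  .option("doris.request.tablet.size", "1")
  .option("doris.batch.size", "4096")
  .load()

// Tuned write to a hypothetical copy table: bigger flush batches,
// more retries, and an explicit Stream Load column separator.
tunedDF.write.format("doris")
  .option("doris.table.identifier", "test_db.table1_copy") // hypothetical
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  .option("sink.batch.size", "20000")
  .option("sink.max-retries", "3")
  .option("sink.properties.column_separator", ",")
  .save()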
3.2 SQL and DataFrame-specific configuration
Key | Default Value | Comment |
---|---|---|
user | – | Username to access Doris |
password | – | Password to access Doris |
doris.filter.query.in.max.count | 100 | In predicate pushdown, the maximum number of elements in the value list of the in expression. If this number is exceeded, the in expression conditional filtering is processed on the Spark side. |
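As an illustration of doris.filter.query.in.max.count: an IN filter on a Doris-backed DataFrame is pushed down to Doris as long as its value list stays at or below this limit; larger lists are filtered on the Spark side instead. A small, untested sketch reusing dorisSparkDF from section 2.4:
import org.apache.spark.sql.functions.col

// With three elements (well under the default limit of 100),
// this IN condition is filtered by Doris, not by Spark.
dorisSparkDF.filter(col("siteid").isin(1, 2, 3)).show()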
3.3 RDD-specific configuration
Key | Default Value | Comment |
---|---|---|
doris.request.auth.user | – | Username to access Doris |
doris.request.auth.password | – | Password to access Doris |
doris.read.field | – | List of column names to read from the Doris table, separated by commas |
doris.filter.query | – | Expression used to filter the data being read; it is passed through to Doris, which uses it to filter data at the source |
3.4 Doris-to-Spark column type mapping
Doris Type | Spark Type |
---|---|
NULL_TYPE | DataTypes.NullType |
BOOLEAN | DataTypes.BooleanType |
TINYINT | DataTypes.ByteType |
SMALLINT | DataTypes.ShortType |
INT | DataTypes.IntegerType |
BIGINT | DataTypes.LongType |
FLOAT | DataTypes.FloatType |
DOUBLE | DataTypes.DoubleType |
DATE | DataTypes.StringType¹ |
DATETIME | DataTypes.StringType¹ |
BINARY | DataTypes.BinaryType |
DECIMAL | DecimalType |
CHAR | DataTypes.StringType |
LARGEINT | DataTypes.StringType |
VARCHAR | DataTypes.StringType |
DECIMALV2 | DecimalType |
TIME | DataTypes.DoubleType |
HLL | Unsupported datatype |
¹ Note: the Connector maps DATE and DATETIME to String. Due to the processing logic of Doris's underlying storage engine, the time range covered when the time types are used directly cannot meet the requirements, so the String type is used to return the corresponding human-readable time text.
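Because of this mapping, DATE and DATETIME values arrive in Spark as strings, so convert them explicitly when timestamp semantics are needed. A minimal, untested sketch assuming a hypothetical table test_db.events with a DATETIME column event_time:
import org.apache.spark.sql.functions.{col, to_timestamp}

// event_time is read back as a String per the mapping table above;
// convert it to a Spark TimestampType column when needed.
val eventsDF = sparkSession.read.format("doris")
  .option("doris.table.identifier", "test_db.events") // hypothetical table
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  .load()
  .withColumn("event_ts", to_timestamp(col("event_time")))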
4. Using JDBC
This is an older approach and is not recommended, because Spark cannot perceive the data distribution inside Doris, so the queries hitting Doris put very heavy pressure on it.
- Code:
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object JDBCDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("JDBCDemo").setMaster("local[2]")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    // TODO: write data
    // import sparkSession.implicits._
    // val mockDataDF = List(
    //   (21, 23, "bj", 8),
    //   (21, 13, "sh", 9),
    //   (21, 31, "sz", 10)
    // ).toDF("siteid", "citycode", "username", "pv")
    //
    // val prop = new Properties()
    // prop.setProperty("user", "test")
    // prop.setProperty("password", "test")
    //
    // mockDataDF.write.mode(SaveMode.Append)
    //   .jdbc("jdbc:mysql://node01:9030/test_db", "table1", prop)

    // TODO: read data
    val df = sparkSession.read.format("jdbc")
      .option("url", "jdbc:mysql://node01:9030/test_db")
      .option("user", "test")
      .option("password", "test")
      .option("dbtable", "table1")
      .load()
    df.show()
  }
}
- Writing data: uncomment the write section and run it, then query table1 to confirm the rows were appended.
- Reading data: run the read section; the contents of table1 are printed to the console.
5. Other integrations
Doris can also be integrated with Flink, DataX, MySQL, Logstash, and ODBC external tables; refer to the official website for details.
6. References
- https://doris.apache.org/zh-CN/docs/dev/ecosystem/spark-doris-connector
- https://9to5answer.com/java-lang-noclassdeffounderror-org-apache-spark-sql-sparksession
- https://github.com/apache/doris-spark-connector/issues/39