A simple example of Doris integrating Spark reading and writing




0. Before we start

  • Doris version: Doris-1.1.5
  • Spark version: Spark-3.0.0
  • IDEA version: IntelliJ IDEA 2019.2.3
  • Scala version: Scala-2.12.11

1. Introduction to Spark Doris Connector

  • Introduction

The Spark Doris Connector supports reading data stored in Doris through Spark, and also writing data to Doris through Spark.

Code base address: https://github.com/apache/incubator-doris-spark-connector

  • Version compatibility

    Connector        Spark   Doris   Java   Scala
    2.3.4-2.11.xx    2.x     0.12+   8      2.11
    3.1.2-2.12.xx    3.x     0.12+   8      2.12
    3.2.0-2.12.xx    3.2.x   0.12+   8      2.12
  • Managing the dependency with Maven
<dependency>
  <groupId>org.apache.doris</groupId>
   <!-- use this version for Spark 3.x -->
  <artifactId>spark-doris-connector-3.1_2.12</artifactId>
   <!-- use this version for Spark 2.x -->
  <!--artifactId>spark-doris-connector-2.3_2.11</artifactId-->
  <version>1.1.0</version>
</dependency>
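
If the project is managed with sbt instead of Maven, an equivalent declaration (a sketch, assuming the same artifact coordinates as above) would be:

// build.sbt
libraryDependencies += "org.apache.doris" % "spark-doris-connector-3.1_2.12" % "1.1.0"
// for Spark 2.x: "org.apache.doris" % "spark-doris-connector-2.3_2.11" % "1.1.0"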

Note: do not use version 1.0.1 (the one shown on the official website) of the Spark Doris Connector here; the related error is demonstrated below.

2. Basic example

2.1 Prepare tables and data in advance

Start the FE and BE of Doris.

-- create table1
CREATE TABLE table1 (
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '', 
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username) 
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");

-- insert data
insert into table1 values (1,1,'jim',2),
(2,1,'grace',2),
(3,2,'tom',2),
(4,3,'bush',3),
(5,3,'helen',3);

2.2 New project

  • Create a new Maven project named doris-module

  • Prepare the Spark environment: pom.xml

<properties>
    <scala.binary.version>2.12</scala.binary.version>
    <spark.version>3.0.0</spark.version>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
</properties>

<dependencies>

    <!-- Spark dependencies -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <scope>provided</scope>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <scope>provided</scope>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.binary.version}</artifactId>
        <scope>provided</scope>
        <version>${spark.version}</version>
    </dependency>
    <!-- Scala library -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.11</version>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.47</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.49</version>
    </dependency>

    <!--spark-doris-connector-->
    <dependency>
        <groupId>org.apache.doris</groupId>
        <artifactId>spark-doris-connector-3.1_2.12</artifactId>
        <!--<artifactId>spark-doris-connector-2.3_2.11</artifactId>-->
        <version>1.1.0</version>
    </dependency>

</dependencies>

<build>
    <plugins>
        <!-- plugin needed to compile Scala -->
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.1</version>
            <executions>
                <execution>
                    <id>compile-scala</id>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>test-compile-scala</id>
                    <goals>
                        <goal>add-source</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- bind to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <!-- assembly 打包插件 -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <archive>
                    <manifest>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
        <!-- Optional: compile everything against JDK 1.8
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        -->
    </plugins>
</build>

2.3 Use SQL to read and write

2.3.1 Code

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SQLDemo {

    def main(args: Array[String]): Unit = {

        // TODO: comment out setMaster if you package and submit to a cluster (run locally here)
        val sparkConf = new SparkConf().setAppName("SQLDemo").setMaster("local[2]")
        val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

        sparkSession.sql( """
                            |CREATE TEMPORARY VIEW spark_doris
                            |USING doris
                            |OPTIONS(
                            | "table.identifier"="test_db.table1",
                            | "fenodes"="node01:8030",
                            | "user"="test",
                            | "password"="test"
                            |); """.stripMargin)

        // read data
        sparkSession.sql("select * from spark_doris").show()
        // write data
//        sparkSession.sql("insert into spark_doris values(99,99,'haha',5)")
    }
}

Result of reading the data:

[screenshot omitted]

Verify the result: connect to FE with a MySQL client and query table1; the result is as follows:

[screenshot omitted]

The Spark output is consistent with the data actually stored in Doris.

  • Writing data

After running the insert statement, query table1 again; the result is as follows:

[screenshot omitted]

As you can see, writing data to Doris through Spark succeeded.

2.3.2 Related Errors

At first, right-clicking the project and choosing Add framework support does not offer Scala.

  • Select the Scala-related dependencies and delete them (as shown below)

[screenshots omitted]

Right-click again, choose Add framework support, and add the Scala environment.

java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
	at cn.whybigdata.doris.spark.SQLDemo$.main(SQLDemo.scala:7)
	at cn.whybigdata.doris.spark.SQLDemo.main(SQLDemo.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 2 more

Process finished with exit code 1

This happens because spark-core_x.xx, spark-sql_x.xx, and spark-hive_x.xx are all declared with the provided scope in pom.xml. The solutions are as follows:

  • Open Edit Run/Debug Configurations, select the Application to be executed, and check Include dependencies with "Provided" scope, as shown below

[screenshot omitted]

The above only applies to the current .scala program. To apply it to all Applications in the project, edit the configuration templates, select Application, and check Include dependencies with "Provided" scope.

[screenshot omitted]

There is also a more direct way: comment out the <scope>provided</scope> elements of the spark-core_x.xx, spark-sql_x.xx, and spark-hive_x.xx dependencies. This is not recommended, because in most cases the job is packaged and submitted to a cluster rather than run locally; the cluster generally already has a Spark environment, and dependencies with provided scope are not bundled when executed on the cluster.

  • [Bug] spark doris connector read table error: Doris FE’s response cannot map to schema.

The cause is a bug in spark-doris-connector 1.0.1 itself, which has been fixed in version 1.1.0.

2.4 Using DataFrames to read and write data (batch)

2.4.1 Code

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DataFrameDemo {

    def main(args: Array[String]): Unit = {

        // TODO: comment out setMaster if you package and submit to a cluster
        val sparkConf = new SparkConf().setAppName("DataFrameDemo").setMaster("local[2]")
        val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

        // TODO: write data
//        import sparkSession.implicits._
//        val mockDataDF = List(
//            (11, 23, "haha", 8),
//            (11, 3, "hehe", 9),
//            (11, 3, "heihei", 10)
//        ).toDF("siteid", "citycode", "username", "pv")
//        mockDataDF.show(5)
//
//        mockDataDF.write.format("doris")
//          .option("doris.table.identifier", "test_db.table1")
//          .option("doris.fenodes", "node01:8030")
//          .option("user", "test")
//          .option("password", "test")
//          // specify the fields to write
//          //  .option("doris.write.fields", "user")
//          .save()

        // TODO: read data
        val dorisSparkDF = sparkSession.read.format("doris")
          .option("doris.table.identifier", "test_db.table1")
          .option("doris.fenodes", "node01:8030")
          .option("user", "test")
          .option("password", "test")
          .load()
        dorisSparkDF.show()
    }
}

2.4.2 Writing data

Run result:

[screenshot omitted]

Verification:

[screenshot omitted]

2.4.3 Reading data

Run result:

[screenshot omitted]

2.5 RDD demo

The RDD API currently only supports reading data.

  • Code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession


object RDDDemo {

    def main(args: Array[String]): Unit = {

        // TODO: comment out setMaster if you package and submit to a cluster
        val sparkConf = new SparkConf().setAppName("RDDDemo").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)

        // TODO: read data
        import org.apache.doris.spark._
        val dorisSparkRDD = sc.dorisRDD(
            tableIdentifier = Some("test_db.table1"),
            cfg = Some(Map(
                "doris.fenodes" -> "node01:8030",
                "doris.request.auth.user" -> "test",
                "doris.request.auth.password" -> "test"
            ))
        )

        dorisSparkRDD.collect().foreach(println)
    }
}

Run result:

[screenshot omitted]

2.6 Other ways to write data

You can also write data to Doris through Spark Structured Streaming, as in the official example below.

  • Official example
// stream sink (Structured Streaming)
val kafkaSource = spark.readStream
  .option("kafka.bootstrap.servers", "$YOUR_KAFKA_SERVERS")
  .option("startingOffsets", "latest")
  .option("subscribe", "$YOUR_KAFKA_TOPICS")
  .format("kafka")
  .load()
kafkaSource.selectExpr("CAST(key AS STRING)", "CAST(value as STRING)")
  .writeStream
  .format("doris")
  .option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION")
  .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
  .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESTFUL_PORT")
  .option("user", "$YOUR_DORIS_USERNAME")
  .option("password", "$YOUR_DORIS_PASSWORD")
  // other options
  // specify the fields to write
  .option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE")
  .start()
  .awaitTermination()
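
For a quick local test without a Kafka cluster, here is a minimal sketch (not from the official example) that streams rows from Spark's built-in rate source into the table1 used above; the FE address and credentials are the same assumptions as in the earlier examples.

import org.apache.spark.sql.SparkSession

object StructuredStreamingDemo {

    def main(args: Array[String]): Unit = {

        val spark = SparkSession.builder()
          .appName("StructuredStreamingDemo")
          .master("local[2]")
          .getOrCreate()

        // the built-in rate source emits (timestamp, value) rows; map them onto table1's columns
        val source = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "1")
          .load()
          .selectExpr(
              "CAST(value AS INT) AS siteid",
              "CAST(1 AS SMALLINT) AS citycode",
              "CONCAT('user_', CAST(value AS STRING)) AS username",
              "CAST(1 AS BIGINT) AS pv")

        source.writeStream
          .format("doris")
          .option("checkpointLocation", "/tmp/doris_streaming_ckpt")
          .option("doris.table.identifier", "test_db.table1")
          .option("doris.fenodes", "node01:8030")
          .option("user", "test")
          .option("password", "test")
          .start()
          .awaitTermination()
    }
}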

3. Description of configuration items

3.1 General configuration items

  • doris.fenodes (no default): Doris FE HTTP address; multiple addresses are supported, separated by commas.
  • doris.table.identifier (no default): Doris table name, e.g. db1.tbl1.
  • doris.request.retries (default 3): number of retries for requests sent to Doris.
  • doris.request.connect.timeout.ms (default 30000): connection timeout for requests sent to Doris.
  • doris.request.read.timeout.ms (default 30000): read timeout for requests sent to Doris.
  • doris.request.query.timeout.s (default 3600): timeout for querying Doris; the default is 1 hour, and -1 means no timeout limit.
  • doris.request.tablet.size (default Integer.MAX_VALUE): number of Doris tablets corresponding to one RDD partition. The smaller this value, the more partitions are generated, which increases parallelism on the Spark side but also puts more pressure on Doris.
  • doris.batch.size (default 1024): maximum number of rows read from BE at a time. Increasing this value reduces the number of connections between Spark and Doris, and thereby the extra overhead caused by network latency.
  • doris.exec.mem.limit (default 2147483648): memory limit for a single query, in bytes; the default is 2 GB.
  • doris.deserialize.arrow.async (default false): whether to asynchronously convert the Arrow format into the RowBatches required for spark-doris-connector iteration.
  • doris.deserialize.queue.size (default 64): internal processing queue for asynchronous Arrow conversion; takes effect when doris.deserialize.arrow.async is true.
  • doris.write.fields (no default): fields (and field order) written to the Doris table, separated by commas. By default all fields are written in the order of the Doris table schema.
  • sink.batch.size (default 10000): maximum number of rows written to BE at a time.
  • sink.max-retries (default 1): number of retries after a write to BE fails.
  • sink.properties.* (no default): import parameters for Stream Load, e.g. 'sink.properties.column_separator' = ', '.
  • doris.sink.task.partition.size (no default): number of partitions for the Doris write task. After filtering and other RDD operations, the number of partitions to write may be large while each partition holds few records, which increases write frequency and wastes computing resources. The smaller this value, the lower the Doris write frequency and merge pressure. Used together with doris.sink.task.use.repartition.
  • doris.sink.task.use.repartition (default false): whether to use repartition to control the number of partitions written to Doris. The default false uses coalesce (note: if there is no Spark action before the write, the parallelism of the whole computation may be reduced). If true, repartition is used (note: the final partition count can be set, but extra shuffle overhead is added).
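
These options are passed the same way as in the earlier examples, as extra .option(...) calls on the reader/writer (or extra keys in the OPTIONS clause of the SQL temporary view). A small sketch, reusing the sparkSession, FE address, and credentials assumed above:

val tunedDF = sparkSession.read.format("doris")
  .option("doris.table.identifier", "test_db.table1")
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  .option("doris.request.tablet.size", "4")      // fewer tablets per partition -> more Spark partitions
  .option("doris.batch.size", "4096")            // read more rows per round trip to BE
  .option("doris.exec.mem.limit", "4294967296")  // allow up to 4 GB per query
  .load()
tunedDF.show()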

3.2 SQL and Dataframe proprietary configuration

  • user (no default): username for accessing Doris.
  • password (no default): password for accessing Doris.
  • doris.filter.query.in.max.count (default 100): maximum number of elements in the value list of an IN expression for predicate pushdown. Beyond this number, the IN filter is evaluated on the Spark side instead.
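
A sketch of the SQL/DataFrame-specific options, again with the cluster and credentials assumed earlier. The IN list below stays under doris.filter.query.in.max.count, so the connector should be able to push the filter down to Doris rather than evaluating it on the Spark side:

val filteredDF = sparkSession.read.format("doris")
  .option("doris.table.identifier", "test_db.table1")
  .option("doris.fenodes", "node01:8030")
  .option("user", "test")
  .option("password", "test")
  .option("doris.filter.query.in.max.count", "10")  // push IN filters down only if they have at most 10 values
  .load()
  .where("siteid in (1, 2, 3)")
filteredDF.show()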

3.3 RDD-specific configuration

  • doris.request.auth.user (no default): username for accessing Doris.
  • doris.request.auth.password (no default): password for accessing Doris.
  • doris.read.field (no default): list of column names to read from the Doris table, separated by commas.
  • doris.filter.query (no default): expression used to filter the data being read; it is passed through to Doris, which uses it to filter data on the source side.
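
A sketch of the RDD-specific options, assuming the SparkContext sc from section 2.5: only two columns are read, and Doris filters the rows on the source side via doris.filter.query.

import org.apache.doris.spark._

val filteredRDD = sc.dorisRDD(
    tableIdentifier = Some("test_db.table1"),
    cfg = Some(Map(
        "doris.fenodes" -> "node01:8030",
        "doris.request.auth.user" -> "test",
        "doris.request.auth.password" -> "test",
        "doris.read.field" -> "siteid,username",   // read only these two columns
        "doris.filter.query" -> "citycode = 1"     // filtered by Doris on the source side
    ))
)
filteredRDD.collect().foreach(println)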

3.4 Doris-to-Spark column type mapping

Doris Type Spark Type
NULL_TYPE DataTypes.NullType
BOOLEAN DataTypes.BooleanType
TINYINT DataTypes.ByteType
SMALLINT DataTypes.ShortType
INT DataTypes.IntegerType
BIGINT DataTypes.LongType
FLOAT DataTypes.FloatType
DOUBLE DataTypes.DoubleType
DATE DataTypes.StringType (see note)
DATETIME DataTypes.StringType (see note)
BINARY DataTypes.BinaryType
DECIMAL DecimalType
CHAR DataTypes.StringType
LARGEINT DataTypes.StringType
VARCHAR DataTypes.StringType
DECIMALV2 DecimalType
TIME DataTypes.DoubleType
HLL Unsupported datatype

Note: in the connector, DATE and DATETIME are mapped to String. Because of the processing logic of Doris's underlying storage engine, directly using the time types would not cover a wide enough time range, so the connector returns the corresponding human-readable time text as a String.
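
A small sketch of what this means in practice. The event_date column here is hypothetical (none of the demo tables above has a DATE column): the connector would expose it as a string, and it can be converted back to a Spark date type with to_date if needed.

import org.apache.spark.sql.functions.{col, to_date}

// dorisSparkDF is the DataFrame read in section 2.4; "event_date" is a hypothetical DATE column
val withDate = dorisSparkDF.withColumn("event_date_parsed", to_date(col("event_date"), "yyyy-MM-dd"))
withDate.printSchema()   // event_date is string, event_date_parsed is date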

4. Using JDBC

This is the early approach and is not recommended: Spark is not aware of Doris's data distribution, so the queries hitting Doris put very heavy pressure on it.

  • Code:
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object JDBCDemo {

    def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setAppName("JDBCDemo").setMaster("local[2]")
        val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

        // TODO: write data
//        import sparkSession.implicits._
//        val mockDataDF = List(
//            (21, 23, "bj", 8),
//            (21, 13, "sh", 9),
//            (21, 31, "sz", 10)
//        ).toDF("siteid", "citycode", "username", "pv")
//
//        val prop = new Properties()
//        prop.setProperty("user", "test")
//        prop.setProperty("password", "test")
//
//        mockDataDF.write.mode(SaveMode.Append)
//          .jdbc("jdbc:mysql://node01:9030/test_db", "table1", prop)

        // TODO: read data
        val df = sparkSession.read.format("jdbc")
          .option("url", "jdbc:mysql://node01:9030/test_db")
          .option("user", "test")
          .option("password", "test")
          .option("dbtable", "table1")
          .load()

        df.show()
    }
}
  • Writing data

[screenshot omitted]

  • Reading data

[screenshot omitted]

5. Other integrations

Doris can also be integrated with Flink, DataX, MySQL, Logstash, and ODBC external tables; refer to the official website for details.

6. References

  • https://doris.apache.org/zh-CN/docs/dev/ecosystem/spark-doris-connector
  • https://9to5answer.com/java-lang-noclassdeffounderror-org-apache-spark-sql-sparksession
  • https://github.com/apache/doris-spark-connector/issues/39
