Spark DataFrame: insertInto() vs. saveAsTable(), and settings for dynamic-partition inserts into Hive tables

@Author  : Spinach | GHB
@Link    : http://blog.csdn.net/bocai8058

Introduction

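In Spark application development, you often need to write cleaned data into a Hive table. Spark offers several ways to do this: insertInto, saveAsTable, and an INSERT statement issued through Spark SQL. The sections below explain how they differ.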


insertInto()

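Saves the DataFrame's data into the specified Hive table, subject to two requirements:

The specified Hive table must already exist.

The column order of the DataFrame's schema must match the column order of the Hive table's schema. insertInto resolves columns by position and ignores the DataFrame's column names, which is why the rows in the example below land in t1 exactly as written even though their column names differ.

insertInto(tableName: String) inserts the data into the table named tableName: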

scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  5|  6|
|  3|  4|
|  1|  2|
+---+---+
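
Because insertInto matches columns by position, one way to guard against a mismatched column order is to reorder the DataFrame to the target table's column order before inserting. The following is a minimal sketch, assuming a local SparkSession (in spark-shell, `spark` already exists) and a freshly created t1 table; the names `df`, `targetCols`, and `rows` are introduced here for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration; in spark-shell `spark` is already defined.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("insertInto-column-order")
  .getOrCreate()
import spark.implicits._

// Create the target table with columns (i, j).
Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")

// The incoming DataFrame has the same columns in a different order.
val df = Seq((3, 4)).toDF("j", "i")

// Reorder the DataFrame to the table's column order, then insert by position.
val targetCols = spark.table("t1").columns.map(col)
df.select(targetCols: _*).write.insertInto("t1")

// Read the table back as (i, j) pairs.
val rows = spark.table("t1").orderBy("i").collect()
  .map(r => (r.getInt(0), r.getInt(1))).toSeq
```

With this reordering, the value labelled i in the DataFrame ends up in the table's i column, which plain insertInto would not guarantee.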

saveAsTable()

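Saves the DataFrame's data into the specified Hive table. There are two cases: the table does not exist, or it already exists.

If the table does not exist, it is created automatically from the DataFrame's schema.

If the table already exists, the behavior depends on the save mode set with mode() (the default, ErrorIfExists, throws an exception).
a. mode=Overwrite:

When the DataFrame has the same number of columns as the existing table: the DataFrame's column order does not need to match the existing table's column order.

When the DataFrame has a different number of columns than the existing table: the existing schema is discarded, and the table is recreated from the DataFrame's schema before the data is written.

b. mode=Append:

When the DataFrame has the same number of columns as the existing table: the DataFrame's column order does not need to match the existing table's column order. Unlike insertInto, saveAsTable uses the column names to find the correct column positions. (This is the key difference from insertInto.)

When the DataFrame has a different number of columns than the existing table: an error is raised.

saveAsTable(tableName: String) writes the data into the table named tableName: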

scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  1|  2|
|  4|  3|
+---+---+
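
The mode-dependent behavior described above can be sketched as follows. This is an illustrative example only, assuming a local SparkSession and Spark's default data source table format; the table name t2 and the names `appendResult` and `newCols` are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.{SaveMode, SparkSession}

// Local session for illustration; in spark-shell `spark` is already defined.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("saveAsTable-modes")
  .getOrCreate()
import spark.implicits._

// Start from a two-column table (i, j).
Seq((1, 2)).toDF("i", "j").write.mode(SaveMode.Overwrite).saveAsTable("t2")

// Append with a different number of columns: the write is rejected.
val appendResult = Try {
  Seq((3, 4, 5)).toDF("i", "j", "k").write.mode(SaveMode.Append).saveAsTable("t2")
}

// Overwrite with a different number of columns: the table is recreated
// from the DataFrame's schema.
Seq((3, 4, 5)).toDF("i", "j", "k").write.mode(SaveMode.Overwrite).saveAsTable("t2")
val newCols = spark.table("t2").columns.toSeq
```

Wrapping the append in Try keeps the sketch running past the expected failure, so both modes can be compared in one session.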

Using Spark SQL

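Register the DataFrame with createOrReplaceTempView or createOrReplaceGlobalTempView, then use Spark SQL to insert into the specified Hive table. Two requirements apply:

The specified Hive table must already exist.

The column order of the view's schema must match the column order of the Hive table's schema.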

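// Register the DataFrame as a temporary view.
val viewTmp = "tmpView1" + date
tmpModelDS.createOrReplaceTempView(viewTmp)
// With a static partition spec, the select list supplies only the non-partition columns.
spark.sql(s"""
             |insert overwrite table $tableName
             |partition(date='$date')
             |select * from $viewTmp
           """.stripMargin.trim)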

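Note: createOrReplaceTempView does not write or copy any data. It only registers a name for the DataFrame's logical plan (its DAG) in the current SparkSession so that Spark SQL queries can refer to it. The view is not stored in the Hive metastore.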
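
For readers who want to try the temporary-view approach end to end, here is a minimal sketch. It assumes a local SparkSession and, so that it runs without a Hive metastore, uses a Spark data source table (USING parquet) rather than a Hive table; the names events_static, events_view, `cleaned`, `dt`, and `written` are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` is already defined.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tempview-insert")
  .getOrCreate()
import spark.implicits._

// Hypothetical target table, partitioned by dt.
spark.sql(
  """CREATE TABLE IF NOT EXISTS events_static (id INT, name STRING, dt STRING)
    |USING parquet PARTITIONED BY (dt)""".stripMargin)

// The cleaned data holds only the non-partition columns.
val cleaned = Seq((1, "a"), (2, "b")).toDF("id", "name")
cleaned.createOrReplaceTempView("events_view")

// Static partition insert: the partition value comes from the PARTITION clause.
val dt = "2020-01-01"
spark.sql(
  s"""
     |INSERT OVERWRITE TABLE events_static
     |PARTITION (dt = '$dt')
     |SELECT id, name FROM events_view
   """.stripMargin)

// Count the rows that landed in the target partition.
val written = spark.table("events_static").where($"dt" === dt).count()
```

Listing the columns explicitly in the SELECT, rather than using select *, makes the positional match with the target table easier to check.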

Dynamic partition settings and code

To use dynamic partitioning, set the following parameters in Hive or Spark.

For Hive jobs

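-- Hive dynamic partition settings
-- Enable dynamic partitioning (the default is false in Hive versions before 0.9.0).
set hive.exec.dynamic.partition=true;
-- Allow every partition column to be dynamic; the default, strict, requires at least one static partition.
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hive implementation
-- The dynamic partition column (date) must be the last column returned by the select.
insert overwrite table tmpTable partition(date)
select * from viewTmp;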

For Spark jobs

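import org.apache.spark.sql.{SaveMode, SparkSession}

// Spark dynamic partition settings
val spark = SparkSession.builder()
  .config("hive.exec.dynamic.partition", "true") // enable dynamic partitioning
  .config("hive.exec.dynamic.partition.mode", "nonstrict") // allow every partition column to be dynamic; otherwise at least one static partition is required
  .enableHiveSupport() // required to read and write Hive tables
  .getOrCreate()

// Spark implementation: insertInto matches columns by position, so the partition column must come last in tmpModelDS
tmpModelDS.write.mode(SaveMode.Overwrite).insertInto(tableName)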
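
A fuller, self-contained version of the Spark approach might look like the sketch below. It assumes a Spark build with Hive support running in local mode (which uses an embedded metastore); the table events_dynamic, its columns, and the names `df` and `partitions` are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive-enabled session with dynamic partitioning switched on.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dynamic-partition-insert")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical Hive table partitioned by dt; recreated so the sketch is repeatable.
spark.sql("DROP TABLE IF EXISTS events_dynamic")
spark.sql(
  """CREATE TABLE events_dynamic (id INT, name STRING)
    |PARTITIONED BY (dt STRING) STORED AS PARQUET""".stripMargin)

// The partition column (dt) comes last, matching the table's column order.
val df = Seq((1, "a", "2020-01-01"), (2, "b", "2020-01-02")).toDF("id", "name", "dt")

// No partition value is given: each row's dt decides its partition.
df.write.mode(SaveMode.Overwrite).insertInto("events_dynamic")

// List the partitions that were created.
val partitions = spark.sql("SHOW PARTITIONS events_dynamic")
  .collect().map(_.getString(0)).sorted.toSeq
```

Each distinct dt value in the DataFrame becomes its own partition, without any partition value appearing in the write call.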

References: http://spark.apache.org/docs/2.3.3/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
http://spark.apache.org/docs/2.3.3/api/scala/index.html#org.apache.spark.sql.Dataset
