Hi everyone!
I spent some time working out the Scala code for connecting Spark SQL to MySQL, hit a pitfall along the way, and am sharing the result here.
package SparkSql

import java.util.Properties
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Administrator on 2017/10/12.
 * Purpose: demonstrate connecting Spark SQL to MySQL
 */
object MysqlDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MysqlDemo")
    val sc = new SparkContext(conf)
    val sqlcontext = new SQLContext(sc)
    val personrdd = sc.parallelize(Array("1 tom", "2 jerry", "3 kitty")).map(_.split(" "))
    // Specify the schema of each field directly via StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),   // true means nullable
        StructField("name", StringType, true)
      )
    )
    // Map the RDD to an RDD of Rows
    val rowrdd = personrdd.map(x => Row(x(0).toInt, x(1)))
    // Apply the schema to the row RDD
    val persondf = sqlcontext.createDataFrame(rowrdd, schema)
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "root")
    // Append the data to MySQL
    persondf.write.mode("append").jdbc("jdbc:mysql://192.168.17.108:3306/mysql", "test_a", prop)
    sc.stop()
  }
}
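The parallelize/split/map pipeline above runs on the cluster, but the per-line parsing can be checked in plain Scala with no Spark at all (sample lines copied from the code; the object name is just for illustration):

```scala
object SplitCheck {
  def main(args: Array[String]): Unit = {
    // Same sample lines as in MysqlDemo; split each line into Array(id, name)
    val parsed = Array("1 tom", "2 jerry", "3 kitty").map(_.split(" "))
    parsed.foreach(a => println(s"id=${a(0).toInt}, name=${a(1)}"))
  }
}
```

Note that here x(0) is a String (the first element of the split Array), so .toInt parses it to the number 1 — quite different from indexing into a raw String, which yields a Char, as the pitfall at the end of this post shows.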
Package this as workspace-scala.jar and upload it to /root/test. When submitting the job, point spark-submit at both the application jar and the MySQL driver jar:
/usr/local/spark/bin/spark-submit \
--class SparkSql.MysqlDemo \
--master spark://192.168.17.108:7077 \
--executor-memory 800m \
--jars /usr/local/mysql/lib/mysql-connector-java-5.1.35-bin.jar \
--driver-class-path /usr/local/mysql/lib/mysql-connector-java-5.1.35-bin.jar \
/root/test/workspace-scala.jar
You may then hit the following error, which says the MySQL driver was never registered:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, hadoop): java.lang.IllegalStateException: Did not find registered driver with class com.mysql.jdbc.Driver
Fix: copy mysql-connector-java-5.1.35-bin.jar into /usr/local/spark/lib, then edit spark-env.sh to add that path to the classpath:
export SPARK_CLASSPATH=/usr/local/spark/lib/*
Note: since the jar now sits in /usr/local/spark/lib/, you no longer need to point spark-submit at the driver jar when submitting:
/usr/local/spark/bin/spark-submit \
--class SparkSql.MysqlDemo \
--master spark://192.168.17.108:7077 \
--executor-memory 800m \
/root/test/workspace-scala.jar
After re-running the jar, I logged into MySQL to verify the data — everything checked out.
A small takeaway: if the table has only a single field, two parts of the code are the key points:
1. Change the RDD creation and the StructType as follows:
val personrdd = sc.parallelize(Array("11", "12", "13"))
// Specify the schema of each field directly via StructType
val schema = StructType(
  List(
    StructField("id", IntegerType, true)  // true means nullable
  )
)
2. Map the RDD to a row RDD — this matters, and it is where I hit the pitfall:
val rowrdd = personrdd.map(x => Row(x(0).toInt))
This is wrong. It compiles and runs without any error, but the values actually inserted are "49", "49", "49": here x is a String, so x(0) is the Char '1', and Char.toInt returns its character code (49), not the numeric value.
val rowrdd = personrdd.map(x => Row(x))
This is the correct version. After verifying, the data actually inserted into MySQL is "11", "12", "13".
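The wrong branch above comes down to how Scala treats indexing into a String. A minimal pure-Scala check (no Spark needed; the object name is just for illustration) of the two conversions:

```scala
object CharPitfall {
  def main(args: Array[String]): Unit = {
    val x = "11"
    // Indexing a String yields a Char, and Char.toInt is its character code:
    println(x(0))       // prints 1 (the Char '1')
    println(x(0).toInt) // prints 49 — the code of '1', which is what got inserted
    // Parsing the whole String gives the intended numeric value:
    println(x.toInt)    // prints 11
  }
}
```

So with multiple fields, x(0) after split(" ") is a String and .toInt parses it; with a single raw String, x(0) is a Char and .toInt silently gives the character code instead.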