Spark Reading MySQL Optimization - Standalone Version

1. Environment dependencies:

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.10.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.37</version>
    </dependency>
</dependencies>

2. Implementation:

import java.util.Properties

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

  val conf = new SparkConf().setMaster("local[*]").setAppName("mysql data read")
  val spark = SparkSession.builder().config(conf).getOrCreate()

  val url = "jdbc:mysql://localhost:3306/hisms_sn?user=root&password=root"
  val prop = new Properties()
  // option-style connection config (defined here but not used by the methods below)
  val properties = Map(
    "url" -> "jdbc:mysql://192.168.0.135:3306/disease-qy?useUnicode=true&characterEncoding=UTF-8",
    "driver" -> "com.mysql.jdbc.Driver",
    "user" -> "root",
    "password" -> "root")

  // Five ways to read from MySQL

  //1. No query condition specified --- parallelism is 1
  def method1(): Unit = {
    val df = spark.read.jdbc(url, "t_kc21k1", prop)
    println(df.count())
    df.show(5)
    println(df.rdd.partitions.size)
  }

  //2. Specify the range of a database column --- parallelism is 5
  /**
    * Second way: specify the range of a database column.
    * The range is given by lowerBound and upperBound.
    * The partition column is given by columnName (only integer columns are supported).
    * The number of partitions is given by numPartitions (it should not be too large).
    */
  def method2(): Unit = {
    val lowerBound = 1
    val upperBound = 100000
    val numPartitions = 5
    val df = spark.read.jdbc(url, "t_kc21k1", "id", lowerBound, upperBound, numPartitions, prop)
    println(df.count())
    println(df.rdd.partitions.size)
    df.show(5)
  }
  //3. Partition by an arbitrary column --- parallelism is 2
  def method3(): Unit = {
    // use predicates to split the data into two partitions by akc194
    val predicates = Array[String]("akc194 <= '2016-06-30'", "akc194 <= '2017-01-01' and akc194 > '2016-06-30'")
    val df = spark.read.jdbc(url, "t_kc21k1", predicates, prop)
    println(df.count())
    println(df.rdd.partitions.size)
    df.show(5)
  }

  //4. Read via the load method --- parallelism is 1, the same as method1
  def method4(): Unit = {
    val df = spark.read.format("jdbc")
      .options(Map("url" -> url, "dbtable" -> "t_kc21k1"))
      .option("fetchSize", 1000)
      .load()
    println(df.count())
    println(df.rdd.partitions.size)
    df.show(5)
  }

  //5. Load data through a query condition
  def method5(): Unit = {
    val query = "select id, aac003, id_drg, name_drg from t_kc21k1 where id > 50000"
    // Note: the query must be wrapped in parentheses, because the value of dbtable is
    // treated as a table to select from; the mysql connector automatically appends
    // "where 1=1" after dbtable
    val df = spark.read.format("jdbc").options(Map("url" -> url, "dbtable" -> s"($query) kc21k1")).load()
    println(df.count())
    println(df.rdd.partitions.size)
    df.show(5)
  }
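
For completeness, here is a minimal sketch of wiring one of these reads into a self-contained, runnable object. The object name MysqlReadDemo is illustrative, not from the original post; the connection string, table name, and partitioning values are the ones used in the snippet above.

import java.util.Properties

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal standalone sketch (assumed wiring, not the author's original file):
// runs the partitioned read from method2 end to end on a local master.
object MysqlReadDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("mysql data read")
    val spark = SparkSession.builder().config(conf).getOrCreate()

    val url = "jdbc:mysql://localhost:3306/hisms_sn?user=root&password=root"
    val prop = new Properties()

    // 5 partitions over the integer column "id", between 1 and 100000
    val df = spark.read.jdbc(url, "t_kc21k1", "id", 1L, 100000L, 5, prop)
    println(df.count())
    println(df.rdd.partitions.size) // expected: 5
    df.show(5)

    spark.stop()
  }
}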

Increasing the number of partitions for the read only increases parallelism. On a standalone Spark instance it does not reduce memory usage, because Spark reads a relational database by pulling the data into memory and then computing on it there.

Problem:

    When using the standalone version of Spark on Windows, without a Hive environment, reading large MySQL tables and then joining them makes Spark SQL prone to memory overflow.

1. At present the only way to keep memory from blowing up is to reduce the amount of data that is read, for example by selecting only the columns the result actually needs (see the sketch after this list).

2. You can also configure a Hive environment; Spark SQL will then lean on Hive rather than relying solely on local in-memory computation, which helps prevent memory overflow.
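
To illustrate point 1, here is a minimal sketch of pushing the projection and filter into the dbtable subquery, so that only the needed columns and rows are ever transferred into Spark's memory. It reuses the url and spark values defined above; the alias t_kc21k1_pruned and the chosen columns are illustrative, not from the original post.

  // Sketch: let MySQL do the projection and filtering so that only the required
  // columns and rows are pulled into local Spark memory.
  val prunedTable = "(select id, aac003, id_drg, name_drg " +
    "from t_kc21k1 where akc194 > '2016-06-30') t_kc21k1_pruned"

  val slimDf = spark.read.format("jdbc")
    .options(Map("url" -> url, "dbtable" -> prunedTable))
    .option("fetchSize", 1000)
    .load()

  println(slimDf.rdd.partitions.size)
  slimDf.show(5)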

Reproduced from: https://my.oschina.net/shea1992/blog/3058600

Origin: blog.csdn.net/weixin_33750452/article/details/91967452