Spark Custom Sorting and Partitioning

Preface:

With the continuous development of the information age, data has become the theme of our times, and today we swim in an ocean of it. The explosive growth of data has produced a crowd of computing engines competing for this era, and Spark, one of today's most mainstream engines, shows its strength in every respect: whether in data processing or in data analysis and mining, it demonstrates a leading capability, and its distributed computing model grows more and more popular. This article describes Spark's sorting and partitioning.

One. Spark Custom Sorting

Spark encapsulates many high-level APIs, and using them in daily development brings a lot of convenience. Sometimes, however, the default rules are not enough to reach our goal; then we need to understand the underlying principles and write processing logic that fits our needs. The following code briefly shows how to do a custom sort in Spark.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CustomSort1 {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("CustomSort1").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Sort rule: descending by face value (fv) first; if the face values are equal, ascending by age
    val users = Array("laoduan 30 99", "laozhao 29 9999", "laozhang 28 98", "laoyang 28 99")

    // Parallelize the Driver-side data into an RDD
    val lines: RDD[String] = sc.parallelize(users)

    // Split and organize the data
    val userRDD: RDD[User] = lines.map(line => {
      val fields = line.split(" ")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toInt
      // (name, age, fv)
      new User(name, age, fv)
    })

    // Does not meet the requirement:
    // tpRDD.sortBy(tp => tp._3, false)

    // Sort the User-typed data inside the RDD
    val sorted: RDD[User] = userRDD.sortBy(u => u)

    val r = sorted.collect()
    println(r.toBuffer)

    sc.stop()
  }
}

class User(val name: String, val age: Int, val fv: Int) extends Ordered[User] with Serializable {

  override def compare(that: User): Int = {
    if (this.fv == that.fv) {
      this.age - that.age
    } else {
      -(this.fv - that.fv)
    }
  }

  override def toString: String = s"name: $name, age: $age, fv: $fv"
}

 

There are several ways to define a custom ordering:

1. Make the User class extend Ordered, which turns User into a sortable type. Even though we test locally, Spark still simulates the cluster model, so at run time our custom objects are transferred over the network during the shuffle, which involves serialization; therefore the class must also extend Serializable.

2. Use a case class:

 case class Man(age: Int, fv: Int) extends Ordered[Man] {
   override def compare(that: Man): Int =
     if (this.fv == that.fv) this.age - that.age else that.fv - this.fv
 }

A case class does not need to extend a serialization trait, because case classes are serializable by default; it still has to implement compare for Ordered, as above.
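A minimal usage sketch (assuming the tpRDD of (name, age, fv) tuples built in the main program code further below): the case class serves only as the sort key, so the data itself keeps its tuple format.

    // Ordered[Man] supplies the Ordering that sortBy needs for the key
    val sorted: RDD[(String, Int, Int)] = tpRDD.sortBy(tp => Man(tp._2, tp._3))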

3. Define the sort rule as an implicit Ordering instance:

 object SortRules {

   implicit object OrderingUser extends Ordering[User] {
     override def compare(x: User, y: User): Int = {
       if (x.fv == y.fv) {
         x.age - y.age
       } else {
         y.fv - x.fv
       }
     }
   }
 }

 

 

Main program code:

// Split the data, then apply the sort rule
    val tpRDD: RDD[(String, Int, Int)] = lines.map(line => {
      val fields = line.split(" ")
      val name = fields(0)
      val age = fields(1).toInt
      val fv = fields(2).toInt
      (name, age, fv)
    })

    // Sort (passing in a sort rule does not change the format of the data, only its order)
    import SortRules.OrderingUser
    val sorted: RDD[(String, Int, Int)] = tpRDD.sortBy(tp => User(tp._2, tp._3))
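Note that this variant assumes a plain two-field case class rather than the three-field Ordered class from the first example, something like the hypothetical sketch below; the imported implicit Ordering supplies the comparison, so the class itself needs neither Ordered nor Serializable.

 case class User(age: Int, fv: Int)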

 

4. Some special data types need no custom rule at all; the native API is easier.

// Take advantage of the built-in tuple comparison rule: compare the first element first, and compare the second only when the first elements are equal
val sorted: RDD[(String, Int, Int)] = tpRDD.sortBy(tp => (-tp._3, tp._2))

 

5. Add the sort rule through an implicit conversion (Ordering.on):

    // Ordering[(Int, Int)]: the format that the final comparison rule works on
    // on[(String, Int, Int)]: the format of the data before conversion
    // (t => (-t._3, t._2)): how to convert the data into the format the rule compares
    implicit val rules = Ordering[(Int, Int)].on[(String, Int, Int)](t => (-t._3, t._2))
    val sorted: RDD[(String, Int, Int)] = tpRDD.sortBy(tp => tp)

 

Two. Spark Custom Partitioner

1. combineByKey

Operators such as reduceByKey and groupByKey are implemented on top of the combineByKey operator. It is a lower-level operator, so it lets you customize rules more flexibly.

rdd.combineByKey(x => x, (m: Int, n: Int) => m + n, (a: Int, b: Int) => a + b, new HashPartitioner(2), true, null)

 

Parameter explanation:

(1) createCombiner: turns the first value of each key into the initial combined value (values with the same key end up in the same partition)

(2) mergeValue: partial (local) aggregation within a partition

(3) mergeCombiners: global aggregation across partitions

(4) partitioner: the partitioner (the number of partitions can be set here)

(5) mapSideCombine: whether to do partial aggregation on the map side

(6) serializer: the serializer parameter

   combineByKey is a relatively low-level API and may not be needed in ordinary situations, but when some high-level API cannot meet our needs, it gives us a convenient way to solve the problem.
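As a concrete illustration, here is a minimal word-count-style sketch using the six parameters above (the sample data and the sc handle are assumptions for illustration, not from the original article):

import org.apache.spark.HashPartitioner

// assumes sc is an existing SparkContext, as in the examples above
val pairs = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1), ("flink", 1)))

val counts = pairs.combineByKey(
  (v: Int) => v,              // createCombiner: the first value of a key becomes the initial combiner
  (m: Int, n: Int) => m + n,  // mergeValue: partial aggregation inside a partition
  (a: Int, b: Int) => a + b,  // mergeCombiners: global aggregation across partitions
  new HashPartitioner(2),     // partitioner: 2 output partitions
  true,                       // mapSideCombine: aggregate on the map side first
  null                        // serializer: null means use the default serializer
)

println(counts.collect().toBuffer)  // e.g. ArrayBuffer((spark,2), (hadoop,1), (flink,1))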

2. Custom partitioner

Spark computations inevitably involve a shuffle, during which data is distributed to different downstream partitions according to the partitioning rule. The partitioner therefore decides which downstream partition each piece of upstream data is sent to. Example: given student data from different majors, compute the scores per major and take the top N within each group.

(1) Define the custom partitioner

import org.apache.spark.Partitioner
import scala.collection.mutable

// Custom partitioner; majors: the collection of major names
class MajorPartitioner(majors: Array[String]) extends Partitioner {

  // This block belongs to the primary constructor (executed when the class is instantiated with new)
  // A Map used to store the partitioning rules
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (major <- majors) {
    // rules(major) = i
    rules.put(major, i)
    i += 1
  }

  // Return the number of partitions (the partition count of the next RDD)
  override def numPartitions: Int = majors.length

  // Compute the partition number from the key passed in; the key is a tuple (String, String)
  override def getPartition(key: Any): Int = {
    // Take the major out of the key
    val major = key.asInstanceOf[(String, String)]._1
    // Compute the partition number according to the rules
    rules(major)
  }
}

 

(2) Use the custom partitioner
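The snippet below references subjects, reduced, and topN, which are not defined in this excerpt; a plausible preparation, purely as an assumption for illustration, might look like this:

    // hypothetical sample data: (major, name, score)
    val raw = sc.parallelize(Seq(
      ("bigdata", "laozhao", 88), ("bigdata", "laoduan", 99), ("bigdata", "laozhang", 77),
      ("javaee", "xiaoxu", 50), ("javaee", "laoyang", 66)
    ))
    // key the data by (major, name) and aggregate the scores per student
    val reduced: RDD[((String, String), Int)] = raw.map(t => ((t._1, t._2), t._3)).reduceByKey(_ + _)
    // collect the distinct majors to the Driver in order to build the partitioner
    val subjects: Array[String] = reduced.map(_._1._1).distinct().collect()
    val topN = 2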

// Use the custom partitioner: specify it when repartitioning
    val majorPartitioner = new MajorPartitioner(subjects)

    // partitionBy repartitions according to the specified partitioner
    // The key of the RDD on which partitionBy is called is (String, String)
    val partitioned: RDD[((String, String), Int)] = reduced.partitionBy(majorPartitioner)

    // Process one partition at a time (operate on the data of a single partition)
    val sorted: RDD[((String, String), Int)] = partitioned.mapPartitions(it => {
      // Convert the iterator into a List, sort it, take the top N, and return an iterator
      it.toList.sortBy(_._2).reverse.take(topN).iterator
    })

    val r: Array[((String, String), Int)] = sorted.collect()

 

After the custom partitioning and shuffle, each partition holds the data of exactly one major; sorting the data inside each partition and taking the first N gives the desired result. However, this program still has a problem: with too much data it may cause an out-of-memory error, because we pull a whole partition into a List in order to sort it, and that List lives entirely in memory. How can we avoid this? Inside mapPartitions we can maintain a bounded collection instead of loading all the data: each time the collection exceeds N elements, the smallest one is evicted, and after iterating over the whole partition what remains in the collection is the desired top N; see the sketch below.
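A sketch of that idea under the same assumptions as above (partitioned and topN as defined earlier): a mutable.TreeSet keeps at most topN elements, so a partition is never materialized as a whole.

    import scala.collection.mutable

    val sortedTopN: RDD[((String, String), Int)] = partitioned.mapPartitions(it => {
      // order primarily by score, with the key as a tie-breaker so records with equal scores are not collapsed by the set
      implicit val ord: Ordering[((String, String), Int)] = Ordering.by(t => (t._2, t._1))
      val bounded = new mutable.TreeSet[((String, String), Int)]()
      it.foreach(t => {
        bounded += t
        if (bounded.size > topN) {
          // evict the smallest element so memory stays bounded at topN entries
          bounded -= bounded.head
        }
      })
      // return the retained elements in descending order of score
      bounded.toList.reverse.iterator
    })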

Three. Summary

  Whether it is sorting or partitioning, Spark encapsulates high-level APIs for us to use, but they are not applicable in every situation, only in some. By understanding the underlying rules behind these APIs, we can write programs customized to our own needs, which can greatly improve efficiency. Nothing fits everything; resourcefulness is the way to win.

 
