Spark: relative advantages and disadvantages of combineByKey, reduceByKey, and foldByKey over groupByKey

Avoid using groupByKey

Let's look at two ways to calculate word counts: one uses reduceByKey, the other uses groupByKey:

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()
val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()

 

Both approaches produce correct results, but on large data sets reduceByKey is more efficient. The reason is that, before shuffling the data, Spark can combine the output that shares a key on each partition locally.

The following diagram describes what happens during reduceByKey. Note that before the data is shuffled, pairs with the same key on the same machine are first combined locally (using the lambda function passed to reduceByKey). The same lambda function is then called again on each partition after the shuffle to produce the final result.

 

With groupByKey, every key-value pair is shuffled to the partitions of the downstream RDD. This causes a lot of unnecessary network transfer.

To decide which machine a key-value pair is shuffled to, Spark calls a partitioning function on the pair's key to determine the target machine. During the shuffle, if the shuffled data cannot all fit into an executor's memory, Spark spills data to disk. However, data is flushed to disk one key at a time (i.e. all the key-value pairs for that key): therefore, if the key-value pairs for a single key exceed the executor's available memory, an OOM exception is thrown. Newer versions of Spark handle this exception and let the job continue, but it is still something to avoid: when Spark has to spill to disk, performance suffers significantly.
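As a minimal sketch of the partitioning step (my own illustration, not from the original post; HashPartitioner is what Spark's defaultPartitioner falls back to when no partitioner is set):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)  // 4 downstream partitions
partitioner.getPartition("one")    // partition index in [0, 4)
partitioner.getPartition("three")  // the same key always maps to the same partition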

 

 

Therefore, when computing over a very large data set, the amount of data that reduceByKey and groupByKey have to shuffle differs significantly.

 

When tested on a small data set (still the word count example), however, groupByKey outperformed reduceByKey. Although reduceByKey wins in the shuffle phase, its memory usage is higher than groupByKey's, so it reports low-memory conditions relatively more often. If you use reduceByKey, the executors need more memory to perform the local aggregation.

 

Besides reduceByKey, the following functions are also better choices than groupByKey:

  1. combineByKey: can be used to combine elements, where the return value's type may differ from the input value's type
  2. foldByKey: initializes a "zero value", then performs an aggregation operation on the values of each key

These two functions are explained in detail below.

 

combineByKey

Let's look at the definition of combineByKey:

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}

 

We can see that this method calls combineByKeyWithClassTag:

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
}

 

Continuing to the next level of the call chain:

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]

Looking at the reduceByKey code, we can see that it too ultimately calls the combineByKeyWithClassTag method:

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

From the definition of combineByKeyWithClassTag: the first parameter (createCombiner) converts the V of an input <K, V> pair into the user-specified type C; the second parameter (mergeValue) merges a value V into a C (the user-defined type); the third parameter (mergeCombiners) combines two C values into a single C. We can also see that mapSideCombine defaults to true, so by default combineByKey and reduceByKey always perform a map-side combine first.
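To see these three parameters in action, here is a minimal sketch (my own illustration, reusing the wordPairsRDD defined above) of word count written directly with combineByKey, mirroring what reduceByKey does internally:

val wordCountsWithCombine = wordPairsRDD
  .combineByKey(
    (v: Int) => v,                 // createCombiner: first count seen for a key in a partition
    (c: Int, v: Int) => c + v,     // mergeValue: fold another count into the partial sum
    (c1: Int, c2: Int) => c1 + c2) // mergeCombiners: merge partial sums across partitions
  .collect()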

But for groupByKey:

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

 

We can see that although groupByKey also ultimately calls combineByKeyWithClassTag, it passes mapSideCombine = false, so no combine is performed on the map end.

 

Let's write an example that uses combineByKey to compute per-key averages:

type ScoreCollector = (Int, Double)
type PersonScores = (String, (Int, Double))
val initialScores = Array(("Alice", 90.0), ("Bob", 100.0), ("Tom", 93.0), ("Alice", 95.0), ("Bob", 70.0), ("Jack", 98.0))
val scoreData = sc.parallelize(initialScores).cache()
val createScoreCombiner = (score: Double) => (1, score)
val scoreMerge = (scorecollector: ScoreCollector, score: Double) =>
  (scorecollector._1 + 1, scorecollector._2 + score)
val scoreCombine = (scorecollector1: ScoreCollector, scorecollector2: ScoreCollector) =>
    (scorecollector1._1 + scorecollector2._1, scorecollector1._2 + scorecollector2._2)

scoreData.combineByKey(
  createScoreCombiner,
  scoreMerge,
  scoreCombine
).map { pscore: PersonScores => (pscore._1, pscore._2._2 / pscore._2._1) }.collect()
 

Output: Array[(String, Double)] = Array((Tom,93.0), (Alice,92.5), (Bob,85.0), (Jack,98.0))

The original pairs have the type (String, Double). The first argument to combineByKey converts each value into the form (Int, Double), which tracks the count and the sum of the scores. The second argument merges a new score into a combiner for the same key, incrementing the count and adding the score. The third argument combines two combiners for the same key, summing both the counts and the scores. Finally, for each key we divide the total score by the count to obtain the average.

Here we can see the difference between combineByKey and reduceByKey: with combineByKey, the output can have a different type than the input values.
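For comparison, here is a minimal sketch (my own illustration, not from the original post) of the same averages computed with reduceByKey: since reduceByKey cannot change the value type, the values must first be mapped into a (count, sum) pair:

val averagesWithReduce = scoreData
  .mapValues(score => (1, score))                     // value becomes (count, sum)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // add counts and sums per key
  .mapValues { case (count, sum) => sum / count }     // divide to get the average
  .collect()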

 

foldByKey

foldByKey initializes a "zero value" and then aggregates the values of each key. For example:

val initialScores = Array(("Alice", 90.0), ("Bob", 100.0), ("Tom", 93.0), ("Alice", 95.0), ("Bob", 70.0), ("Jack", 98.0))
val scoreData = sc.parallelize(initialScores).cache()
scoreData.foldByKey(0)(_+_).collect

 

Output: Array[(String, Double)] = Array((Tom,93.0), (Alice,185.0), (Bob,170.0), (Jack,98.0))

Here the "zero value" is 0. During the computation, each key's value is first combined with the "zero value" using the add operation, and then all values for the same key are added together (as defined by _ + _). So if we instead use:

scoreData.foldByKey(1)(_+_).collect

 

The output is: Array[(String, Double)] = Array((Tom,94.0), (Alice,187.0), (Bob,172.0), (Jack,99.0))

Note that the zero value is applied when the initial combiner for a key is created in each partition (see the source below), so a key whose values are spread across several partitions (Alice and Bob here) picks up the extra 1 more than once.
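Because the zero value takes part in every key's aggregation, it is usually chosen as the identity of the operation (0 for addition, Double.MinValue for max) so that it does not change the result. A minimal sketch (my own illustration) on the same scoreData:

scoreData.foldByKey(Double.MinValue)((a, b) => math.max(a, b)).collect()
// Expected: Array((Tom,93.0), (Alice,95.0), (Bob,100.0), (Jack,98.0)) (order may vary)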

Here is the foldByKey source:

def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}

 

You can see that, like reduceByKey and combineByKey, it ultimately calls the combineByKeyWithClassTag method; it does not override mapSideCombine, so by default a map-side combine is performed.

 

Therefore, on large data sets, reduceByKey, combineByKey, and foldByKey are better choices than groupByKey, because they reduce the amount of data that has to be shuffled.

 

 

References

[1] https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

[2] http://codingjunkie.net/spark-combine-by-key/

 
