// Collect the small dataset to the driver, turn it into a Map, then ship it to every executor as a broadcast variable
val smallMap = smallRDD.collect().toMap
val smallBroadcast = sc.broadcast(smallMap)
val joined = largeRDD.map { case (key, value) =>
  // Look up this key in the broadcast Map
  val smallValue: Option[Char] = smallBroadcast.value.get(key)
  (key, (value, smallValue))
}.filter(_._2._2.isDefined) // Drop rows whose key has no match
joined.collect().foreach(println)
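The same map-then-filter logic can be checked without a cluster on plain Scala collections. The sample data below is hypothetical, standing in for the broadcast Map and largeRDD:

```scala
// Hypothetical stand-ins: smallMap plays the broadcast value, large plays largeRDD
val smallMap = Map(1 -> 'a', 2 -> 'b', 3 -> 'c')
val large = Seq((1, 10), (2, 20), (5, 50))

val joined = large
  .map { case (key, value) => (key, (value, smallMap.get(key))) }
  .filter(_._2._2.isDefined) // key 5 has no match and is dropped

// joined == Seq((1, (10, Some('a'))), (2, (20, Some('b'))))
```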
Approach 2: flatMap + ListBuffer (typical; also works when one key can yield several output rows)
import scala.collection.mutable.ListBuffer

val smallMap = smallRDD.collect().toMap
val smallBroadcast = sc.broadcast(smallMap)
val joined = largeRDD.flatMap { case (key, value) =>
  // Prepare a buffer up front
  val buffer = ListBuffer[(Int, (Int, Char))]()
  val smallValue: Option[Char] = smallBroadcast.value.get(key)
  // Append only when the key has a match
  if (smallValue.isDefined)
    buffer.append((key, (value, smallValue.get)))
  // Return the buffer; flatMap flattens it, so empty buffers vanish
  buffer
}
joined.collect().foreach(println)
Approach 3: flatMap + Some/None (best)
val smallMap = smallRDD.collect().toMap
val smallBroadcast = sc.broadcast(smallMap)
val joinedRDD = largeRDD.flatMap { case (key, value) =>
  // Pattern matching keeps this readable
  smallBroadcast.value.get(key) match {
    // The result is an Option; None values are flattened away by flatMap,
    // which filters out unmatched keys
    case Some(v) => Some((key, (value, v)))
    case None => None
  }
}
joinedRDD.collect().foreach(println)
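Because flatMap can flatten an Option directly, the Some/None match is equivalent to a single map on the Option. A plain-Scala sketch with hypothetical stand-in data:

```scala
// Hypothetical stand-ins for the broadcast Map and the large dataset
val smallMap = Map(1 -> 'a', 2 -> 'b')
val large = Seq((1, 10), (3, 30))

// Option.map replaces the match: a None stays None and is flattened away
val joined = large.flatMap { case (k, v) => smallMap.get(k).map(c => (k, (v, c))) }
// joined == Seq((1, (10, 'a'))); key 3 disappears
```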
Common misuses and pitfalls
Common mistake: looking up values by traversing a collection (an incorrect example follows)
val small = smallRDD.collect()
val smallBroadcast = sc.broadcast(small)
val joined = largeRDD.flatMap { case (key, value) =>
  // find traverses the collection looking for an equal key: inefficient
  // Correct approach: use a HashMap, so the hash of the key locates the value directly
  smallBroadcast.value.find(_._1 == key) match {
    case Some(v) => Some((key, (value, v)))
    case None => None
  }
}
val small = smallRDD.collect()
val smallBroadcast = sc.broadcast(small)
val joined = largeRDD.flatMap { case (key, value) =>
  // Inside map/flatMap, the broadcast value is converted again (here, toMap);
  // this conversion runs once per record in the RDD, which is very costly
  // Correct approach: convert the data before broadcasting it
  smallBroadcast.value.toMap.get(key) match {
    case Some(v) => Some((key, (value, v)))
    case None => None
  }
}
Common pitfall: keys in the source data are not unique (a correct example follows)
// Keys in the source data are not unique
val smallRDD = sc.parallelize(
  Seq((1, 'a'), (1, 'c'), (2, 'a'), (3, 'x'), (3, 'y'), (4, 'a')),
  4
)
val largeRDD = sc.parallelize(
  for (x <- 1 to 10000) yield (x % 4, x),
  4
)
// When source keys are not unique, group them before collecting and broadcasting
val smallMap = smallRDD.groupByKey()
.collect()
.toMap
val smallBroadcast = sc.broadcast(smallMap)
val joinedRDD = largeRDD.flatMap { case (key, value) =>
  smallBroadcast.value.get(key) match {
    case Some(iter) => iter.map(v => (key, (value, v)))
    case None => Iterable[(Int, (Int, Char))]()
  }
}
joinedRDD.collect().foreach(println)
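Grouping first is necessary because Scala's toMap keeps only the last value it sees for a duplicate key. A plain-Scala sketch of the difference, using the duplicate-key data from the example above:

```scala
val dup = Seq((1, 'a'), (1, 'c'), (2, 'a'))

// toMap silently drops (1, 'a'): the later (1, 'c') wins
val lossy = dup.toMap // Map(1 -> 'c', 2 -> 'a')

// Grouping keeps every value per key, mirroring groupByKey before collect
val grouped = dup.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
// grouped == Map(1 -> Seq('a', 'c'), 2 -> Seq('a'))
```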