Spark Programming Guide (Part 3)

Working with Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.

For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
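
As a minimal sketch of the sorting and collecting step, the snippet below replaces data.txt with a small in-memory collection so that it can run on its own. The sample lines, the application name, and the local master setting are assumptions made for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master is used only so the sketch runs standalone.
val conf = new SparkConf().setAppName("LineCountSketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical in-memory stand-in for the lines of data.txt.
val lines = sc.parallelize(Seq("banana", "apple", "banana", "cherry", "apple", "banana"))
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

// sortByKey() orders the pairs by key (alphabetically for String keys);
// collect() brings the result back to the driver as an array.
val sorted: Array[(String, Int)] = counts.sortByKey().collect()
sorted.foreach { case (line, n) => println(s"$line: $n") }
```

Because collect() materializes the whole result on the driver, it is best reserved for results known to be small.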

Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals() method is accompanied by a matching hashCode() method. For full details, see the contract outlined in the Object.hashCode() documentation.
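
The following sketch illustrates the note with a hypothetical key class, UserId, whose equality ignores letter case. The class, the sample data, and the local master setting are all assumptions for illustration. The point is only that hashCode() is derived from the same normalized value that equals() compares, so keys that are equal always hash alike.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key type: two UserIds are equal when their ids match, ignoring case.
class UserId(val id: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: UserId => id.equalsIgnoreCase(that.id)
    case _            => false
  }
  // Must agree with equals(): equal keys have to produce the same hash code.
  override def hashCode(): Int = id.toLowerCase.hashCode
}

// Assumption: a local master is used only so the sketch runs standalone.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("CustomKeySketch").setMaster("local[2]"))

val visits = sc.parallelize(Seq(
  (new UserId("Alice"), 1), (new UserId("alice"), 1), (new UserId("Bob"), 1)))

// "Alice" and "alice" are treated as the same key, so their counts are merged.
val perUser: Map[String, Int] = visits
  .reduceByKey(_ + _)
  .map { case (k, n) => (k.id.toLowerCase, n) }
  .collect()
  .toMap
```

In Scala, a case class generates consistent equals() and hashCode() implementations automatically, which is often the simpler choice for keys.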

Transformations

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

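Transformation	Meaning
map(func)	Return a new RDD formed by passing each element of the source through the function func.
filter(func)	Return a new RDD formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement (withReplacement), using the given random number generator seed.
union(otherDataset)	Return a new dataset that contains the union of the elements in the source dataset and otherDataset.
intersection(otherDataset)	Return a new RDD that contains the intersection of the elements in the source dataset and otherDataset.
distinct([numTasks])	Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])	When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks])	When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)	When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements, i.e. the Cartesian product).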
pipe(command, [envVars])	Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)	Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
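repartition(numPartitions)	Reshuffle the data in the RDD to create either more or fewer partitions and balance the data evenly across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.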
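
To make two of the pair transformations in the table more concrete, here is a small sketch that computes a per-key average with aggregateByKey and then combines the result with a second dataset using join. The datasets, names, and local master setting are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master is used only so the sketch runs standalone.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("TransformationsSketch").setMaster("local[2]"))

// Hypothetical (name, score) pairs.
val scores = sc.parallelize(Seq(("alice", 90), ("bob", 70), ("alice", 80), ("bob", 90)))

// aggregateByKey: the zero value is a (sum, count) accumulator.
// seqOp folds one value into a partition-local accumulator;
// combOp merges accumulators coming from different partitions.
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val averages = sumCount.mapValues { case (sum, n) => sum.toDouble / n }

// join pairs up the values of matching keys from both datasets.
val teams = sc.parallelize(Seq(("alice", "red"), ("bob", "blue")))
val joined: Array[(String, (Double, String))] =
  averages.join(teams).sortByKey().collect()
```

Here the aggregated type (Int, Int) differs from the input value type Int, which is the case aggregateByKey is designed for; a plain sum of the values could use reduceByKey instead.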

Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Action	Meaning
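reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()	Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()	Return the number of elements in the dataset.
first()	Return the first element of the dataset (similar to take(1)).
take(n)	Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])	Return the first n elements of the dataset using either their natural order or a custom comparator.
saveAsTextFile(path)	Write the elements of the dataset as text files to the local filesystem, HDFS, or any other Hadoop-supported file system. Spark converts each element to a string and writes it as one line of the file.
saveAsSequenceFile(path) (Java and Scala)	Write the elements of the dataset as a Hadoop SequenceFile to the local filesystem, HDFS, or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.).
saveAsObjectFile(path) (Java and Scala)	Write the elements of the dataset in a simple format using Java serialization, which can then be loaded again using SparkContext.objectFile().
countByKey()	Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)	Run a function func on each element of the dataset.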
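
The sketch below exercises a few of the actions from the table on small in-memory datasets. The data and the local master setting are assumptions for illustration. Each action triggers a job and returns an ordinary Scala value to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master is used only so the sketch runs standalone.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ActionsSketch").setMaster("local[2]"))

val nums = sc.parallelize(Seq(5, 3, 8, 1, 9, 2))

// reduce: addition is commutative and associative, so it is safe in parallel.
val total: Int = nums.reduce(_ + _)

// count: number of elements in the dataset.
val n: Long = nums.count()

// takeOrdered: smallest elements by natural order, or by a custom ordering.
val smallest: Array[Int] = nums.takeOrdered(3)
val largest: Array[Int] = nums.takeOrdered(2)(Ordering[Int].reverse)

// countByKey: number of elements per key, returned to the driver as a map.
// In the Scala API the counts are Long values.
val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).countByKey()
```

Like collect(), these actions return their results to the driver, so takeOrdered and countByKey are best used when the result is expected to be small.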