Reposted from: https://www.cnblogs.com/zzhangyuhang/p/9001608.html
1.keys
Function:
Returns the key of every key-value pair.
Example
val list = List("hadoop", "spark", "hive", "spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x, 1))
pairRdd.keys.collect.foreach(println)
Result
hadoop
spark
hive
spark
list: List[String] = List(hadoop, spark, hive, spark)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[142] at parallelize at command-3434610298353610:2
pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[143] at map at command-3434610298353610:3
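Note that keys is a lazy transformation (a projection of the first tuple element) and it does not deduplicate, which is why spark appears twice above. A minimal sketch of the equivalence, reusing pairRdd from the example; the distinct line is only an assumption for the case where unique keys are wanted:

// keys is equivalent to projecting the first element of each pair
pairRdd.map(_._1).collect.foreach(println)      // hadoop spark hive spark
// duplicates are kept by keys; chain distinct if unique keys are needed
pairRdd.keys.distinct.collect.foreach(println)  // hadoop spark hive (order may vary)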
2.values
Function:
Returns the value of every key-value pair.
Example
val list = List("hadoop", "spark", "hive", "spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x, 1))
pairRdd.values.collect.foreach(println)
Result
1
1
1
1
list: List[String] = List(hadoop, spark, hive, spark)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[145] at parallelize at command-3434610298353610:2
pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[146] at map at command-3434610298353610:3
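Symmetrically, values is just a projection of the second tuple element. A minimal one-line sketch, reusing pairRdd from the example:

pairRdd.map(_._2).collect.foreach(println)  // equivalent to pairRdd.values; prints 1 four times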
3.mapValues(func)
Function:
Applies a function to the value of every key-value pair, while leaving the key unchanged.
Example
val list = List("hadoop", "spark", "hive", "spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x, 1))
pairRdd.mapValues(_ + 1).collect.foreach(println) // add 1 to each value
Result
(hadoop,2)
(spark,2)
(hive,2)
(spark,2)
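A practical difference between mapValues and a plain map over the pairs: mapValues preserves the RDD's partitioner, because Spark knows the keys cannot change, while map drops it. A minimal sketch, assuming a HashPartitioner with 2 partitions purely for illustration:

import org.apache.spark.HashPartitioner

val partitioned = pairRdd.partitionBy(new HashPartitioner(2))
// mapValues keeps the partitioner, so a later reduceByKey/join can avoid a shuffle
println(partitioned.mapValues(_ + 1).partitioner)                   // Some(org.apache.spark.HashPartitioner@...)
// a plain map on the same data loses it, even though the keys are untouched
println(partitioned.map { case (k, v) => (k, v + 1) }.partitioner)  // None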