[Spark] Common transformations: keys, values, and mapValues

Reposted from: https://www.cnblogs.com/zzhangyuhang/p/9001608.html

1. keys

Function:

  Returns the key of every key-value pair in the RDD.

Example


val list = List("hadoop","spark","hive","spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x,1)) // build a pair RDD of (word, 1)
pairRdd.keys.collect.foreach(println)

Result


hadoop
spark
hive
spark
list: List[String] = List(hadoop, spark, hive, spark)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[142] at parallelize at command-3434610298353610:2
pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[143] at map at command-3434610298353610:3
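In effect, keys is just shorthand for mapping each pair to its first tuple element. A minimal equivalent sketch:

val sameAsKeys = pairRdd.map(_._1)  // take the first element of each pair
sameAsKeys.collect.foreach(println) // hadoop, spark, hive, spark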

2. values

Function:

  Returns the value of every key-value pair in the RDD.

Example


val list = List("hadoop","spark","hive","spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x,1))
pairRdd.values.collect.foreach(println)

Result

1
1
1
1
list: List[String] = List(hadoop, spark, hive, spark)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[145] at parallelize at command-3434610298353610:2
pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[146] at map at command-3434610298353610:3
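Symmetrically, values is shorthand for mapping each pair to its second tuple element. A minimal equivalent sketch:

val sameAsValues = pairRdd.map(_._2)  // take the second element of each pair
sameAsValues.collect.foreach(println) // 1, 1, 1, 1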

3. mapValues(func)

Function:

  Applies a function to the value of each key-value pair; the keys themselves are left unchanged.

Example


val list = List("hadoop","spark","hive","spark")
val rdd = sc.parallelize(list)
val pairRdd = rdd.map(x => (x,1))
pairRdd.mapValues(_+1).collect.foreach(println) // add 1 to each value

Result


(hadoop,2)
(spark,2)
(hive,2)
(spark,2)
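Because mapValues never touches the key, Spark can keep the RDD's existing partitioner, which makes it a natural companion to reduceByKey. A minimal sketch of the classic per-key average (the scores data below is made up for illustration):

val scores = sc.parallelize(List(("spark",90),("hadoop",80),("spark",70)))
val sumCount = scores.mapValues(v => (v, 1))                          // (key, (value, 1))
                     .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (key, (sum, count))
val avg = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
avg.collect.foreach(println) // (spark,80.0) (hadoop,80.0)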


Reposted from blog.csdn.net/m0_37870649/article/details/81667426