RDD Functions

● countByKey : counts the number of occurrences of each key; the values play no part in the count

scala> val a = sc.parallelize(List(("ws",1),("nice",2),("ws",2),("hi",1),("hi",9)))
a: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> a.countByKey
res141: scala.collection.Map[String,Long] = Map(hi -> 2, ws -> 2, nice -> 1)
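Note that countByKey is an action: the resulting Map is collected to the driver, so it is only safe when the number of distinct keys is small. A distributed equivalent that keeps the counts as an RDD can be sketched like this (assuming the RDD `a` from the transcript above):

```scala
// Count occurrences per key without collecting to the driver:
// map every value to 1, then sum the 1s per key.
val counts = a.mapValues(_ => 1L).reduceByKey(_ + _)
counts.collect  // same totals as countByKey, as an Array in arbitrary order
```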

● filterByRange : filters the elements of a key-value RDD by key range; both bounds are inclusive (a closed interval), as the examples below show

scala> val a = sc.parallelize(List((1,2),(2,3),(3,5),(3,2),(4,7)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> a.filterByRange(1,4)
res145: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[52] at filterByRange at <console>:27

scala> a.filterByRange(1,4).collect
res146: Array[(Int, Int)] = Array((1,2), (2,3), (3,5), (3,2), (4,7))

scala> a.filterByRange(1,3).collect
res147: Array[(Int, Int)] = Array((1,2), (2,3), (3,5), (3,2))
scala> val a = sc.parallelize(List(("a",1),("aa",2),("c",1),("b",1),("e",1)))
a: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[55] at parallelize at <console>:24

scala> a.filterByRange("a","b")
res148: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[56] at filterByRange at <console>:27

scala> a.filterByRange("a","b").collect
res149: Array[(String, Int)] = Array((a,1), (aa,2), (b,1))
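filterByRange comes from OrderedRDDFunctions, so it is only available when the key type has an implicit Ordering. As a sketch (assuming the RDD `a` above): if the RDD has been range-partitioned first, for example by sortByKey, Spark can skip partitions that lie entirely outside the range instead of scanning every element:

```scala
// Sort first so the RDD gets a RangePartitioner; filterByRange can then
// prune whole partitions rather than filtering every record.
val sorted = a.sortByKey()
sorted.filterByRange("a", "b").collect  // same inclusive-bounds result
```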

● flatMapValues : applies a flatMap to the values of a [K,V] RDD, pairing each resulting value with the original key

scala> val a = sc.parallelize(List(("a","1 2"),("b","1 2"),("c","1 2"),("d","1 2"),("e","1 2")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.flatMapValues(_.split(" "))
res1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[1] at flatMapValues at <console>:27

scala> a.flatMapValues(_.split(" ")).collect
res2: Array[(String, String)] = Array((a,1), (a,2), (b,1), (b,2), (c,1), (c,2), (d,1), (d,2), (e,1), (e,2))
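flatMapValues is equivalent to a flatMap over the pairs that repeats the key once per produced value, with the extra benefit that it preserves the parent RDD's partitioner. A sketch of the hand-written equivalent, assuming the RDD `a` above:

```scala
// Manual equivalent of a.flatMapValues(_.split(" ")) -- produces the same
// pairs, but a plain flatMap loses any partitioner the parent had.
val manual = a.flatMap { case (k, v) => v.split(" ").map(x => (k, x)) }
```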

● foldByKey : aggregates the values of each key, starting from a supplied zero value

scala> val a = sc.parallelize(List("one","two","three","one","four"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> val b = a.map(x=>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[7] at map at <console>:26

scala> b.collect
res11: Array[(Int, String)] = Array((3,one), (3,two), (5,three), (3,one), (4,four))

scala> val c = b.foldByKey("")(_+_)
c: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[8] at foldByKey at <console>:28

scala> c.collect
res12: Array[(Int, String)] = Array((4,four), (3,onetwoone), (5,three))
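The zero value should be the identity element of the merge function, because Spark may apply it once per partition: "" for string concatenation above, 0 for addition. A per-key sum, as a sketch:

```scala
// foldByKey with 0 as the zero element sums the values of each key.
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.foldByKey(0)(_ + _).collect  // Array((a,4), (b,2)), in arbitrary order
```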

● cache : caches the RDD in memory for reuse. With the default MEMORY_ONLY level, partitions that do not fit in memory are simply not cached and are recomputed when needed; other storage levels are available via the underlying persist()

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val cacheed = a.cache
cacheed: a.type = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> cacheed.count
res1: Long = 10

The cache is shown in the figure below.
(Figure: the cached RDD listed in the Spark web UI.)
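cache is shorthand for persist(StorageLevel.MEMORY_ONLY). When dropping partitions under memory pressure is not acceptable, an explicit storage level can be chosen instead (a sketch, assuming the RDD `a` above):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
// instead of discarding them and recomputing later.
val persisted = a.persist(StorageLevel.MEMORY_AND_DISK)
persisted.count  // the first action materializes the cache
```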

● unpersist : removes the RDD's cached data from memory; the cache is also released automatically when the application ends.

scala> cacheed.unpersist()
res6: cacheed.type = ParallelCollectionRDD[0] at parallelize at <console>:24

As the figure shows, the memory cache is now empty.
(Figure: the Spark web UI after unpersist, with no cached data.)
● checkpoint : persists a copy of the RDD's data to a fault-tolerant file system (such as HDFS)
First set a checkpoint directory on HDFS:

scala> sc.setCheckpointDir("hdfs://hadoop-01:9000/checkpoint-10-3")

After the code above runs, a directory is created on HDFS:

[root@hadoop-02 ~]# hdfs dfs -ls /
drwxr-xr-x   - root supergroup          0 2018-09-26 02:58 /checkpoint-10-3/11f244c0-6de2-4de4-b8b8-81b1ea2f1dfb

Using checkpoint:

scala> val a = sc.parallelize(1 to 1000000)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

# Note: calling checkpoint does not immediately write data to HDFS; it only marks the RDD for checkpointing, and nothing is written until an action runs.
scala> a.checkpoint

# An action triggers the write to HDFS. As shown below, Spark first runs the job for the action and then launches a second job that recomputes the RDD and writes it to the checkpoint directory.
scala> a.sum
res10: Double = 5.000005E11 

# Inspect the HDFS directory:
[root@hadoop-02 ~]# hdfs dfs -ls /checkpoint-10-3/11f244c0-6de2-4de4-b8b8-81b1ea2f1dfb/rdd-2
Found 3 items
-rw-r--r--   3 root supergroup    3560045 2018-09-26 03:06 /checkpoint-10-3/11f244c0-6de2-4de4-b8b8-81b1ea2f1dfb/rdd-2/part-00000
-rw-r--r--   3 root supergroup    3560045 2018-09-26 03:06 /checkpoint-10-3/11f244c0-6de2-4de4-b8b8-81b1ea2f1dfb/rdd-2/part-00001
-rw-r--r--   3 root supergroup    3560055 2018-09-26 03:06 /checkpoint-10-3/11f244c0-6de2-4de4-b8b8-81b1ea2f1dfb/rdd-2/part-00002
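Because the checkpoint job recomputes the RDD from the beginning of its lineage, a commonly recommended pattern is to cache the RDD before checkpointing so the data is computed only once (a sketch, assuming the same SparkContext):

```scala
// Cache first: the action materializes the data in memory, and the follow-up
// checkpoint job can read the cached partitions instead of recomputing.
val a = sc.parallelize(1 to 1000000).cache()
a.checkpoint()
a.sum  // runs the action job, then the checkpoint job writes to HDFS
```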

Reposted from blog.csdn.net/bb23417274/article/details/82926488