Spark RDD (Resilient Distributed Dataset) Operations
Run these on the master node of a multi-node Hadoop cluster.
Start the Spark shell: spark-shell --master local[*]
1 Create intRDD and convert it to an Array
val intRDD = sc.parallelize(List(3,1,2,5,5))
intRDD.collect()
2 Create stringRDD and convert it to an Array
val stringRDD = sc.parallelize(List("Apple","Orange","Banana","Grape","Apple"))
stringRDD.collect()
3 map operations on numbers
(1) Named function (note: enter one line at a time)
def addone(x:Int):Int={
| return (x + 1)
| }
intRDD.map(addone).collect()
(2) Anonymous function
intRDD.map(x => x + 1).collect()
(3) Anonymous function + placeholder parameter
intRDD.map(_ + 1).collect()
4 map operations on strings
stringRDD.map(x=>"fruit:" + x).collect()
5 filter operations on numbers
intRDD.filter(x => x < 3).collect()
intRDD.filter(_ < 3).collect()
6 filter operations on strings
stringRDD.filter(x => x.contains("ra")).collect()
7 distinct operation
intRDD.distinct().collect()
stringRDD.distinct().collect()
8 randomSplit operation
val sRDD = intRDD.randomSplit(Array(0.4,0.6))
sRDD.size
sRDD(0).collect()
sRDD(1).collect()
9 groupBy operation
val gRDD = intRDD.groupBy(x => {if(x % 2 == 0) "even" else "odd"}).collect()
gRDD(0)
gRDD(1)
10 Transformations over multiple RDDs
val intRDD1 = sc.parallelize(List(3,1,2,5,5))
val intRDD2 = sc.parallelize(List(5,6))
val intRDD3 = sc.parallelize(List(2,7))
(1) union (set union)
intRDD1.union(intRDD2).union(intRDD3).collect()
(intRDD1++ intRDD2++ intRDD3).collect()
(2) intersection (set intersection)
intRDD1.intersection(intRDD2).collect()
(3) subtract (set difference)
intRDD1.subtract(intRDD2).collect()
(4) cartesian (Cartesian product)
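The heading lists cartesian but gives no command; a minimal spark-shell example in the same style, using the intRDD1 and intRDD2 defined above:

```scala
// Cartesian product: every element of intRDD1 paired with every element of intRDD2.
// With 5 and 2 elements respectively, the result has 10 pairs.
intRDD1.cartesian(intRDD2).collect()
```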
11 Basic RDD actions
(1) Reading elements
intRDD.first()
intRDD.take(2)
intRDD.takeOrdered(3)
intRDD.takeOrdered(3)(Ordering[Int].reverse)
(2) Statistics
intRDD.stats()
intRDD.min()
intRDD.max()
intRDD.stdev()
intRDD.count()
intRDD.sum()
intRDD.mean()
12 Basic Key-Value transformations
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))
// List the keys
kvRDD1.keys.collect()
// List the values
kvRDD1.values.collect()
// Pairs whose key is less than 5
kvRDD1.filter{case(key,value) => key < 5}.collect()
// Pairs whose value is less than 5
kvRDD1.filter{case(key,value) => value < 5}.collect()
// Apply map to the values
kvRDD1.mapValues(x => x*x).collect()
// Sort by key; ascending (true) is the default
kvRDD1.sortByKey(true).collect()
kvRDD1.sortByKey().collect()
kvRDD1.sortByKey(false).collect()
// Sum the values that share the same key
kvRDD1.reduceByKey((x,y)=>x+y).collect()
// The same reduceByKey, written with the placeholder shorthand
kvRDD1.reduceByKey(_+_).collect()
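The semantics of reduceByKey can be illustrated with plain Scala collections (a local sketch, not a Spark API; Spark performs the same per-key reduction across partitions):

```scala
// Local equivalent of reduceByKey(_+_): group the pairs by key,
// then sum the values within each group.
val pairs = List((3,4), (3,6), (5,6), (1,2))
val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// reduced == Map(3 -> 10, 5 -> 6, 1 -> 2)
```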
13 Transformations over multiple Key-Value RDDs
val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))
val kvRDD2 = sc.parallelize(List((3,8)))
// Inner join and print; pairs with equal keys are matched
kvRDD1.join(kvRDD2).foreach(println)
// Left outer join
kvRDD1.leftOuterJoin(kvRDD2).foreach(println)
// Right outer join
kvRDD1.rightOuterJoin(kvRDD2).foreach(println)
// Set difference on key-value pairs
kvRDD1.subtract(kvRDD2).collect()
14 Key-Value actions
kvRDD1.first()
kvRDD1.take(2)
val kvFirst = kvRDD1.first
kvFirst._1
kvFirst._2
// Count the occurrences of each key
kvRDD1.countByKey()
// Collect the RDD as a Map
val KV=kvRDD1.collectAsMap()
KV(3)
KV(1)
// All values whose key is 3
kvRDD1.lookup(3)
kvRDD1.lookup(5)
15 Broadcast variables (shared read-only constants)
(1) Without a broadcast variable
val kvFruit=sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))
val fruitMap = kvFruit.collectAsMap()
val fruitIds=sc.parallelize(List(2,4,1,3))
val fruitNames=fruitIds.map(x=>fruitMap(x)).collect
(2) With a broadcast variable
val kvFruit=sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))
val fruitMap = kvFruit.collectAsMap()
val bcFruitMap=sc.broadcast(fruitMap)
val fruitIds=sc.parallelize(List(2,4,1,3))
val fruitNames=fruitIds.map(x=>bcFruitMap.value(x)).collect
16 Accumulators
val intRDD = sc.parallelize(List(3,1,2,5,5))
val total = sc.accumulator(0.0)
val num = sc.accumulator(0)
intRDD.foreach(i=>{
total += i
num += 1})
println("total=" + total.value + ", num=" + num.value)
val avg=total.value / num.value
17 RDD Persistence
(1) Create an example RDD
val intRddMemory = sc.parallelize(List(3,1,2,5,5))
intRddMemory.persist()
intRddMemory.unpersist()
(2) Set the storage level
import org.apache.spark.storage.StorageLevel
val intRddMemoryAndDisk = sc.parallelize(List(3,1,2,5,5))
intRddMemoryAndDisk.persist(StorageLevel.MEMORY_AND_DISK)
intRddMemoryAndDisk.unpersist()
18 Building WordCount with Spark
cd ~/workspace/
rm -R WordCount/
cd
mkdir -p ~/workspace/WordCount/data
cd ~/workspace/WordCount/data
gedit test.txt
Enter the following two lines in test.txt, save it, then start the shell with spark-shell --master local[*]:
Apple Apple Orange
Banana Grape Grape
(1) Read the local file
val textFile=sc.textFile("file:/home/zwxq/workspace/WordCount/data/test.txt")
(2) Split each line into words
val stringRDD=textFile.flatMap(line=>line.split(" "))
Storage levels for persist() (see section 17):
Option | Description
MEMORY_ONLY | Store the RDD as deserialized objects in the JVM. Partitions that do not fit in memory are recomputed each time they are needed.
MEMORY_AND_DISK | Store the RDD as deserialized objects in the JVM. Partitions that do not fit in memory are spilled to disk and read from there when needed.
MEMORY_ONLY_SER | Store the RDD as serialized objects (one byte array per partition). More space-efficient, but more CPU-intensive to read.
MEMORY_AND_DISK_SER | Like MEMORY_ONLY_SER, but partitions that do not fit in memory are stored on disk instead of being recomputed.
DISK_ONLY | Store the RDD partitions on disk only.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the corresponding level above, but each partition is replicated on two cluster nodes.
(3) Create Key-Value pairs and reduce by key
val countsRDD=stringRDD.map(word=>(word, 1)).reduceByKey(_+_)
(4) Save the result
countsRDD.saveAsTextFile("file:/home/zwxq/workspace/WordCount/data/output")
exit
ll
cd output
ll
(5) View the result
cat part-00000