Lab Manual - Week 4: Pair RDDs

Overview of Pair RDDs

A key-value pair is a common RDD element type, used frequently in grouping and aggregation operations.
Spark programs often use key-value RDDs (Pair RDDs) to perform aggregate computations.
An ordinary RDD stores elements of types such as Int or String, whereas a Pair RDD stores key-value pairs.

I. Transformation Operators

(1) map, flatMap, filter, sortBy, distinct

(2) Operations between RDDs: union, subtract, intersection

(3) Specific to Pair RDDs: keys, values, reduceByKey, mapValues, flatMapValues, groupByKey, sortByKey

(4) Operations between Pair RDDs: join, leftOuterJoin, rightOuterJoin

II. Action Operators

(1) count, first, take, collect, top, takeOrdered, foreach, reduce

(2) Specific to Pair RDDs: collectAsMap, countByKey
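To build intuition for the Pair RDD operators listed above, here is a plain-Python sketch (no Spark required, toy data chosen for illustration) of what groupByKey and reduceByKey do with the values of each key:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: collect every value of a key into one sequence
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# reduceByKey with addition: fold all values of a key into a single value
totals = {}
for k, v in pairs:
    totals[k] = totals.get(k, 0) + v

print(dict(groups))  # {'a': [1, 3], 'b': [2, 4]}
print(totals)        # {'a': 4, 'b': 6}
```

In Spark the same logic runs partition by partition and is merged across the cluster, which is why reduceByKey (which can combine locally before shuffling) is usually preferred over groupByKey for aggregations.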

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()

III. Lab Exercises

Experiment 1

Given scores =

[("Tom", "Spark", 80), ("Tom", "Hadoop", 88), ("Tom", "NoSQL", 90),

("Lucy", "Spark", 66), ("Lucy", "Hadoop", 98), ("Lucy", "NoSQL", 80)]

(1) Compute each student's total score. Expected output: [('Lucy', 244), ('Tom', 258)]

scores =\
[("Tom", "Spark", 80), ("Tom", "Hadoop", 88), ("Tom", "NoSQL", 90),
("Lucy", "Spark", 66), ("Lucy", "Hadoop", 98), ("Lucy", "NoSQL", 80)]
rdd = sc.parallelize(scores)
# rdd.map(lambda x:(x[0],x[2])).collect()
rdd.map(lambda x:(x[0],x[2])).reduceByKey(lambda x,y:x+y)\
.sortBy(lambda x:x[1],True).collect()
[('Lucy', 244), ('Tom', 258)]

(2) Find the course in which each student scored highest. Expected output: [('Lucy', ('Hadoop', 98)), ('Tom', ('NoSQL', 90))]

# Method 1: reduce per key, keeping the (course, score) pair with the larger score
rdd.map(lambda x:(x[0],(x[1],x[2])))\
.reduceByKey(lambda x,y:x if x[1]>y[1] else y).collect()
# Method 2: sort first, then extract
rdd.map(lambda x:((x[0],x[1]),x[2])).sortBy(lambda x:x[1],False).collect()
[(('Lucy', 'Hadoop'), 98),
 (('Tom', 'NoSQL'), 90),
 (('Tom', 'Hadoop'), 88),
 (('Tom', 'Spark'), 80),
 (('Lucy', 'NoSQL'), 80),
 (('Lucy', 'Spark'), 66)]
rdd.map(lambda x:((x[0],x[1]),x[2])).sortBy(lambda x:x[1],False)\
.map(lambda x:(x[0][0],(x[0][1],x[1]))).take(2)
[('Lucy', ('Hadoop', 98)), ('Tom', ('NoSQL', 90))]
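Method 1's reduceByKey keeps, for each key, whichever (course, score) pair has the larger score. A plain-Python sketch of that fold (same data, no Spark needed):

```python
data = [("Tom", ("Spark", 80)), ("Tom", ("Hadoop", 88)), ("Tom", ("NoSQL", 90)),
        ("Lucy", ("Spark", 66)), ("Lucy", ("Hadoop", 98)), ("Lucy", ("NoSQL", 80))]

best = {}
for k, v in data:
    # same combiner as lambda x, y: x if x[1] > y[1] else y
    if k not in best or v[1] > best[k][1]:
        best[k] = v

print(best)  # {'Tom': ('NoSQL', 90), 'Lucy': ('Hadoop', 98)}
```

Because the combiner is associative and commutative, Spark can apply it in any order across partitions and still produce the same per-key maximum.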

(3) Gather each student's scores into a single record. Expected output:

[('Lucy', [('Spark', 66), ('Hadoop', 98), ('NoSQL', 80)]),

('Tom', [('Spark', 80), ('Hadoop', 88), ('NoSQL', 90)])]

rdd.map(lambda x:[x[0],(x[1],x[2])]).groupByKey().mapValues(list).collect()
[('Tom', [('Spark', 80), ('Hadoop', 88), ('NoSQL', 90)]),
 ('Lucy', [('Spark', 66), ('Hadoop', 98), ('NoSQL', 80)])]

(4) Compute each student's average score. Expected output: [('Lucy', 81.3), ('Tom', 86.0)]

rdd.map(lambda x:(x[0],x[2]))\
.groupByKey().mapValues(lambda x:round((sum(x)/len(x)),1)).collect()
[('Tom', 86.0), ('Lucy', 81.3)]

Experiment 2

Given that Others\StudentData.csv is a student information table in which each line is one record, write programs for the following tasks:

(1) Remove the header row

rdd2 = sc.textFile(r'D:\juniortwo\spark\Spark2023-02-20\Others\StudentData.csv')
rdd2.take(2)
['age,gender,name,course,roll,marks,email',
 '28,Female,Hubert Oliveras,DB,02984,59,Annika Hoffman_Naoma [email protected]']
noHeader2 = rdd2.filter(lambda x:'age' not in x)  # keep only lines that do not contain the header field name
noHeader2.take(3)
['28,Female,Hubert Oliveras,DB,02984,59,Annika Hoffman_Naoma [email protected]',
 '29,Female,Toshiko Hillyard,Cloud,12899,62,Margene Moores_Marylee [email protected]',
 '28,Male,Celeste Lollis,PF,21267,45,Jeannetta Golden_Jenna [email protected]']

(2) Count the records with marks >= 60. Expected output: 517

noHeader2.filter(lambda x:int(x.split(',')[5])>=60).count()
517

(3) Compute the total marks of male and female students separately. Expected output: [('Female', 29636), ('Male', 30461)]

noHeader2.map(lambda x:(x.split(',')[1],int(x.split(',')[5])))\
.groupByKey().mapValues(lambda x:sum(x)).collect()
[('Female', 29636), ('Male', 30461)]

(4) Count the number of students enrolled in each course. Expected output: [('DB', 157),
('Cloud', 192),
('PF', 166),
('MVC', 157),
('OOP', 152),
('DSA', 176)]

list2 = noHeader2.map(lambda x:(x.split(',')[3],1)).countByKey()
list(list2.items())
[('DB', 157),
 ('Cloud', 192),
 ('PF', 166),
 ('DSA', 176),
 ('MVC', 157),
 ('OOP', 152)]

(5) Find the highest mark in each course. Expected output: [('DB', 98), ('Cloud', 99), ('PF', 99), ('MVC', 99), ('OOP', 99), ('DSA', 99)]

# convert marks to int so max compares numerically rather than lexicographically
noHeader2.map(lambda x:(x.split(',')[3],int(x.split(',')[5]))).reduceByKey(max).collect()
[('DB', 98),
 ('Cloud', 99),
 ('PF', 99),
 ('MVC', 99),
 ('OOP', 99),
 ('DSA', 99)]

(6) Compute the average mark of each course. Expected output:

[('DB', 59.044585987261144),
('Cloud', 59.598958333333336),
('PF', 59.83734939759036),
('MVC', 61.05095541401274),
('OOP', 58.6578947368421),
('DSA', 62.21590909090909)]

Two ways to compute the averages:

noHeader2.map(lambda x : x.split(',')) \
           .map(lambda x : (x[3], (int(x[5]), 1))) \
           .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
           .mapValues(lambda x : (x[0] / x[1])).collect()
noHeader2.map(lambda x:(x.split(',')[3],int(x.split(',')[5])))\
.groupByKey().mapValues(lambda x:sum(x)/len(x)).collect()
[('DB', 59.044585987261144),
 ('Cloud', 59.598958333333336),
 ('PF', 59.83734939759036),
 ('MVC', 61.05095541401274),
 ('OOP', 58.6578947368421),
 ('DSA', 62.21590909090909)]
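The first method above is the classic (sum, count) accumulator pattern: each mark becomes a (mark, 1) pair, pairs are merged component-wise, and a final division yields the mean. A plain-Python sketch of the same fold on made-up marks:

```python
marks = [("DB", 59), ("DB", 61), ("Cloud", 70), ("Cloud", 50), ("Cloud", 60)]

acc = {}
for course, m in marks:
    s, c = acc.get(course, (0, 0))
    acc[course] = (s + m, c + 1)   # merge (sum, count) pairs component-wise

avgs = {course: s / c for course, (s, c) in acc.items()}
print(avgs)  # {'DB': 60.0, 'Cloud': 60.0}
```

The accumulator merge is associative, which is why reduceByKey can apply it safely; the groupByKey variant in the second method is shorter but materializes every value of a key before dividing.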

(7) Compute the average age of male and female students, rounded to two decimal places. Expected output: [('Female', 28.49), ('Male', 28.52)]

noHeader2.map(lambda x:(x.split(',')[1],int(x.split(',')[0])))\
.groupByKey().mapValues(lambda x:round(sum(x)/len(x),2)).collect()
[('Female', 28.49), ('Male', 28.52)]

Experiment 3

The SalesOrders directory contains the order master table, and the SalesOrder_items directory contains the order detail (child) table. (Note: the master and detail tables have a one-to-many relationship, as in a relational database.)

The fields of the order master table are, in order: order ID, order date, customer ID, order status.

The order detail table records the products contained in each order. Its fields are, in order: line number, order ID (a foreign key referencing the master table's order ID), product ID, quantity, line total, unit price.
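A Pair RDD join matches records by key and emits one output pair per matching combination, which is exactly the one-to-many master/detail relationship described above. A plain-Python sketch with hypothetical order data:

```python
# (order ID, status) -- master table, one row per order
orders = [("1", "CLOSED"), ("2", "PENDING")]
# (order ID, line total) -- detail table, possibly several rows per order
items = [("1", 299.98), ("2", 199.99), ("2", 250.0)]

# join on order ID: each master row pairs with every matching detail row
joined = [(k, (ov, iv)) for k, ov in orders for k2, iv in items if k == k2]
print(joined)
# [('1', ('CLOSED', 299.98)), ('2', ('PENDING', 199.99)), ('2', ('PENDING', 250.0))]
```

Order "2" has two item lines, so it appears twice in the joined result; task (6) below relies on this behavior to attach the customer ID to every item line before summing.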

(1) Count the orders whose status is CLOSED. Expected output: 7556

rdd01 = sc.textFile(r'D:\juniortwo\spark\Spark2023-02-20\SalesOrders')
rdd02 = sc.textFile(r'D:\juniortwo\spark\Spark2023-02-20\SalesOrder_items')
rdd01.take(2)
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT']
rdd01.filter(lambda x:x.split(',')[3]=='CLOSED').count()
7556

(2) Count the orders in each status. Expected output:

[('CLOSED', 7556), ('PENDING_PAYMENT', 15030), ('COMPLETE', 22899), ('PROCESSING', 8275), ('PAYMENT_REVIEW', 729), ('PENDING', 7610), ('ON_HOLD', 3798), ('CANCELED', 1428), ('SUSPECTED_FRAUD', 1558)]

list3 =rdd01.map(lambda x:(x.split(',')[3],1)).countByKey()
list(list3.items())
[('CLOSED', 7556),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8275),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7610),
 ('ON_HOLD', 3798),
 ('CANCELED', 1428),
 ('SUSPECTED_FRAUD', 1558)]

(3) Compute the total amount of each order and return the 5 orders with the largest totals. Expected output:

[(68703, 3449.91), (68724, 2859.89), (68858, 2839.91), (68809, 2779.86), (68766, 2699.9)]

(Note: the two elements of each tuple in the output are (order ID, total amount).)

rdd02.take(3)
['1,1,957,1,299.98,299.98', '2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0']
rdd02.map(lambda x:(int(x.split(',')[1]),float(x.split(',')[4])))\
.groupByKey().mapValues(lambda x:round(sum(x),2))\
.sortBy(lambda x:x[1],False).take(5)
[(68703, 3449.91),
 (68724, 2859.89),
 (68858, 2839.91),
 (68809, 2779.86),
 (68766, 2699.9)]

(4) Count the total number of orders placed in July and August 2013. Expected output: 7213

from operator import add
rdd01.map(lambda x:(x.split(',')[0],x.split(',')[1]))\
.filter(lambda x:x[1].startswith('2013-07') or x[1].startswith('2013-08'))\
.map(lambda x:(x[0],1)).reduceByKey(add)\
.map(lambda x:x[1]).reduce(add)
7213
# Method 2: filter the raw lines directly and count
rdd01.filter(lambda x:'2013-07' in x or '2013-08' in x).count()

(5) Find the customer IDs that placed orders in both 2013 and 2014.

Expected output: ['256', '12111', '11318', '7130', '2911', '5657', '9842', ... (truncated)]

rdd01.map(lambda x:(x.split(',')[1],x.split(',')[2]))\
.filter(lambda x:x[0].startswith('2013')).map(lambda x:x[1]).take(5)
['11599', '256', '12111', '8827', '11318']
first = rdd01.map(lambda x:(x.split(',')[1],x.split(',')[2]))\
.filter(lambda x:x[0].startswith('2013')).map(lambda x:x[1])
second = rdd01.map(lambda x:(x.split(',')[1],x.split(',')[2]))\
.filter(lambda x:x[0].startswith('2014')).map(lambda x:x[1])
first.intersection(second).take(10)
['256', '12111', '11318', '7130', '2911', '5657', '9842', '9488', '333', '656']

(6) Compute each customer's total spending, sort by total spending in descending order, and output the top 5.

Expected output: [(791.0, 10524.17), (9371.0, 9299.03), (8766.0, 9296.14), (1657.0, 9223.71), (2641.0, 9130.92)]

(Note: the two elements of each tuple in the output are (customer ID, total spending).)

# rdd01.map(lambda x:(x.split(',')[0],x.split(',')[2])).take(10)
rd1 = rdd01.map(lambda x:(x.split(',')[0],x.split(',')[2]))
# rdd02.map(lambda x:(x.split(',')[1],x.split(',')[4])).take(10)
rd2 = rdd02.map(lambda x:(x.split(',')[1],x.split(',')[4]))
rdd = rd1.join(rd2)
rdd.map(lambda x:(float(x[1][0]),float(x[1][1])))\
.groupByKey().mapValues(lambda x:round(sum(x),2))\
.sortBy(lambda x:x[1],False).take(5)
[(791.0, 10524.17),
 (9371.0, 9299.03),
 (8766.0, 9296.14),
 (1657.0, 9223.71),
 (2641.0, 9130.92)]


Reposted from blog.csdn.net/m0_52331159/article/details/130138634