PySpark RDD: how to use reduce, reduceByKey, and reduceByKeyLocally

Other 2019-03-17 10:31:09

1. reduce

Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.

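from operator import add  # sc is assumed to be an existing SparkContext

# Sum the numbers 1..5, spread over 2 partitions
a = sc.parallelize([1, 2, 3, 4, 5], 2).reduce(add)
print(a)  # 15

# Map each of the ten elements to 1 and sum them, which counts the elements
a = sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
print(a)  # 10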
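
The requirement that the operator be commutative and associative follows from partitions being reduced locally before the partial results are combined. The sketch below imitates that in plain Python, without Spark. The helper names `split_into_partitions` and `partitioned_reduce` are hypothetical, and the chunking rule is an assumption for illustration, not Spark's actual slicing.

```python
from functools import reduce
from operator import add, sub

def split_into_partitions(data, num_partitions):
    """Hypothetical helper: split a list into contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

def partitioned_reduce(data, op, num_partitions):
    """Reduce each non-empty partition locally, then reduce the partial results."""
    parts = split_into_partitions(data, num_partitions)
    partials = [reduce(op, part) for part in parts if part]
    return reduce(op, partials)

data = [1, 2, 3, 4, 5]

# Addition is associative and commutative: the partition count does not matter.
sum_2 = partitioned_reduce(data, add, 2)   # (1+2+3) + (4+5)
sum_3 = partitioned_reduce(data, add, 3)   # (1+2) + (3+4) + 5

# Subtraction is neither: the answer changes with the partitioning.
diff_1 = partitioned_reduce(data, sub, 1)  # 1-2-3-4-5
diff_2 = partitioned_reduce(data, sub, 2)  # (1-2-3) - (4-5)

print(sum_2, sum_3, diff_1, diff_2)
```

With addition both runs give 15, while subtraction gives -13 with one partition and -3 with two, which is why such an operator is unsafe to pass to reduce.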

2. reduceByKey(func, numPartitions=None, partitionFunc=)
Merge the values for each key using an associative and commutative reduce function.

This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.

Output will be partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified. Default partitioner is hash-partition.
In short, the values are combined per key, and the result is an RDD of (key, value) pairs.
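
To make the hash-partitioning statement concrete, here is a minimal plain-Python sketch of how keys could be assigned to output partitions. The functions `partition_index` and `hash_partition` are hypothetical, and Python's built-in `hash` is used only for illustration (PySpark's default `partitionFunc` is its own `portable_hash`). Integer keys are used so that the assignment is the same on every run.

```python
def partition_index(key, num_partitions):
    """Hypothetical hash partitioner: hash the key, then take it modulo the partition count."""
    return hash(key) % num_partitions

def hash_partition(pairs, num_partitions):
    """Place every (key, value) pair in the bucket chosen by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_index(key, num_partitions)].append((key, value))
    return buckets

pairs = [(1, 'x'), (2, 'y'), (4, 'z'), (5, 'w'), (1, 'v')]
buckets = hash_partition(pairs, 3)

for index, bucket in enumerate(buckets):
    print(index, bucket)
```

All pairs that share a key land in the same bucket, so the merge for that key can be completed inside a single output partition.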

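def add1(a, b):
    # Trace every merge by printing the two values being combined
    print("*" * 55)
    print(a)
    print(b)
    return a + b + 100  # an extra 100 is added once per merge

# sc is assumed to be an existing SparkContext
rdd = sc.parallelize([('a', 1), ('b', 100), ('a', 300), ('b', 3), ('a', 200)])

a = sorted(rdd.reduceByKey(add1).collect())
print(a)  # [('a', 701), ('b', 203)]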
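
The combiner-style behaviour can be traced without Spark. The sketch below is a plain-Python approximation, not Spark's implementation: `local_combine` and `reduce_by_key_sketch` are hypothetical names, and `add_plus_100` repeats the arithmetic of add1 without the prints. It first merges values inside each partition, then merges the partial results per key.

```python
from collections import defaultdict
from functools import reduce

def add_plus_100(a, b):
    # Same arithmetic as add1, minus the tracing output
    return a + b + 100

def local_combine(pairs, func):
    """Merge all values that share a key within one list of (key, value) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

def reduce_by_key_sketch(partitions, func):
    """Stage 1: combine inside each partition. Stage 2: merge the partial results per key."""
    partials = [pair for part in partitions for pair in local_combine(part, func)]
    return sorted(local_combine(partials, func))

pairs = [('a', 1), ('b', 100), ('a', 300), ('b', 3), ('a', 200)]

one_partition = reduce_by_key_sketch([pairs], add_plus_100)
two_partitions = reduce_by_key_sketch([pairs[:2], pairs[2:]], add_plus_100)

print(one_partition)
print(two_partitions)
```

Key 'a' has three values, so the function runs twice for it and contributes 200 on top of 1 + 300 + 200. Key 'b' has two values and is merged once. The totals are the same for either partitioning because a key with n values always goes through n - 1 merges.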


3. reduceByKeyLocally(func)
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
This works like reduceByKey, except that it returns a Python dictionary to the driver instead of an RDD.

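def add1(a, b):
    # Trace every merge by printing the two values being combined
    print("*" * 55)
    print(a)
    print(b)
    return a + b

# sc is assumed to be an existing SparkContext
rdd = sc.parallelize([('a', 1), ('b', 100), ('a', 300), ('b', 3), ('a', 200)])
a = rdd.reduceByKeyLocally(add1)
print("%" * 33)
print(a)        # a dict holding 'a': 501 and 'b': 103
print(type(a))  # <class 'dict'>

print(a.items())
print(sorted(a.items()))  # [('a', 501), ('b', 103)]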
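
The "returned to the master as a dictionary" part can be pictured as one dictionary per partition, folded into a single dictionary on the driver. The following plain-Python sketch runs without Spark and uses hypothetical helpers (`reduce_partition`, `merge_dicts`). The split into two partitions is an assumption chosen for illustration.

```python
from functools import reduce

def add_values(a, b):
    return a + b

def reduce_partition(partition, func):
    """Build one dict per partition, merging values that share a key."""
    partial = {}
    for key, value in partition:
        partial[key] = func(partial[key], value) if key in partial else value
    return partial

def merge_dicts(left, right, func):
    """Fold the right-hand dict into a copy of the left-hand one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Assumed partitioning of the example data
partitions = [[('a', 1), ('b', 100)], [('a', 300), ('b', 3), ('a', 200)]]

partials = [reduce_partition(part, add_values) for part in partitions]
result = reduce(lambda left, right: merge_dicts(left, right, add_values), partials)

print(partials)
print(sorted(result.items()))
```

Because the final dictionary lives entirely in driver memory, this style suits results with a modest number of distinct keys.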


Reposted from blog.csdn.net/weixin_40161254/article/details/87950595
