SparkCore Advanced 02

一、Review

Every requirement is some variation of wordcount, so make sure you have wordcount down cold.

二、Spark Core Concepts

1、Glossary (http://spark.apache.org/docs/2.3.2/cluster-overview.html)

(1)Application: a driver program + its executors.

A main function that creates a SparkContext counts as an Application.

spark-shell is also an Application, because it initializes a SparkContext while starting up.

(2)Application jar: the jar you submit.

(3)Driver program: a process that runs the main function and creates the SparkContext. In short, the main function (a minimal sketch appears after this list).

(4)Cluster manager: an external service for acquiring resources on the cluster (e.g. Mesos, YARN). In short, whatever --master points at: local, yarn, and so on.

(5)Deploy mode: distinguishes where the driver runs. "client": locally; "cluster": inside a container on the cluster.

(6)Worker node: when running on YARN, a worker node corresponds to a NodeManager.

(7)Executor: a process launched for an application on a worker node; it runs tasks. When running on YARN, an executor corresponds to a container.

For example:

application1: 1 driver + 10 executors

application2: 1 driver + 10 executors

Normally the two sets of executors have nothing to do with each other.

(8)Task: a unit of work that executes code.

(9)Job: can be understood as whatever gets triggered by an action.

(10)Stage: everything up to a shuffle is one stage.
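To make item (3) concrete, here is a minimal sketch of a driver whose main function creates the SparkContext. The object name and the hard-coded master are made up for illustration; normally --master is supplied at spark-submit time.

import org.apache.spark.{SparkConf, SparkContext}

object HBinzApp {  // hypothetical name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HBinzApp").setMaster("local[2]")
    val sc = new SparkContext(conf)  // this process is the driver
    val lines = sc.textFile("/opt/data/HBinzTest.txt")
    println(lines.count())           // the action sends tasks to the executors
    sc.stop()
  }
}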

2、Reading a file and running an action

val lines = sc.textFile("/opt/data/HBinzTest.txt")

lines.count

(1)To work out how many tasks the read produces, look at the source code.

(2)textFile has a default minimum number of partitions; you can also pass a value yourself, and an explicit value wins.

(3)The default is a math.min (the minimum of the default parallelism and 2).

(4)defaultParallelism comes from the task scheduler.

(5)Ctrl+Alt+B on defaultParallelism and pick the local-mode implementation.

(6)spark.default.parallelism

Step into the getInt method:

you can see that if the key is set its value is used; otherwise getOrElse falls back to the n in local[n], here the 2 in local[2].

Since it is not set, the default of 2 is used.

Summary:

application => n jobs => n stages => n tasks

With local[1] you get 1; with anything above 2 you still get 2, because the result is the minimum of 2 and the n in local[n] (see the check below).
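As a quick sanity check in spark-shell, assuming it was started with local[2] as in these notes (defaultParallelism and defaultMinPartitions are public methods on SparkContext):

sc.defaultParallelism    // 2 under local[2]
sc.defaultMinPartitions  // math.min(defaultParallelism, 2) = 2
sc.textFile("/opt/data/HBinzTest.txt").getNumPartitions  // at least 2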

(7)Tasks versus the records in the text file

The task with id 0 read 2 records, 46 B in total.

The task with id 1 read 1 record, 23 B in total.

(8)Passing minPartitions explicitly to textFile

val lines = sc.textFile("/opt/data/HBinzTest.txt",3)

lines.count
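A quick way to confirm the explicit value took effect is the standard getNumPartitions method; on this tiny file the count should match what was asked for, though Hadoop's input splitting can occasionally round it up.

lines.getNumPartitions  // expected: 3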

3、Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

In other words: a Spark application runs as an independent set of processes on the cluster, coordinated by the SparkContext in the main program (also called the driver program).

Specifically, to run on a cluster the SparkContext connects to some type of cluster manager (Spark's own standalone manager, Mesos or YARN), which allocates resources across applications. Once connected, Spark acquires executors on the worker nodes; an executor is a process that runs computations and stores data for your application. Next, the driver sends your application code (a jar, or Python files) to the executors, and finally the SparkContext sends tasks to the executors to run.

There are several useful things to note about this architecture:

1.Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.

2.Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).

3.The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.

4.Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

(1)Each application gets its own executor processes; they stay up for the whole application and run tasks in multiple threads. Benefit: applications are isolated from each other, each driver schedules its own tasks, and tasks from different applications run in different JVMs. It also means, however, that data cannot be shared across Spark applications except by writing it to external storage.

ps: Alluxio (http://www.alluxio.org/, started by Haoyuan Li at Berkeley) lets your applications share data with other storage systems at memory speed.

(2)Spark is agnostic to the underlying cluster manager; you do not need to care what it runs on underneath, just set --master accordingly.

(3)The driver listens for and accepts connections coming from its executors, i.e. driver and executors have a communication channel (and an executor that dies has to be restarted).

(4)The driver and executors need to stay reachable from each other over the network (ideally on the same LAN).

In short:

An application = one driver + N executors. The driver's main method creates the SparkContext; --master at submit time says which cluster to run on; resources are then obtained from the worker nodes, and the driver and executors keep communicating. Each action triggers a job, the job is split into stages, the executors run the resulting tasks in multiple threads, and the results can be cached.

4、Cluster Manager Types

(1)Standalone

(2)Apache Mesos

(3)Hadoop YARN ------ important

(4)Kubernetes ----- important

5、Monitoring

WEB UI--http://192.168.137.131:4040/

6、Job 

An action triggers a job; the job is split into stages, and each stage into tasks.
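A small spark-shell illustration of the action-to-job mapping; each action below shows up as its own job under the Jobs tab of the web UI listed above.

val lines = sc.textFile("/opt/data/HBinzTest.txt")
lines.count    // first action  -> one job
lines.collect  // second action -> another job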

三、What cache is for

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:

One of Spark's most important capabilities is persisting (caching) a dataset in memory across operations. When you persist an RDD, each node stores in memory the partitions it computes (one task per partition) and reuses them in later actions on that dataset, or on datasets derived from it. Caching is a key tool for iterative algorithms and for fast interactive use.

You mark an RDD for persistence with persist() or cache(). It is only materialized the first time an action runs on it; only then is the data kept in memory on the nodes (executors). Cache fault tolerance is covered later.

In addition, each persisted RDD can be stored at a different storage level, so you can arrange it however you like; storing it serialized saves space. cache() is just shorthand for the default level, StorageLevel.MEMORY_ONLY.

1、cache

(1)cache is lazy, like a transformation: nothing happens until an action runs.

val lines = sc.textFile("/opt/data/HBinzTest.txt")

lines.cache

Nothing has executed yet; no action has been triggered.

lines.count

After it runs you can see the data sitting in memory (cache defaults to memory), and it is larger than the original file.

(2)Count again
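Re-running the same action (nothing new, just triggering the read again, this time against the cached blocks):

lines.count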

This time the amount read matches what was stored in memory earlier, which is larger than the original file.

2、The difference between persist and cache

(1)Source code

Under the hood, cache just calls persist, and persist defaults to memory-only storage, so cache is simply a convenient shorthand.
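Paraphrasing the relevant lines of RDD.scala in Spark 2.3.x: cache() delegates to persist(), which in turn defaults to StorageLevel.MEMORY_ONLY, so the two calls below are equivalent.

import org.apache.spark.storage.StorageLevel

val a = sc.textFile("/opt/data/HBinzTest.txt").cache()
val b = sc.textFile("/opt/data/HBinzTest.txt").persist(StorageLevel.MEMORY_ONLY)  // same effect as cache()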

(2)StorageLevel

Parameters of StorageLevel's primary constructor:

_useDisk, _useMemory, _useOffHeap, _deserialized and _replication.

How the companion object builds the predefined levels:
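// argument order: StorageLevel(_useDisk, _useMemory, _useOffHeap, _deserialized, _replication = 1)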


val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

(3)Replication count: the _2 variants keep two replicas.

(4)unpersist() is eager, i.e. it behaves like an action and takes effect immediately.

lines.unpersist()

The Storage tab empties right away, which shows it has taken effect.

(5)Persisting in memory in serialized form

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/opt/data/HBinzTest.txt")

lines.persist(StorageLevel.MEMORY_ONLY_SER)

lines.count

The data stored in memory is now noticeably smaller, a clear optimization.

四、How to choose a storage level

1、What the options are

1)MEMORY_ONLY

Stores the RDD as deserialized Java objects in the JVM. If it does not all fit in memory, only some partitions are cached and the rest are recomputed on the fly each time they are needed.

2)MEMORY_AND_DISK (memory plus disk)

Stores the RDD as deserialized Java objects in the JVM. Partitions that do not fit in memory are spilled to disk.

3)MEMORY_ONLY_SER (serialized in memory)

Stores the RDD serialized, one byte array per partition. Benefit: it saves space, especially with a fast serializer (Java/Kryo), at the cost of more CPU (a serializer config sketch follows below).
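As a sketch of how the serializer is chosen, the standard spark.serializer setting can be switched to Kryo when building the SparkConf; the app name below is made up.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cache-demo")  // hypothetical
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// RDDs persisted with the *_SER levels in this application are then serialized with Kryo.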

2、How to choose a StorageLevel

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

Spark's storage levels trade memory usage against CPU efficiency; pick accordingly. Recommendations, in decreasing order of preference:

1)If your RDDs fit comfortably with the default (MEMORY_ONLY), just keep the default.

2)If memory really is not enough, serialize them (MEMORY_ONLY_SER with a fast serialization library).

3)Do not spill to disk; it is expensive, so avoid it where possible (recomputing a partition is often as fast as reading it back from disk).

4)If you want fast fault recovery, use the replicated levels (see the sketch after this list). All storage levels give full fault tolerance by recomputing lost data, but the replicated ones let tasks keep running on the RDD without waiting for a lost partition to be recomputed.

5)Cached data that sits unused for a long time eventually gets evicted, but it is better to call unpersist() yourself once you are done with it.
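Item 4 in practice, using the replicated variant of the default level (this matters on a multi-executor cluster; in local mode there is nowhere to put the second copy):

import org.apache.spark.storage.StorageLevel

val served = sc.textFile("/opt/data/HBinzTest.txt").persist(StorageLevel.MEMORY_ONLY_2)
served.count  // each cached partition is requested to be stored on two executors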

3、Lineage

textFile => xx => yy => zz
         map    filter    map  .....

Lineage describes how an RDD is computed from its parent RDD(s); toDebugString prints it (see the sketch after this list).

1)Fault tolerance: if a partition of an RDD is lost, it can be recomputed from the corresponding partitions of its parent RDD.

2)cache has the same kind of fault tolerance.
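A quick way to look at a lineage in the shell is the standard toDebugString method; the comment describes roughly what the output contains rather than an exact transcript.

val lines = sc.textFile("/opt/data/HBinzTest.txt")
lines.map(_.length).toDebugString  // prints the chain of parent RDDs down to the HadoopRDD that read the file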

4、Dependency

In the dependency diagrams, each outer box is an RDD and each block inside it is a partition.

1)Narrow dependency

Definition:

Each partition of the parent RDD is used by at most one partition of the child RDD.

Characteristics:

A join with co-partitioned inputs is a narrow dependency; if the inputs are not co-partitioned, it is a wide dependency.

Such a join is pipelined, with no shuffle, and fault tolerance is strong.

2)Wide dependency

Definition:

A partition of the parent RDD is used by multiple partitions of the child RDD.

The *ByKey operators

join not co-partitioned

Characteristics:

There is a shuffle, and a new stage is created.

Example:

val lines = sc.textFile("/opt/data/HBinzTest.txt")

lines.flatMap(_.split("\t")).map(x=>(x,1)).reduceByKey(_+_).collect

reduceByKey adds one more stage.
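One way to see the extra stage without the web UI is toDebugString; the ShuffledRDD entry (and the indentation change) marks the boundary introduced by reduceByKey.

val wc = lines.flatMap(_.split("\t")).map(x => (x, 1)).reduceByKey(_ + _)
wc.toDebugString  // shows a ShuffledRDD on top of the map/flatMap chain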

Summary:

C => D: narrow dependency

D => F (UNION): narrow dependency

A => B (GROUPBY): wide dependency

B + F => G (JOIN): wide dependency

ps: the black boxes in the figure mark partitions that are cached in memory.

In general: 1 shuffle produces 2 stages, 2 shuffles produce 3 stages.
