Spark Programming Guide (Part 5)

Copyright notice: when reposting, please credit the original article at https://blog.csdn.net/soul_code/article/details/78091865

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

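As a minimal sketch of the API (the input path and the lineLengths pipeline are illustrative, not taken from this guide), caching an RDD that is reused by two actions looks like this:

val lines = sc.textFile("data.txt")          // "data.txt" is a placeholder input path
val lineLengths = lines.map(_.length)
lineLengths.cache()                          // equivalent to persist() with the default storage level
val totalLength = lineLengths.reduce(_ + _)  // first action: partitions are computed and cached
val maxLength = lineLengths.max()            // later actions reuse the cached partitions

Here sc is an existing SparkContext; without the cache() call, the second action would recompute lineLengths from the text file.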
In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:

Storage Level    Meaning
MEMORY_ONLY    Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
MEMORY_AND_DISK    Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they are needed.
MEMORY_ONLY_SER (Java and Scala)    Store the RDD as serialized Java objects (one byte array per partition). This is generally much more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala)    Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they are needed.
DISK_ONLY    Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.    Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)    Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
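For illustration (rdd stands for any previously created RDD), a non-default level is selected by passing one of the constants above to persist() instead of calling cache():

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)      // spill partitions that do not fit in memory to disk
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in-memory storage (Java and Scala only)
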
Which Storage Level to Choose?

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

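A sketch of that combination (the application name and input path are placeholders): Kryo is enabled through the spark.serializer configuration property and paired with a serialized storage level.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("persistence-example")                                        // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")    // use a fast serializer
val sc = new SparkContext(conf)

val lineLengths = sc.textFile("data.txt").map(_.length)   // placeholder pipeline
lineLengths.persist(StorageLevel.MEMORY_ONLY_SER)         // serialized, space-efficient caching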
Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

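For example (again with rdd standing in for an existing RDD), switching to a replicated level only changes the constant passed to persist():

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY_2)   // each cached partition is also replicated on a second node
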
Removing Data

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

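A short sketch of manual removal, where rdd is an RDD that was cached earlier:

rdd.unpersist()   // drop the RDD's cached blocks; an optional blocking flag controls whether the call waits for removal to finish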