Getting Started with PySpark Basics (3): RDD Persistence

Persistence of RDD

RDD data is intermediate (process) data: it exists only while a computation is running, which is why persistence is sometimes needed;

RDDs are computed iteratively from one another, and once a new RDD has been produced the old one is discarded. This keeps resource usage efficient: old RDDs are cleaned out of memory promptly to make room for subsequent calculations;

For example, suppose rdd3 is used twice, once to build rdd4 and once to build rdd5:

rdd3 is first used when building rdd4, and once rdd4 has been built rdd3 no longer exists. When rdd3 is needed a second time, it has to be recomputed from rdd by following the RDD lineage, so that rdd5 can use it;

RDD caching

An RDD can be kept in memory or on disk through caching, so that it does not have to be rebuilt every time it is used;

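Common cache APIs are as follows:

rdd.cache(): cache in memory only
rdd.persist(StorageLevel.MEMORY_ONLY): memory only
rdd.persist(StorageLevel.DISK_ONLY): disk only
rdd.persist(StorageLevel.MEMORY_AND_DISK): memory first, then disk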

For general use, rdd.persist(StorageLevel.MEMORY_AND_DISK) is recommended: data is cached in memory first, and partitions that do not fit are written to disk;

On a cluster with relatively little memory, cache to disk only (StorageLevel.DISK_ONLY);

API for manually clearing the cache: rdd.unpersist()

Characteristics of the cache: the cache is considered unsafe, so the lineage between RDDs is preserved

Cached data is at risk of being lost: a cache in memory can be cleared by a power outage or a shortage of space, and a cache on disk can be lost through disk failure. The lineage is therefore kept so that lost data can be recomputed;

How is the RDD cache saved?

Distributed storage: each partition of the RDD saves its own data in the memory and on the disk of the Executor where it resides

Checkpoint of RDD

CheckPoint is also a mechanism for saving RDD, but only supports disk storage;

Compared with caching, CheckPoint is considered safe, so it does not keep the lineage between RDDs;

CheckPoint storage:

Centralized storage: CheckPoint writes the data of every partition to one shared location, typically on HDFS;

API:

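# Set the checkpoint directory; in local mode a local file system path can be used
# In cluster mode this must be an HDFS path
sc.setCheckpointDir(path)
# Mark the RDD for checkpointing
rdd.checkpoint()
# Release any cached copy of the RDD (this does not delete the checkpoint files)
rdd.unpersist()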

Comparison of Cache and CheckPoint

Performance comparison between Cache and CheckPoint:

Cache is faster: storage is distributed, every Executor writes its own partitions in parallel, and the data can be kept in memory (at the cost of occupying that memory)

CheckPoint is slower because it writes to centralized storage and involves network IO, but storing the data on HDFS, which keeps multiple replicas, is safer

Note: the Cache and CheckPoint APIs are not actions. They only take effect when an action operator is executed afterwards;

Origin blog.csdn.net/qq_51235856/article/details/130470508