- 1- What is a dependency (lineage)?
- RDD fault tolerance is achieved by recording the dependencies between RDDs
- A child RDD depends on its parent RDD
- 2- Why do we need dependencies?
- Because Spark is a parallel computing framework built on RDDs
- An RDD is an immutable, partitioned collection whose partitions can be computed in parallel
- Dividing dependencies into wide and narrow makes this explicit: across narrow dependencies, each RDD partition can be computed in parallel, independently of the others
- Across a wide dependency, however, data must be pulled from different partitions of the parent RDD, so the shuffle acts as a barrier: a child partition cannot be computed until the parent partitions it reads from are finished, and operations cannot be pipelined across the shuffle
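
The difference between the two cases can be sketched with a toy model in plain Python. This is not Spark code: an "RDD" here is just a list of partitions, and `narrow_map` and `wide_group_by_key` are hypothetical helpers standing in for a map-style and a groupByKey-style operation. The byte-sum partitioner is also an assumption, chosen only to keep the output deterministic.

```python
from collections import defaultdict

# A toy "RDD": a list of partitions, each a list of (key, value) records.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

def narrow_map(partitions, fn):
    """Narrow dependency: child partition i reads only parent partition i,
    so every partition could be processed by an independent task."""
    return [[fn(rec) for rec in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    """Wide dependency: any child partition may receive records from
    every parent partition, so all parents must be read first."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Hypothetical partitioner: byte sum of the key, modulo num_out.
            out[sum(key.encode()) % num_out][key].append(value)
    return [dict(d) for d in out]

doubled = narrow_map(parent, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(parent, 2)
print(doubled)
print(grouped)
```

In the narrow case each output partition is a function of exactly one input partition; in the wide case the values for key `a` come from both input partitions, which is why the grouping step has to wait for all of them.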
- 3- How many kinds of dependencies are there?
- Narrow dependency: NarrowDependency
- Wide dependency: ShuffleDependency
- 4- How to judge whether a dependency is a narrow dependency or a wide dependency?
- If each partition of the parent RDD is used by at most one partition of the child RDD, it is a narrow dependency
- If one partition of the parent RDD is used by multiple partitions of the child RDD, it is a wide dependency
Here is an interview question: if a partition of a child RDD depends on multiple parent RDDs, is the dependency wide or narrow?
1) It cannot be determined from that alone. The split between wide and narrow dependencies is based on whether a partition of the parent RDD is depended on by multiple partitions of the child RDD; if so, it is a wide dependency. Alternatively, judge from the shuffle perspective: a dependency that requires a shuffle is wide, such as a join whose inputs are not already partitioned the same way.
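
The rule behind that answer can be illustrated with a small sketch. It assumes a simplified model with a single parent RDD, where a dependency is described as a mapping from child partition index to the parent partition indices it reads; `classify` is a hypothetical helper, not a Spark API (Spark itself distinguishes the cases by dependency type, `NarrowDependency` versus `ShuffleDependency`).

```python
from collections import Counter

def classify(child_to_parents):
    """Return 'narrow' if every parent partition feeds at most one
    child partition, otherwise 'wide'."""
    uses = Counter(
        p for parents in child_to_parents.values() for p in set(parents)
    )
    return "wide" if any(n > 1 for n in uses.values()) else "narrow"

one_to_one = {0: [0], 1: [1]}            # map-like
coalesce_like = {0: [0, 1], 1: [2, 3]}   # child reads several parents
shuffle_like = {0: [0, 1], 1: [0, 1]}    # every parent feeds every child

print(classify(one_to_one))     # narrow
print(classify(coalesce_like))  # narrow
print(classify(shuffle_like))   # wide
```

The `coalesce_like` case is the interesting one: each child partition depends on more than one parent partition, yet the dependency is still narrow, because no parent partition is shared between children.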
- 5- Why does Spark define dependencies?
- To enable parallel computing: dependencies are the basis for dividing a job into stages
- To build lineage for RDD fault tolerance: if a partition's data is lost under narrow dependencies, only that partition needs to be recomputed, from the corresponding partition of the parent RDD
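
The fault-tolerance point can be sketched as follows, again as a plain-Python toy rather than Spark code. It assumes a chain of narrow, per-record transformations; `source`, `lineage`, and `compute_partition` are hypothetical names introduced only for this illustration.

```python
# Source data split into three partitions.
source = [[1, 2], [3, 4], [5, 6]]

# Lineage: the narrow transformations that produced the child, in order.
lineage = [lambda x: x + 1, lambda x: x * 10]

def compute_partition(index):
    """Rebuild one child partition by replaying the lineage on the
    matching source partition only."""
    records = source[index]
    for fn in lineage:
        records = [fn(r) for r in records]
    return records

child = [compute_partition(i) for i in range(len(source))]

child[1] = None                  # simulate losing one partition
child[1] = compute_partition(1)  # only partition 1 is recomputed
print(child)
```

Partitions 0 and 2 are never touched during recovery. With a wide dependency in the chain, rebuilding one child partition would instead need data from several parent partitions.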