Question 31: What is a RDD Lineage Graph?
Answer: A RDD Lineage Graph (aka RDD operator graph) is a graph of the parent RDD of a RDD. It is built as a result of applying transformations to the RDD. A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
Question 32: Please tell me, how execution starts and end on RDD or Spark Job
Answer: Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or reference cached data) and ends with the RDD that produces the result of the action that has been called to execute.
Question 33: Give example of transformations that do trigger jobs
Answer: There are a couple of transformations that do trigger jobs, e.g. sortBy , zipWithIndex , etc.
Question 34: How many type of transformations exist?
Answer: There are two kinds of transformations:
- narrow transformations
- wide transformations
Question 35: What is Narrow Transformations?
Answer: Narrow transformations are the result of map, filter and such that is from the data from a single partition only, i.e. it is self-sustained.
An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions used to calculate the result. Spark groups narrow transformations as a stage.