- What is checkpointing?
Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD's actual data to a reliable distributed file system (typically HDFS) or to the local file system.
You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD is then saved to a file inside the checkpoint directory and all references to its parent RDDs are removed. checkpoint() has to be called before any job has been executed on the RDD. A minimal sketch of the sequence is shown below.
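A minimal sketch, assuming a local master and an illustrative checkpoint directory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

    // A checkpoint directory must be set before checkpoint() is called;
    // on a cluster this would normally be an HDFS path.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)

    // Mark the RDD for checkpointing before any job runs on it.
    rdd.checkpoint()

    // The first action computes the RDD, writes the checkpoint files,
    // and truncates the lineage back to the checkpointed data.
    println(rdd.count())

    sc.stop()
  }
}
```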
- What do you mean by Dependencies in RDD lineage graph?
Ans: A dependency is the connection between an RDD and its parent RDDs, created when a transformation is applied. Spark distinguishes narrow dependencies (e.g. map, filter) from shuffle (wide) dependencies (e.g. reduceByKey), as illustrated in the sketch below.
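A small sketch that inspects those dependencies; the toy RDDs are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

    val parent  = sc.parallelize(1 to 100)
    val mapped  = parent.map(_ + 1)                               // narrow dependency on parent
    val reduced = mapped.map(x => (x % 10, x)).reduceByKey(_ + _) // shuffle (wide) dependency

    // Every RDD exposes its dependencies; toDebugString prints the whole lineage graph.
    println(reduced.dependencies)
    println(reduced.toDebugString)

    sc.stop()
  }
}
```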
- Which script do you use to launch a Spark application?
Ans: You use the spark-submit script to launch a Spark application, i.e. to submit the application to a Spark deployment environment.
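A typical invocation might look like the following; the class name, jar path, and master URL are placeholders:

```bash
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  target/my-app.jar arg1 arg2
```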
- Define Spark architecture
Ans: Spark uses a master/worker architecture. The driver talks to a single coordinator, called the master, which manages the workers on which executors run. The driver and the executors each run in their own Java processes.
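A minimal sketch of how the driver names its coordinator; the host name is a placeholder and the commented alternatives show other common master URLs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureExample {
  def main(args: Array[String]): Unit = {
    // The master URL tells the driver which coordinator to register with.
    val conf = new SparkConf()
      .setAppName("architecture-demo")
      .setMaster("spark://master-host:7077") // standalone master that manages the workers
      // .setMaster("local[*]")              // driver and executors in one JVM, for testing
      // .setMaster("yarn")                  // let YARN act as the cluster manager

    // Creating the SparkContext registers the application with the master,
    // which has executors launched on the workers for this application.
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
```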
- What is the purpose of Driver in Spark Architecture?
Ans: A Spark driver is the process that creates and owns an instance of SparkContext. It is the process that runs your application's main method, in which the SparkContext is created.
- The driver splits a Spark application into tasks and schedules them to run on executors.
- A driver is where the task scheduler lives and spawns tasks across workers.
- A driver coordinates workers and the overall execution of tasks, as sketched below.
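A minimal driver sketch, assuming the master URL is supplied externally by spark-submit; the application name and toy job are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {
  // The driver is the JVM process that runs this main method and owns the SparkContext.
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-demo"))

    // Transformations only build up the lineage in the driver; the count() action
    // makes the driver's scheduler split the job into tasks and ship them to executors.
    val result = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .count()

    println(s"count = $result") // the action's result is returned to the driver
    sc.stop()
  }
}
```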