1. What is a stage, with regard to Spark job execution?

Ans: A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job. Stages are separated at shuffle boundaries: narrow transformations are pipelined into one stage, while a wide transformation (one that requires a shuffle) starts a new stage.
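As an illustration (a minimal sketch; the input path and data are hypothetical), a simple word count that contains one shuffle runs as two stages:

```scala
import org.apache.spark.sql.SparkSession

object StageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Narrow transformations (flatMap, map) are pipelined into a single stage.
    val words = sc.textFile("input.txt")      // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey requires a shuffle, so it begins a new stage.
    // The job therefore runs as two stages: read + map, then reduce after the shuffle.
    val counts = words.reduceByKey(_ + _)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```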

 

  2. What is a task, with regard to Spark job execution?

Ans: A task is an individual unit of work for an executor to run. It is a unit of physical execution (computation) that runs on a single machine, processing one partition of the data for part of your Spark application. All tasks in a stage must complete before the next stage can start.

  • A task can also be seen as the computation performed on a single partition within a stage for a given job attempt.
  • A task belongs to a single stage and operates on a single partition (a part of an RDD).
  • Tasks are spawned for each stage, one per data partition (see the sketch below).
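For instance (a minimal sketch; the data and partition counts are arbitrary), the number of tasks launched for a stage equals the number of partitions of the RDD it computes:

```scala
import org.apache.spark.sql.SparkSession

object TaskExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("task-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD with 8 partitions: the stage that computes it runs 8 tasks,
    // each task processing exactly one partition.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
    println(s"Partitions (and tasks per stage): ${rdd.getNumPartitions}")

    // After repartition(4), the next stage runs 4 tasks.
    val repartitioned = rdd.repartition(4)
    println(s"Partitions after repartition: ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}
```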

 

  3. What is speculative execution of tasks?

Ans: Speculative tasks, or stragglers, are tasks that run significantly slower than most of the tasks in a job.

 

Speculative execution of tasks is a health-check procedure that identifies tasks to speculate, i.e. tasks in a stage running slower than the median of all successfully completed tasks in the task set. Such slow tasks are re-launched on another worker. Spark does not stop the slow task; it runs a new copy in parallel and uses the result of whichever copy finishes first.
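Speculation is disabled by default and is controlled through standard Spark configuration properties. A minimal sketch of enabling and tuning it via SparkConf (the values shown are only illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SpeculationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("speculation-example")
      .setMaster("local[*]")
      .set("spark.speculation", "true")            // enable speculative execution
      .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
      .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median a task must be
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculation starts

    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.stop()
  }
}
```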

 

  4. Which cluster managers can be used with Spark?

Ans: Apache Mesos, Hadoop YARN, Spark Standalone, and Spark local mode.

Spark local: runs on a single node in a single JVM; the driver and the executors run in the same JVM, so the same node is used for execution.
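The cluster manager is selected through the master URL, usually passed via spark-submit --master rather than hard-coded. A small sketch (hosts and ports are placeholders; only one master would be set in a real application):

```scala
import org.apache.spark.sql.SparkSession

object ClusterManagerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-manager-example")
      .master("local[*]")             // Spark local: driver and executors share a single JVM on one node
      // .master("spark://host:7077") // Spark Standalone cluster manager
      // .master("yarn")              // Hadoop YARN (cluster details come from the Hadoop configuration)
      // .master("mesos://host:5050") // Apache Mesos
      .getOrCreate()

    spark.stop()
  }
}
```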

 

 

  5. What is a BlockManager?

Ans: BlockManager is a key-value store for blocks of data that acts as a cache. It runs on every node in a Spark runtime environment, i.e. on the driver and on each executor. It provides interfaces for putting and retrieving blocks, both locally and remotely, into various stores: memory, disk, and off-heap.

 

A BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent cached RDD partitions, intermediate shuffle data, and broadcast variables.
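As a brief sketch (the RDD contents are arbitrary), the storage level passed to persist() determines which BlockManager store holds the cached partition blocks:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object BlockManagerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("blockmanager-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // Each cached partition is stored as a block by the BlockManager of the executor that computed it.
    val inMemory = numbers.persist(StorageLevel.MEMORY_ONLY)           // blocks kept in the memory store
    val onDisk   = numbers.map(_ * 2).persist(StorageLevel.DISK_ONLY)  // blocks kept in the disk store
    // StorageLevel.OFF_HEAP uses the off-heap store (requires spark.memory.offHeap.enabled=true).

    println(inMemory.count())
    println(onDisk.count())

    spark.stop()
  }
}
```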