- What is a stage, with regard to Spark job execution?
Ans: A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.
- What is a task, with regard to Spark job execution?
Ans: A task is an individual unit of work for an executor to run. It is the unit of physical execution (computation) that runs on a single machine and processes one partition of the data for part of your Spark application. All tasks in a stage must complete before the next stage can start.
- A task can also be thought of as the computation performed on a single partition within a stage for a given job attempt.
- A Task belongs to a single stage and operates on a single partition (a part of an RDD).
- Tasks are spawned per stage, one for each data partition (see the sketch below).
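A minimal Scala sketch of how this plays out (names and the partition count are illustrative): the narrow map transformation stays in the first stage, while the reduceByKey shuffle starts a second stage, with one task per partition in each stage.

```scala
import org.apache.spark.sql.SparkSession

object StageTaskExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-task-example")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 8 partitions -> 8 tasks in the first stage
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 8)

    // map is a narrow transformation: it stays in the same stage
    val pairs = words.map(w => (w, 1))

    // reduceByKey requires a shuffle, so Spark starts a new stage here;
    // the second stage runs one task per post-shuffle partition
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)  // the action triggers the job: two stages
    spark.stop()
  }
}
```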
- What is Speculative Execution of tasks?
Ans: Speculative tasks, or stragglers, are tasks that run slower than most of the other tasks in a job.
Speculative execution of tasks is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in a stage than the median of all successfully completed tasks in their taskset. Such slow tasks are re-launched on another worker. Spark does not stop the slow task; it runs a new copy in parallel and uses the result of whichever copy finishes first.
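Speculation is enabled through configuration. A minimal sketch of turning it on when building a SparkSession; the threshold values below are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-example")
  .config("spark.speculation", "true")            // enable the speculation health check
  .config("spark.speculation.interval", "100ms")  // how often to check for stragglers
  .config("spark.speculation.multiplier", "1.5")  // a task is "slow" if it runs > 1.5x the median
  .config("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
  .getOrCreate()
```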
- Which cluster managers can be used with Spark?
Ans:
Apache Mesos, Hadoop YARN, Spark standalone, and
Spark local: runs on a single node in a single JVM. The driver and executor run in the same JVM, so the same node is used for execution.
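For illustration, the cluster manager is selected through the master URL passed to spark-submit or the SparkSession builder; the host names and ports below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Local mode: driver and executor share one JVM on the local node.
val spark = SparkSession.builder()
  .appName("cluster-manager-example")
  .master("local[*]")                       // Spark local, using all available cores
  // .master("spark://master-host:7077")    // Spark standalone (placeholder host/port)
  // .master("yarn")                        // Hadoop YARN (cluster settings from HADOOP_CONF_DIR)
  // .master("mesos://mesos-host:5050")     // Apache Mesos (placeholder host/port)
  .getOrCreate()
```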
- What is a BlockManager?
Ans: The BlockManager is a key-value store for blocks of data that acts as a cache. It runs on every node in a Spark runtime environment, i.e. on the driver and on each executor. It provides interfaces for putting and retrieving blocks both locally and remotely, across various stores: memory, disk, and off-heap.
A BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent cached RDD partitions, intermediate shuffle data, and broadcast data.
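As a rough sketch, caching an RDD is one way to see the BlockManager at work: each persisted partition is stored as a block on an executor's BlockManager, and broadcast variables are served through it as well. The data and names below are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("block-manager-example")
  .master("local[2]")
  .getOrCreate()
val sc = spark.sparkContext

// Each cached partition becomes a block held by the executor's BlockManager,
// kept in memory first and spilled to disk if memory runs short.
val data = sc.parallelize(1 to 1000000, numSlices = 4)
  .map(_ * 2)
  .persist(StorageLevel.MEMORY_AND_DISK)

data.count()  // the first action materializes and stores the blocks

// Broadcast data is also distributed and cached through the BlockManager on each node.
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
println(lookup.value(1))

spark.stop()
```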