Question 51: How do you compare MapReduce and Spark Application?

Answer: Spark has several advantages over a Hadoop MapReduce job. The key differences are described below.

MapReduce: The highest-level unit of computation in MapReduce is a job. A job loads the data, applies a map function, shuffles the intermediate output, runs a reduce function, and finally writes the results back to persistent storage.

Spark Application: The highest-level unit of computation is an application. A Spark application can run a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests, so it can consist of far more than a single MapReduce-style job.

MapReduce starts a new process for each task. In contrast, a Spark application has executor processes running on its behalf even when no job is currently running, and multiple tasks can run within the same executor. Combining this extremely fast task startup with in-memory data storage gives Spark performance that can be orders of magnitude faster than MapReduce.

Question 52: Please explain the Spark execution model?

Answer: The Spark execution model is built on the following concepts (a minimal code sketch follows the list):

  • Driver: An application maps to a single driver process. The driver manages the job flow, schedules tasks, and is available for the entire time the application is running. Typically, the driver process is the same as the client process used to initiate the job, although when run on YARN the driver can run inside the cluster. In interactive mode, the shell itself is the driver process.
  • Executor: For a single application (driver), a set of executor processes is distributed across the hosts in the cluster. The executors are responsible for performing work in the form of tasks, and for storing any data that you cache. An executor's lifetime depends on whether dynamic allocation is enabled. Each executor has a number of slots for running tasks and can run many tasks concurrently throughout its lifetime.
  • Stage: A stage is a collection of tasks that run the same code, each on a different subset of the data.
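
A minimal sketch of how these pieces fit together, assuming a word-count job over a hypothetical HDFS input path: the main() process acts as the driver, executors run the tasks, and the shuffle introduced by reduceByKey splits the job into two stages.

```scala
// Hypothetical sketch of a minimal Spark application (input path is assumed).
import org.apache.spark.sql.SparkSession

object ExecutionModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("execution-model-sketch")
      .getOrCreate()                          // the driver process starts here
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///tmp/input.txt")  // assumed input path
      .flatMap(_.split("\\s+"))               // stage 1: narrow transformations
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // shuffle boundary -> stage 2

    counts.collect().foreach(println)         // action: driver schedules tasks on executors
    spark.stop()
  }
}
```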

Question 53: What is Spark Streaming?

Answer: Spark Streaming is an extension of core Spark that enables scalable, high-throughput, fault-tolerant processing of live data streams. Spark Streaming receives input data streams and divides the data into micro-batches; the resulting stream of batches is represented as a DStream (discretized stream).
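
A minimal sketch, assuming a local socket source on an illustrative host and port, showing how the StreamingContext's batch interval turns a live stream into 10-second micro-batches:

```scala
// Hypothetical sketch: a StreamingContext with a 10-second batch interval.
// The socket host and port are illustrative assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("streaming-sketch")
      .setMaster("local[2]")                             // local run: one thread for the receiver
    val ssc = new StreamingContext(conf, Seconds(10))    // each batch covers 10 seconds of data

    val lines = ssc.socketTextStream("localhost", 9999)  // input DStream
    lines.count().print()                                // record count per micro-batch

    ssc.start()             // start receiving and processing micro-batches
    ssc.awaitTermination()  // run until stopped or failed
  }
}
```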

Question 54: How do you define DStream or what is DStream?

Answer: A DStream (discretized stream) is a continuous sequence of RDDs representing a stream of data. You can create a DStream from sources such as Kafka, Flume, and Kinesis, or by applying operations on other DStreams. Every receiver-based input DStream is associated with a Receiver, which receives the data from the source and stores it in executor memory.
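
A minimal sketch, again using an assumed local socket source: the input DStream is backed by a Receiver that stores incoming blocks in executor memory, and new DStreams are created simply by applying operations to it.

```scala
// Hypothetical sketch: a Receiver-backed input DStream and DStreams derived from it.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // The Receiver stores incoming blocks in executor memory (spilling to disk here)
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Applying operations on an existing DStream yields new DStreams
    val words  = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```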

Question 55: What is Dynamic Allocation?

Answer: Dynamic allocation allows Spark (on YARN) to scale the cluster resources allocated to your application up and down based on the workload. When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request additional executors. When the application becomes idle, its executors are released and can be acquired by other applications.
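
A minimal configuration sketch; the property names are standard Spark settings, but the executor counts and idle timeout are illustrative assumptions, not recommendations:

```scala
// Hypothetical sketch: enabling dynamic allocation when building the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")              // external shuffle service on YARN
  .config("spark.dynamicAllocation.minExecutors", "1")          // illustrative lower bound
  .config("spark.dynamicAllocation.maxExecutors", "20")         // illustrative upper bound
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s") // release executors idle this long
  .getOrCreate()
```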

Note that even with dynamic resource allocation enabled, all available resources may be allocated to the first submitted application, causing subsequent applications to be queued. To allow applications to acquire resources in parallel, allocate resources to scheduler pools, run the applications in those pools, and enable preemption for applications running in the pools.