Question 21: How would you set the amount of memory to allocate to each executor?
Answer: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
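As a sketch, the variable can be exported in `conf/spark-env.sh` (standalone mode), or the equivalent per-application setting can be passed to `spark-submit`; the `4g` value, the class name, and the jar name below are placeholders:

```shell
# In conf/spark-env.sh (standalone mode); 4g is an example value
export SPARK_EXECUTOR_MEMORY=4g

# Equivalent per-application setting via spark-submit
# (com.example.MyApp and my-app.jar are hypothetical names)
spark-submit --executor-memory 4g --class com.example.MyApp my-app.jar
```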
Question 22: How do you define RDD?
Answer: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- Resilient: Fault-tolerant, so missing or damaged partitions can be recomputed on node failure with the help of the RDD lineage graph.
- Distributed: Data resides on multiple nodes across a cluster.
- Dataset: A collection of partitioned data.
- Typed: Data in an RDD is strongly typed.
- Lazy evaluation: Transformations (creating a new RDD from an existing RDD) are lazy.
- Immutable: Once you create an RDD, its content cannot be changed.
- Parallel processing: Because a single RDD is partitioned across the nodes of the cluster, its partitions can be worked on in parallel.
- Caching: You can cache an RDD in memory if you need it later, rather than recomputing it again and again, which gives a performance boost.
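The properties above can be seen in a short sketch; the app name and the local master setting are illustrative choices, not part of the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddProperties {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Dataset, partitioned and distributed: 4 partitions
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // Immutable + lazy: map returns a new RDD; nothing runs yet
    val squares = nums.map(n => n * n)

    // Caching: keep the computed partitions in memory for reuse
    squares.cache()

    // Actions trigger parallel computation over the partitions
    println(squares.sum())   // computes and caches
    println(squares.count()) // served from the cache

    sc.stop()
  }
}
```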
Question 23: What does it mean that an RDD is lazily evaluated?
Answer: Lazily evaluated means the data inside an RDD is not computed or transformed until an action is executed that triggers the execution.
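A minimal sketch of this behavior, assuming an existing SparkContext `sc` and a hypothetical input file `data.txt`:

```scala
val lines  = sc.textFile("data.txt")            // no file is read yet
val errors = lines.filter(_.contains("ERROR"))  // still nothing executed
val n      = errors.count()                     // action: the file is now read and filtered
```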
Question 24: How would you control the number of partitions of an RDD?
Answer: You can control the number of partitions of an RDD using the repartition or coalesce operations.
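A sketch of the two operations, again assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)   // 8

// repartition can increase or decrease the partition count;
// it always performs a full shuffle
val more = rdd.repartition(16)
println(more.getNumPartitions)  // 16

// coalesce is intended for decreasing the count;
// it avoids a full shuffle where possible
val fewer = rdd.coalesce(2)
println(fewer.getNumPartitions) // 2
```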
Question 25: What are the possible operations on an RDD?
Answer: RDDs support two kinds of operations:
- Transformations - lazy operations that return another RDD.
- Actions - operations that trigger computation and return values.
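The two kinds of operations can be sketched as follows, assuming an existing SparkContext `sc`:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "action"))

// Transformations: lazy, each returns a new RDD
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions: trigger computation and return values to the driver
val asMap = counts.collectAsMap() // Map(spark -> 2, rdd -> 1, action -> 1)
val total = words.count()         // 4
```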