Question 21: How would you set the amount of memory to allocate to each executor?

Answer: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
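For example, it can be set as an environment variable, or the equivalent setting can be passed per job at submit time (the application name and jar below are placeholders):

```shell
# In conf/spark-env.sh (or the shell environment that launches Spark)
export SPARK_EXECUTOR_MEMORY=4g

# Equivalent per-job setting at submit time
spark-submit --executor-memory 4g --class com.example.MyApp my-app.jar
```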

 

Question 22: How do you define RDD?

Answer: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

 

  • Resilient: Fault-tolerant, so missing or damaged partitions can be recomputed on node failure with the help of the RDD lineage graph.
  • Distributed: Data resides on multiple nodes across a cluster.
  • Dataset: A collection of partitioned data.
  • Typed: Data in an RDD is strongly typed.
  • Lazy evaluation: Transformations (creating a new RDD from an existing RDD) are lazy.
  • Immutable: Once you create an RDD, its content cannot be changed.
  • Parallel Processing: A single RDD, distributed across the nodes of the cluster, can be operated on in parallel.
  • Caching: You can cache an RDD in memory if you need it later, rather than recomputing it again and again (which gives a performance boost).

 

Question 23: What does it mean that an RDD is lazily evaluated?

Answer: Lazily evaluated means the data inside an RDD is not available or transformed until an action is executed that triggers the computation.

 

Question 24: How would you control the number of partitions of an RDD?

Answer: You can control the number of partitions of an RDD using the repartition or coalesce operations.

 

Question 25: What are the possible operations on an RDD?

Answer: RDDs support two kinds of operations:

  • Transformations - lazy operations that return another RDD.
  • Actions - operations that trigger computation and return values.