Question 21: How would you set the amount of memory to allocate to each executor?
Answer: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
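As a sketch, the variable can be exported in `conf/spark-env.sh` (standalone mode), or the equivalent per-application setting can be passed to `spark-submit`; the `4g` value, the class name, and the jar name below are placeholders:

```shell
# In conf/spark-env.sh (standalone mode); 4g is an example value
export SPARK_EXECUTOR_MEMORY=4g

# Equivalent per-application setting via spark-submit
# (com.example.MyApp and my-app.jar are hypothetical names)
spark-submit --executor-memory 4g --class com.example.MyApp my-app.jar
```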
Question 22: How do you define RDD?
Answer: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- Resilient: Fault-tolerant, so missing or damaged partitions can be recomputed on node failure with the help of the RDD lineage graph.
- Distributed: Data resides on multiple nodes across a cluster.
- Dataset: A collection of partitioned data.
- Typed: Data in an RDD is strongly typed.
- Lazy evaluation: Transformations (creating a new RDD from an existing RDD) are lazy.
- Immutable: Once you create an RDD, its content cannot be changed.
- Parallel processing: Because a single RDD is partitioned across the nodes of the cluster, its partitions can be worked on in parallel.
- Caching: You can cache an RDD in memory if you need it later, rather than recomputing it again and again, which gives a performance boost.
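The properties above can be seen in a short sketch; the app name and the local master setting are illustrative choices, not part of the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddProperties {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Dataset, partitioned and distributed: 4 partitions
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // Immutable + lazy: map returns a new RDD; nothing runs yet
    val squares = nums.map(n => n * n)

    // Caching: keep the computed partitions in memory for reuse
    squares.cache()

    // Actions trigger parallel computation over the partitions
    println(squares.sum())   // computes and caches
    println(squares.count()) // served from the cache

    sc.stop()
  }
}
```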
Question 23: What does it mean that an RDD is lazily evaluated?
Answer: Lazily evaluated means the data inside an RDD is not computed or transformed until an action is executed that triggers the execution.
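A minimal sketch of this behavior, assuming an existing SparkContext `sc` and a hypothetical input file `data.txt`:

```scala
val lines  = sc.textFile("data.txt")            // no file is read yet
val errors = lines.filter(_.contains("ERROR"))  // still nothing executed
val n      = errors.count()                     // action: the file is now read and filtered
```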
Question 24: How would you control the number of partitions of an RDD?
Answer: You can control the number of partitions of an RDD using the repartition or coalesce operations.
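A sketch of the two operations, again assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)   // 8

// repartition can increase or decrease the partition count;
// it always performs a full shuffle
val more = rdd.repartition(16)
println(more.getNumPartitions)  // 16

// coalesce is intended for decreasing the count;
// it avoids a full shuffle where possible
val fewer = rdd.coalesce(2)
println(fewer.getNumPartitions) // 2
```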
Question 25: What are the possible operations on an RDD?
Answer: RDDs support two kinds of operations:
- Transformations - lazy operations that return another RDD.
- Actions - operations that trigger computation and return values.
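The two kinds of operations can be sketched as follows, assuming an existing SparkContext `sc`:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "action"))

// Transformations: lazy, each returns a new RDD
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions: trigger computation and return values to the driver
val asMap = counts.collectAsMap() // Map(spark -> 2, rdd -> 1, action -> 1)
val total = words.count()         // 4
```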