21. How do you define RDD?
Ans: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
· Resilient: fault-tolerant, and so able to recompute missing or damaged partitions after node failures with the help of the RDD lineage graph.
· Distributed: the data resides on multiple nodes in a cluster.
· Dataset: a collection of partitioned data.
22. What does a lazily evaluated RDD mean?
Ans: Lazy evaluation means that the data inside an RDD is not available or transformed until an action is executed that triggers the computation.
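A minimal Scala sketch of lazy evaluation, assuming Spark is on the classpath and a local master. The object name, app name, and accumulator name are illustrative. The accumulator counts how often the map function actually runs, so it should stay at zero until an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the accumulator value before and after the action.
  def demo(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls") // illustrative name
    val doubled = sc.parallelize(Seq(1, 2, 3, 4)).map { x =>
      calls.add(1) // runs only when a job executes
      x * 2
    }
    val before: Long = calls.value // no action yet, so nothing has run
    doubled.count()                // action: triggers the map
    val after: Long = calls.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-sketch")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop()
  }
}
```

On a failure-free local run the pair should be (0, 4). Accumulator updates made inside transformations can be applied more than once if a task is retried, so treat the counter as a teaching aid rather than an exact metric.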
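23. How would you control the number of partitions of an RDD?
Ans: You can control the number of partitions of an RDD using the repartition or coalesce operations. repartition always shuffles the data and can increase or decrease the number of partitions; coalesce avoids a shuffle by default and is therefore mainly used to reduce it.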
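The two operations can be sketched as follows, assuming Spark is on the classpath and a local master. The object and app names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Returns partition counts: initial, after repartition, after coalesce.
  def demo(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 100, numSlices = 4) // start with 4 partitions
    val more = rdd.repartition(8) // full shuffle; can grow or shrink the count
    val fewer = rdd.coalesce(2)   // no shuffle by default; merges partitions
    (rdd.getNumPartitions, more.getNumPartitions, fewer.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-count-sketch")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop() // prints (4,8,2)
  }
}
```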
24. What are the possible operations on an RDD?
Ans: RDDs support two kinds of operations:
· transformations - lazy operations that return another RDD.
· actions - operations that trigger computation and return values.
25. How does an RDD help parallel job processing?
Ans: Spark runs jobs in parallel: an RDD is split into partitions, which are processed and written in parallel. Inside a partition, data is processed sequentially.
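A small Scala sketch of per-partition processing, assuming Spark is on the classpath and a local master. The object and app names are illustrative. Each partition is handled by its own task, and the iterator inside a task is consumed sequentially.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionProcessingSketch {
  // Sums each partition separately and tags the sum with the partition index.
  def demo(sc: SparkContext): List[(Int, Int)] = {
    val rdd = sc.parallelize(1 to 8, numSlices = 4)
    rdd.mapPartitionsWithIndex { (index, elements) =>
      // `elements` is an iterator walked sequentially inside one task
      Iterator((index, elements.sum))
    }.collect().toList
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("partition-processing-sketch")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop() // List((0,3), (1,7), (2,11), (3,15))
  }
}
```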
26. What is a transformation?
Ans: A transformation is a lazy operation on an RDD that returns another RDD, like map, flatMap, filter, reduceByKey, join, cogroup, etc. Transformations are not executed immediately, but only once an action is called.
27. How do you define actions?
Ans: An action is an operation that triggers execution of RDD transformations and returns a value to the Spark driver (the user program). Simply put, an action evaluates the RDD lineage graph.
You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data.
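A word-count sketch showing transformations followed by an action, assuming Spark is on the classpath and a local master. The object name, app name, and input strings are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsAndActionsSketch {
  def demo(sc: SparkContext): Map[String, Int] = {
    val lines = sc.parallelize(Seq("a b a", "b c"))
    // Transformations: each returns a new RDD and runs nothing yet.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Action: runs the whole chain and returns the result to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("word-count-sketch")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop()
  }
}
```

Here collect is the valve: the three transformations before it only describe the pipeline.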
28. How can you create an RDD for a text file?
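Ans: One common way is SparkContext.textFile, which returns an RDD[String] with one element per line of the file. A minimal sketch, assuming Spark is on the classpath and a local master; the object name, app name, and temporary file are illustrative only.

```scala
import java.nio.file.Files
import java.util.Arrays
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  def demo(sc: SparkContext): List[String] = {
    // Write a small temporary file so the sketch is self-contained.
    val path = Files.createTempFile("rdd-sketch", ".txt")
    path.toFile.deleteOnExit()
    Files.write(path, Arrays.asList("first line", "second line"))
    // textFile takes a path or URI on any Hadoop-supported file system.
    val lines = sc.textFile(path.toUri.toString)
    lines.collect().toList
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("text-file-sketch")
    val sc = new SparkContext(conf)
    try demo(sc).foreach(println) finally sc.stop()
  }
}
```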
29. What are preferred locations?
Ans: A preferred location (aka locality preference or placement preference) is a hint about where each partition should be computed, for example the hosts that hold the corresponding block of an HDFS file.
def getPreferredLocations(split: Partition): Seq[String] specifies placement preferences for a partition in an RDD.
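Placement preferences can be inspected from user code through the public RDD.preferredLocations method. The sketch below assumes Spark is on the classpath and a local master; the object name, app name, and host names are made up for illustration. It uses the SparkContext.makeRDD overload that attaches host preferences to each element.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsSketch {
  def demo(sc: SparkContext): List[List[String]] = {
    // makeRDD with (element, hosts) pairs creates one partition per element
    // and records the hosts as that partition's location preference.
    val rdd = sc.makeRDD[String](Seq(
      ("a", Seq("host1")),         // hypothetical host names
      ("b", Seq("host2", "host3"))
    ))
    rdd.partitions.toList.map(p => rdd.preferredLocations(p).toList)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preferred-locations-sketch")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop()
  }
}
```

The preferences are only hints to the scheduler; on a local master the tasks still run in the single local JVM.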
30. What is an RDD lineage graph?
Ans: An RDD lineage graph (aka RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD. An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
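The lineage can be inspected from the driver, as in this sketch, which assumes Spark is on the classpath and a local master. The object and app names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def demo(sc: SparkContext): (Boolean, String) = {
    val base = sc.parallelize(1 to 10)
    val evens = base.filter(_ % 2 == 0)
    val squares = evens.map(x => x * x)
    // Each RDD keeps references to its parents; together they form the lineage.
    val parentIsEvens = squares.dependencies.head.rdd == evens
    // toDebugString renders the lineage as text (layout varies by Spark version).
    (parentIsEvens, squares.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try {
      val (parentIsEvens, lineage) = demo(sc)
      println(parentIsEvens)
      println(lineage)
    } finally sc.stop()
  }
}
```

No action is called on squares here: the lineage exists as soon as the transformations are applied, before any data is processed.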