- An RDD's storage level is set to MEMORY_ONLY_2. What does the _2 mean?
Ans: The _2 suffix means that each partition is replicated on two cluster nodes.
- What is shuffling?
Ans: Shuffling is the process of redistributing (repartitioning) data across partitions. It can move data between JVMs, or over the network when the data is redistributed among executors.
Shuffling is expensive, so avoid it wherever you can: reuse existing partitioning, and use partial aggregation to reduce the amount of data transferred.
- Does shuffling change the number of partitions?
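Ans: Not by default. When an RDD operation is called without an explicit partition count (and spark.default.parallelism is not set), the shuffled RDD keeps the partition count of its largest parent; only the contents of the partitions change. Spark SQL shuffles are different: they use spark.sql.shuffle.partitions (200 by default).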
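To make the last two answers concrete, here is a minimal plain-Python model of a shuffle. It is not Spark code: the nested lists standing in for partitions, the `target_partition` helper, and its `key % num_partitions` rule are assumptions made for illustration. It shows records moving between partitions while the number of partitions stays the same.

```python
# Plain-Python model of a shuffle; no Spark required.
# Each inner list stands for one partition of (key, value) records.
partitions = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(1, "d"), (4, "e")],
    [(2, "f"), (3, "g")],
]


def target_partition(key, num_partitions):
    """Hypothetical hash partitioner: integer keys map to key % num_partitions."""
    return key % num_partitions


def shuffle(partitions):
    """Redistribute records so that all records with the same key share a partition."""
    num_partitions = len(partitions)  # the partition count is left unchanged
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, partition in enumerate(partitions):
        for key, value in partition:
            target_index = target_partition(key, num_partitions)
            shuffled[target_index].append((key, value))
            if target_index != source_index:
                moved += 1  # this record crosses a partition boundary
    return shuffled, moved


shuffled, moved = shuffle(partitions)
print(len(partitions), len(shuffled))  # same number of partitions before and after
print(moved)                           # records that had to change partition
```

In a real cluster, each record counted in `moved` is one that may have to travel to another JVM or across the network, which is where the cost of a shuffle comes from.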
- What is the difference between groupByKey and reduceByKey?
Ans: Prefer reduceByKey or combineByKey over groupByKey when aggregating values per key.
groupByKey shuffles every key-value pair, which is slow.
reduceByKey first combines the values for each key within each partition, then shuffles only those partial results.
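The difference in shuffle volume can be sketched with a plain-Python model. This is not the Spark API: `group_by_key_model` and `reduce_by_key_model` are hypothetical helpers, and the nested lists stand in for partitions. Both compute the same per-key sums; they differ only in how many records go through the simulated shuffle.

```python
from collections import defaultdict

# Each inner list models one partition of (word, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]


def group_by_key_model(partitions):
    """Send every record through the shuffle, then sum per key afterwards."""
    shuffled = [record for partition in partitions for record in partition]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def reduce_by_key_model(partitions):
    """Sum per key inside each partition first, then shuffle only the partial sums."""
    shuffled = []
    for partition in partitions:
        partial = defaultdict(int)
        for key, value in partition:
            partial[key] += value  # partial (map-side) aggregation
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


group_totals, group_shuffled = group_by_key_model(partitions)
reduce_totals, reduce_shuffled = reduce_by_key_model(partitions)
print(group_totals == reduce_totals)    # both approaches agree on the sums
print(group_shuffled, reduce_shuffled)  # records shuffled by each approach
```

With partial aggregation, each partition contributes at most one record per distinct key, so the saving grows with the number of repeated keys per partition.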
- When you call the join operation on two pair RDDs, e.g. (K, V) and (K, W), what is the result?
Ans: When called on datasets of type (K, V) and (K, W), join returns a dataset of (K, (V, W)) pairs containing every combination of values for each key. It is an inner join, so keys that appear in only one of the two datasets are dropped.
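The shape of the join result can be illustrated with a small plain-Python model. `join_model` is a hypothetical helper, not Spark's implementation, and it assumes inner-join semantics; the ordering of a real RDD join's output is not guaranteed to match.

```python
from collections import defaultdict


def join_model(left, right):
    """Inner join of (K, V) and (K, W) pairs into (K, (V, W)) pairs."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    # Pair every left value with every right value that shares its key.
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]


left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
right = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]

joined = join_model(left, right)
for pair in joined:
    print(pair)
```

Key "a" has two values on each side and so yields four pairs, while "c" and "d" each appear on one side only and yield none.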