Question 96: What is Spark dataset partition?

Answer: A Spark Dataset comprises a fixed number of partitions, and each partition contains a number of records.
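The idea can be illustrated with a plain-Python sketch (no Spark required); the `partition_dataset` helper below is purely illustrative, not Spark's API:

```python
# Plain-Python sketch of splitting a dataset into a fixed number of
# partitions; the function name and round-robin scheme are illustrative only.
def partition_dataset(records, num_partitions):
    """Distribute records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

parts = partition_dataset(list(range(10)), num_partitions=3)
print(len(parts))               # 3 partitions
print([len(p) for p in parts])  # each partition holds a share of the records
```

Every record lives in exactly one partition, and the partition count is fixed once the dataset is created.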

Question 97: What happens during shuffling?

Answer: Spark performs a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.
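A shuffle can be simulated in plain Python: records are redistributed across a new set of partitions by the hash of their key, so that all records sharing a key land in the same output partition. This is a conceptual sketch, not Spark's actual implementation:

```python
# Illustrative simulation of a shuffle: redistribute (key, value) records
# across a new set of partitions by key hash (not Spark's real code).
def shuffle(partitions, num_output_partitions):
    out = [[] for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            # every record with the same key lands in the same new partition
            out[hash(key) % num_output_partitions].append((key, value))
    return out

before = [[("a", 1), ("b", 1)], [("a", 2), ("c", 1)]]
after = shuffle(before, num_output_partitions=2)
```

Note that every record potentially moves to a different partition (and, on a real cluster, a different node), which is why a shuffle involves network I/O.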

Question 98: Given the code snippet below, do you think data shuffling will be applied to partitions across the nodes?

sc.textFile("hadoopexam.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc)

Answer: Shuffling is not needed when the data in one partition does not depend on data in other partitions, i.e., when the transformation does not require records from a partition on a different node. All three transformations here (map, flatMap, and filter) are narrow: each output record depends only on a single input record, so no data is shuffled across partitions.
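This can be demonstrated with a small plain-Python sketch (the helper names are illustrative, not Spark's API): the entire map/flatMap/filter pipeline runs on each partition in isolation, with no data from any other partition.

```python
# Sketch of why map, flatMap, and filter need no shuffle: each can be
# applied to one partition in isolation (helper names are illustrative).
def map_partition(part, f):
    return [f(x) for x in part]

def flat_map_partition(part, f):
    return [y for x in part for y in f(x)]

def filter_partition(part, pred):
    return [x for x in part if pred(x)]

partitions = [["a b", "c"], ["d e"]]
# The whole pipeline runs independently on each partition:
result = [
    filter_partition(
        flat_map_partition(map_partition(p, str.upper), str.split),
        lambda w: w != "C",
    )
    for p in partitions
]
print(result)  # [['A', 'B'], ['D', 'E']]
```

Because each partition's output is computed from that partition alone, Spark can chain these narrow transformations within a single stage.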

Question 99: Why does having more stages impact the performance of a Spark application?

Answer: Spark code with more stages tends to suffer in performance, because at each stage boundary data must be persisted (either in cache or on disk), and data may also be shuffled across the partitions. Wherever these two kinds of I/O (network and disk) come into the picture, there is a large performance impact, and stage boundaries in a Spark job introduce both.

Question 100: What is the performance impact of setting or changing numPartitions during a transformation?

Answer: Transformations that can trigger a stage boundary typically accept a numPartitions argument, which specifies into how many partitions to split the data in the child stage. Just as the number of reducers is an important parameter in MapReduce jobs, the number of partitions at stage boundaries can determine an application's performance.
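As a toy illustration (plain Python, not Spark's implementation), a reduceByKey-style aggregation shows how a `numPartitions` argument fixes the partition count of the child stage, independent of the parent's partition count:

```python
# Toy reduceByKey-style aggregation showing how the numPartitions argument
# determines the partition count of the child stage (illustrative only).
def reduce_by_key(partitions, func, num_partitions):
    buckets = [{} for _ in range(num_partitions)]  # one dict per output partition
    for part in partitions:
        for key, value in part:
            b = buckets[hash(key) % num_partitions]
            b[key] = func(b[key], value) if key in b else value
    return [list(b.items()) for b in buckets]

data = [[("a", 1), ("b", 2)], [("a", 3)]]  # parent stage: 2 partitions
out = reduce_by_key(data, lambda x, y: x + y, num_partitions=4)
print(len(out))  # child stage: exactly 4 partitions
```

Choosing this number well matters: too few partitions underuses the cluster's parallelism, while too many adds per-task overhead, much like tuning the reducer count in MapReduce.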