Spark : Interview Questions Part-5

41. What is coalesce transformation?

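Ans: The coalesce transformation changes the number of partitions of an RDD. Whether it triggers a shuffle depends on its second parameter, the boolean shuffle (default false). Without a shuffle it can only reduce the number of partitions; increasing the number requires shuffle set to true.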
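To make the no-shuffle case concrete, here is a minimal plain-Python model (not the Spark API). The function name coalesce_no_shuffle is hypothetical, and grouping neighbouring partitions is an assumption made for illustration; Spark's own grouping also considers data locality.

```python
from typing import List


def coalesce_no_shuffle(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy model of coalesce(num_partitions, shuffle=False)."""
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count cannot grow, so nothing changes.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Each parent partition is moved wholesale into one target partition;
        # individual records are never redistributed.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce_no_shuffle(parts, 2))       # [[1, 2, 3], [4, 5, 6]]
print(len(coalesce_no_shuffle(parts, 8)))  # 4 (unchanged)
```

Asking for 8 partitions leaves the 4 original ones in place, which mirrors why growing the partition count needs a shuffle.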

42. What is the difference between the cache() and persist() methods of an RDD?

Ans: RDDs can be cached (using RDD’s cache() operation) or persisted (using RDD’s persist(newLevel: StorageLevel) operation). cache() is equivalent to calling persist() with the default storage level MEMORY_ONLY, whereas persist(newLevel) lets you choose the storage level.

43. You have an RDD storage level defined as MEMORY_ONLY_2. What does _2 mean?

Ans: The _2 suffix means 2 replicas: each partition is replicated on two cluster nodes.

44. What is Shuffling?

Ans: Shuffling is the process of repartitioning (redistributing) data across partitions. It may move data across JVMs, or even over the network when the data is redistributed among executors.

Avoid shuffling wherever you can, because moving data between executors is expensive. Look for ways to reuse existing partitioning, and use partial aggregation to reduce the amount of data transferred.

45. Does shuffling change the number of partitions?

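Ans: No. By default, shuffling changes the content of the partitions, not their number. The number changes only when the operation is given an explicit partition count, for example repartition(n) or reduceByKey(func, n).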
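The following plain-Python sketch (not the Spark API) models a hash-partitioned shuffle to illustrate the point. The function shuffle_by_key is hypothetical, and using integer keys with key % n stands in for hashing the key modulo the partition count.

```python
def shuffle_by_key(partitions, num_partitions=None):
    """Toy shuffle: route every (key, value) record to partition key % n."""
    # With no explicit count, keep the same number of partitions as the input.
    n = num_partitions if num_partitions is not None else len(partitions)
    out = [[] for _ in range(n)]
    for part in partitions:
        for key, value in part:
            out[key % n].append((key, value))
    return out


before = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")], [(2, "e"), (3, "f")]]
after = shuffle_by_key(before)
print(len(before), len(after))  # 3 3
print(after)
# [[(3, 'c'), (3, 'f')], [(1, 'a'), (1, 'd')], [(2, 'b'), (2, 'e')]]
```

The partition count stays at 3, but every record with the same key now sits in the same partition.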

46. What is the difference between groupByKey and reduceByKey?

Ans: Both operate on values grouped by key, but they differ in how much data they shuffle. Prefer reduceByKey or combineByKey over groupByKey.

groupByKey shuffles all the data, which is slow.

reduceByKey first combines the values within each partition and shuffles only the results of those sub-aggregations.
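As a rough illustration of the difference in shuffle volume, this plain-Python model (not the Spark API) counts how many records each approach would send across the shuffle. Both function names and the sample data are hypothetical.

```python
def shuffled_records_group_by_key(partitions):
    # groupByKey sends every (key, value) record across the shuffle.
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by_key(partitions, func):
    # reduceByKey combines values per key inside each partition first,
    # so at most one record per key per partition is shuffled.
    total = 0
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        total += len(combined)
    return total


partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]
print(shuffled_records_group_by_key(partitions))                       # 7
print(shuffled_records_reduce_by_key(partitions, lambda x, y: x + y))  # 4
```

The gap widens as the number of values per key grows, since the pre-combined record count depends only on the number of distinct keys in each partition.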

47. When you call join operation on two pair RDDs e.g. (K, V) and (K, W), what is the result?

Ans: When called on datasets of type (K, V) and (K, W), join returns a dataset of (K, (V, W)) pairs containing every pairing of elements for each key [68]. Keys that appear in only one of the two datasets are dropped.
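The result shape can be sketched with ordinary Python lists. This is a model of the inner-join semantics only, not the Spark API; the function join and the sample data are hypothetical, and Spark does not guarantee the order of the output.

```python
from collections import defaultdict


def join(left, right):
    """Toy inner join of (K, V) pairs with (K, W) pairs into (K, (V, W))."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    # Emit every (V, W) combination for each key present on both sides.
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]


left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("a", "y"), ("c", "z")]
print(join(left, right))
# [('a', (1, 'x')), ('a', (1, 'y')), ('a', (2, 'x')), ('a', (2, 'y'))]
```

Key "a" has two values on each side, so it yields four pairs; "b" and "c" appear on only one side and produce nothing.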

48. What is checkpointing?

Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a reliable distributed file system (such as HDFS) or to a local file system. Reliable RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.

You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory (set with SparkContext.setCheckpointDir) and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.

49. What do you mean by Dependencies in RDD lineage graph?

Ans: Dependency is a connection between RDDs after applying a transformation.

50. Which script do you use to launch a Spark application? Would you use spark-shell?

Ans: You use the spark-submit script to launch a Spark application, i.e. to submit the application to a Spark deployment environment.

Category: Spark Interview Questions; Last Updated: 31 January 2020
