Question 36: What are wide transformations?

Answer: Wide transformations are the result of operations such as groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD.

All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
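As a minimal sketch in the Spark shell (where sc is predefined; the sample data is made up), toDebugString exposes the shuffle boundary that the wide transformation reduceByKey introduces:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 4)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// The lineage shows a ShuffledRDD, i.e. a new stage created by the shuffle.
println(counts.toDebugString)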


Question 37: How can you use a machine learning library such as scikit-learn, which is written in Python, with the Spark engine?

Answer: A machine learning tool written in Python, e.g. scikit-learn, can be used either by wrapping it as a stage in the Spark MLlib Pipeline API or by invoking it as an external process via pipe().
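As a minimal sketch of the pipe() approach, assume a hypothetical script score.py, available on every worker, that reads one record per line from stdin and writes one prediction per line to stdout:

// Sample feature records; in practice these would come from your dataset.
val features = sc.parallelize(Seq("1.0,2.0", "3.0,4.0", "5.0,6.0"))

// pipe() streams each partition's records through the external command.
val predictions = features.pipe("python score.py")
predictions.collect().foreach(println)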


Question 38: Why is Spark good at low-latency iterative workloads, e.g. graphs and machine learning?

Answer: Machine learning algorithms, for instance logistic regression, require many iterations before producing an optimal model. Similarly, graph algorithms traverse all the nodes and edges. Any algorithm that needs many iterations before producing its result can gain performance when the intermediate partial results are stored in memory or on very fast solid-state drives.

Spark can cache/store intermediate data in memory for faster model building and training.

Also, when graph algorithms are processed, they traverse the graph one connection per iteration while keeping the partial result in memory. Less disk access and network traffic can make a huge difference when you need to process lots of data.
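A minimal sketch of the caching idea in the Spark shell; the data and the loop body are made up, but the pattern is the same for real iterative algorithms:

// cache() keeps the partitions in memory after the first pass.
val data = sc.parallelize(1 to 1000000).cache()

// Each subsequent iteration reuses the cached data instead of recomputing it.
var result = 0L
for (i <- 1 to 10) {
  result = data.map(_.toLong * i).reduce(_ + _)
}
println(result)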

Question 39: Data is spread across all the nodes of a cluster; how does Spark try to process this data?

Answer: By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
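A minimal sketch of how to inspect this in the Spark shell; the HDFS path is a placeholder:

// One partition per HDFS block by default; tasks are scheduled near the blocks.
val logs = sc.textFile("hdfs:///data/logs.txt")
println(logs.getNumPartitions)

// preferredLocations lists the hosts that hold each partition's data.
logs.partitions.foreach(p => println(logs.preferredLocations(p)))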

Question 40: How would you hint at a minimum number of partitions for a transformation?

Answer: You can request a minimum number of partitions by using the second input parameter that many of these operations accept.

scala> sc.parallelize(1 to 100, 2).count


The preferred way to set the number of partitions for an RDD is to pass it directly as the second input parameter in the call, like rdd = sc.textFile("hdfs://…/file.txt", 400), where 400 is the number of partitions. In this case, the partitioning into 400 splits is done by Hadoop's TextInputFormat, not Spark, and it works much faster. The code also spawns 400 concurrent tasks to load file.txt directly into 400 partitions.
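As a quick check in the Spark shell (the path here is a placeholder), note that the second argument is only a minimum; the actual count reported by getNumPartitions can be higher, depending on the input splits:

val rdd = sc.textFile("hdfs:///data/file.txt", 400)
println(rdd.getNumPartitions)  // typically at least 400, as decided by TextInputFormat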