Question 101: Which methods should be avoided, or used carefully, so that less data is shuffled across the partitions?
Answer: When choosing an arrangement of transformations, minimize the number of shuffles and the amount of data shuffled. Shuffles are expensive operations: all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles. Not all of these transformations are equal.
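As a quick illustration (a minimal sketch; the SparkContext value sc and the sample data are assumptions for illustration), toDebugString can be used to see where shuffle boundaries appear in an RDD lineage:

// Hypothetical pair RDD used only to demonstrate shuffle boundaries.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Each of these transformations can introduce a shuffle:
val repartitioned = pairs.repartition(8)        // full shuffle of all records
val grouped       = pairs.groupByKey()          // shuffles every value
val reduced       = pairs.reduceByKey(_ + _)    // shuffles only partial sums

// The lineage printout shows the ShuffledRDD stages.
println(reduced.toDebugString)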
Question 102: What is the difference between the two Spark steps below, and which one is better from a performance perspective?
- groupByKey().mapValues(_.sum)
- reduceByKey(_ + _)
Answer: Both operations produce the same result. However, with the first one, as discussed in the previous question, the entire dataset is transferred across the network. The second one computes locally: it sums the values for each key within each partition and combines those local sums into larger sums after shuffling.
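A minimal sketch contrasting the two approaches (sc and the sample word-count data are assumptions for illustration):

val wordCounts = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))

// groupByKey: every (key, value) pair is shuffled across the network,
// and the values are summed only after the shuffle.
val viaGroup = wordCounts.groupByKey().mapValues(_.sum)

// reduceByKey: values are summed within each partition first (map-side combine),
// so only one partial sum per key per partition is shuffled.
val viaReduce = wordCounts.reduceByKey(_ + _)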
Question 103: What performance issue will you observe with the code segment below, and what would you suggest as a better alternative?
rdd.map(kv => (kv._1, Set(kv._2))).reduceByKey(_ ++ _)
Answer: reduceByKey should be avoided when the input and output value types are different, as in the scenario above: a transformation that finds all the unique strings corresponding to each key. You could use map to transform each element into a Set and then combine the Sets with reduceByKey().
This results in unnecessary object creation because a new set must be allocated for each record. Instead, use aggregateByKey, which performs the map-side aggregation more efficiently:
val zero = collection.mutable.HashSet[String]()
rdd.aggregateByKey(zero)(
  (set, v) => set += v,
  (set1, set2) => set1 ++= set2)
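Because the zero value and both merge functions mutate the set in place, only one set is allocated per key per partition on the map side, instead of one per record.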
Question 104: If you have a small dataset that needs to be joined with another, bigger dataset, what approach will you use in this case?
Answer: Since one dataset is small and the other is very big, we should consider using a broadcast variable, which helps improve the overall performance. To avoid shuffles when joining two datasets, you can use broadcast variables. When one of the datasets is small enough to fit in memory on a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
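A minimal sketch of such a map-side (broadcast) join; the dataset names and contents are assumptions for illustration:

// Small dataset: collected to the driver as a hash table, then broadcast.
val countryNames = sc.parallelize(Seq(("US", "United States"), ("IN", "India")))
val lookup = sc.broadcast(countryNames.collectAsMap())

// Big dataset: joined without a shuffle, because each executor looks up
// the broadcast hash table locally inside a map transformation.
val bigRdd = sc.parallelize(Seq(("US", 100), ("IN", 200), ("FR", 300)))
val joined = bigRdd.map { case (code, amount) =>
  (code, lookup.value.get(code), amount)
}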
Question 105: When is it advantageous to have a shuffle?
Answer: When you are working with a huge volume of data, more processing power is available, and the application is compute intensive, shuffling lets the data be processed in parallel using all the available CPUs. Another use case is aggregation: if you have a huge volume of data and want to apply an aggregate function to it, the single thread of the driver program becomes a bottleneck. Instead, shuffle the data across the nodes and apply the aggregate function locally on each node, so the data is aggregated in parallel first and only the final aggregation is done in the driver program.
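A minimal sketch of the second case (the input path, field layout, and partition count of 200 are hypothetical assumptions): the shuffle in reduceByKey spreads the partial aggregation across many tasks, and only the small per-key totals reach the driver.

val counts = sc.textFile("hdfs:///data/events")      // hypothetical input path
  .map(line => (line.split(",")(0), 1L))              // key on the first field
  .reduceByKey(_ + _, 200)                            // aggregate in parallel over 200 partitions

// Only the final per-key totals, not the raw records, are returned to the driver.
val totals = counts.collect()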