Question 121: What is a Catalyst optimizer?

Answer: The Catalyst optimizer is a core part of Spark SQL and is written in Scala. Among other things, it helps Spark with

  • Schema inference from JSON data (see the short sketch at the end of this answer)

You can say that it helps Spark SQL generate a query plan, which can then be converted into a Directed Acyclic Graph (DAG) of RDDs. Once the DAG is created, it is ready to execute. The optimizer does a lot of work before producing the optimized query plan; its main purpose is to create an optimized DAG.

Both the initial query plans and the optimized query plans are internally represented as trees. The Catalyst optimizer also includes libraries that help in transforming these trees.
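As a minimal sketch of the schema inference mentioned above (assuming a hypothetical people.json file and an existing SparkSession named spark, as in spark-shell), Spark SQL can derive the schema of JSON data without it being declared up front:

// A minimal sketch, assuming a hypothetical people.json file and an existing
// SparkSession named spark (as provided by spark-shell).
val people = spark.read.json("people.json")   // hypothetical path
people.printSchema()                          // schema inferred automatically from the JSON data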

Question 122: What all types of plans can be created by the Catalyst optimizer?

Answer: The Catalyst optimizer can create the following two types of query plans:

  • Logical plan: It defines the computations on the datasets, without specifying how to carry out those computations.
  • Physical plan: It defines how the computations on the datasets will actually be executed to get the expected result. Generally, the optimizer generates multiple physical plans and then, using the cost-based optimizer, selects the least costly plan to execute the query.
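As a rough sketch (assuming an existing SparkSession named spark), the different plans produced for a query can be inspected through a DataFrame's queryExecution field:

// A minimal sketch, assuming an existing SparkSession named spark.
val df = spark.range(0, 100).selectExpr("sum(id)")
println(df.queryExecution.logical)        // parsed logical plan
println(df.queryExecution.optimizedPlan)  // logical plan after Catalyst optimizations
println(df.queryExecution.executedPlan)   // selected physical plan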

Question 123: How do you print all the plans created by the Catalyst optimizer for a query?

Answer: We have to use the explain(Boolean) method, with something similar to the line below:

dataframe1.join(dataframe2, "region").selectExpr("count(*)").explain(true)
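For context, a self-contained sketch of the same call (dataframe1 and dataframe2 above are placeholders; here two hypothetical DataFrames are built in-line) would look like this:

// A minimal sketch with hypothetical data; explain(true) prints the parsed,
// analyzed and optimized logical plans as well as the physical plan.
import spark.implicits._   // assuming an existing SparkSession named spark
val dataframe1 = Seq((1, "east"), (2, "west")).toDF("id", "region")
val dataframe2 = Seq(("east", "EMEA"), ("west", "AMER")).toDF("region", "zone")
dataframe1.join(dataframe2, "region").selectExpr("count(*)").explain(true)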

 

Question 124: What is Project Tungsten?

Answer: You can say that this is one of the largest changes ever made to Spark's execution engine. It focuses on improving CPU and memory efficiency, rather than I/O and network, because in Spark, CPU and memory had become the major performance bottlenecks.

Before Spark 2.0, many CPU cycles were wasted: rather than being used for computation, they were spent reading and writing intermediate data to the CPU cache.

Project Tungsten helps improve memory and CPU efficiency so that performance gets closer to the limits of the hardware.
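As a loose illustration (assuming Spark 2.x or later and an existing SparkSession named spark), one visible effect of the Tungsten work is whole-stage code generation, where operators fused into generated code are marked with an asterisk in the physical plan:

// A minimal sketch: operators fused by whole-stage code generation
// (part of the Tungsten effort) are prefixed with "*" in the printed physical plan.
spark.range(0, 1000000).selectExpr("sum(id)").explain()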

Question 125: Do you see any pain points with regard to Spark DStreams?

Answer: There are some common issues with Spark DStreams:

  • Timestamp: It considers the time at which an event entered the Spark system (processing time), rather than the timestamp attached to the event itself (event time); see the sketch after this list.
  • API: You have to write different code for batch and stream processing.
  • Failure conditions: The developer has to manage various failure conditions manually.
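
As a small sketch of the first pain point (assuming a hypothetical socket source on localhost:9999), the DStream window below is driven purely by processing time; any timestamp carried inside the events themselves is ignored:

// A minimal sketch, assuming events arrive as lines on a hypothetical socket source.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamWindowSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
// The 30-second window is based on when records arrived in Spark (processing time),
// not on any event-time field inside the records.
lines.window(Seconds(30), Seconds(10)).count().print()
ssc.start()
ssc.awaitTermination()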