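Question 6: What are case classes, and why do you use them in Spark?
Answer: Case classes are Scala's way of creating the equivalent of Java POJOs. The compiler implicitly generates an accessor method for each constructor parameter, along with equals, hashCode, toString and copy. The fields are immutable by default, so no setters are generated unless a parameter is declared as var. In Spark, case classes are used to assign a schema to Dataset/DataFrame objects: the field names and types become the column names and types.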
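As a minimal sketch of what the compiler generates, the plain-Scala example below needs no Spark dependency. Person, its fields and CaseClassDemo are hypothetical names chosen for illustration.

```scala
// Hypothetical example type: two immutable fields, name and age.
case class Person(name: String, age: Int)

object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    // No `new` needed: a companion apply() method is generated.
    val p = Person("Ann", 30)

    // An accessor is generated for every constructor parameter.
    println(p.name)                  // prints: Ann

    // Fields are vals, so there is no setter; copy() builds a changed instance.
    val older = p.copy(age = 31)
    println(older)                   // prints: Person(Ann,31)

    // equals/hashCode compare field values, not object identity.
    println(p == Person("Ann", 30))  // prints: true
  }
}
```

With Spark on the classpath and import spark.implicits._ in scope, Seq(Person("Ann", 30)).toDS() would give a Dataset[Person] whose columns are name and age.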
Question 7: What is the Catalyst optimizer?
Answer: It is a core part of Spark SQL and is written in Scala. It helps Spark with:
- Schema inference from JSON data
In other words, it helps Spark SQL generate a query plan that can easily be converted into a directed acyclic graph (DAG) of RDDs. Once the DAG is created, it is ready to execute. Catalyst does a lot of work before it produces the optimized query plan; its main purpose is to produce an optimized DAG.
Both the query plans and the optimized query plans are represented internally as trees. Catalyst includes various libraries that help in working with these trees.
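To illustrate what working on a tree means, here is a toy sketch in plain Scala. The classes below (Expr, Literal, Attribute, Add) and the ToyOptimizer object are hypothetical stand-ins, not Catalyst's real classes; the example only shows the general idea of a rule that rewrites one tree into a cheaper, equivalent tree.

```scala
// Toy expression tree. These are NOT Catalyst's real classes.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ToyOptimizer {
  // Apply `rule` bottom-up: rewrite the children first, then the node itself.
  // Nodes the rule does not match are returned unchanged.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren: Expr = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // Constant folding: adding two literals can be done once, at planning time.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  def main(args: Array[String]): Unit = {
    // Represents the expression x + (1 + 2)
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(transformUp(tree)(constantFolding)) // prints: Add(Attribute(x),Literal(3))
  }
}
```

The rule is a partial function over tree nodes, so it only has to describe the pattern it wants to rewrite; traversal is handled separately by transformUp.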
Question 8: What types of plans can be created by the Catalyst optimizer?
Answer: The Catalyst optimizer can create the following two kinds of query plans:
- Logical plan: It defines the computations on the Datasets without defining how to carry out the specific computations.
- Physical plan: It defines the computations on the Datasets in a form that can be executed to get the expected result. Generally, the optimizer generates multiple physical plans, and the cost-based optimizer then selects the least costly one to execute the query.
Question 9: How do you print all the plans created by the Catalyst optimizer for running a query?
Answer: Call the explain(Boolean) method on the DataFrame and pass true, for example:
dataframe1.join(dataframe2, "region").selectExpr("count(*)").explain(true)
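A self-contained version of this call might look like the sketch below. It assumes Spark SQL is on the classpath and runs in local mode; the object name ExplainDemo, the helper buildQuery, and the sample data and column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainDemo {
  // Builds the join-and-count query on two tiny, made-up DataFrames.
  def buildQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val sales   = Seq(("EU", 10), ("US", 20)).toDF("region", "amount")
    val regions = Seq(("EU", "Europe"), ("US", "United States")).toDF("region", "label")
    sales.join(regions, "region").selectExpr("count(*)")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: no cluster needed).
    val spark = SparkSession.builder()
      .appName("explain-demo")
      .master("local[*]")
      .getOrCreate()

    val query = buildQuery(spark)

    // Prints the parsed, analyzed and optimized logical plans and the physical plan.
    query.explain(true)

    // The same stages can also be inspected one at a time.
    println(query.queryExecution.logical)       // logical plan as parsed
    println(query.queryExecution.optimizedPlan) // logical plan after optimization
    println(query.queryExecution.executedPlan)  // physical plan that will run

    spark.stop()
  }
}
```

Calling explain() with no argument prints only the physical plan, so passing true is what exposes the logical plans.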
Question 10: What is Project Tungsten?
Answer: Project Tungsten is one of the largest changes made to Spark's execution engine. It focuses on improving CPU and memory efficiency rather than I/O and network, because CPU and memory had become the major performance bottlenecks in Spark.
Before Spark 2.0, most CPU cycles were wasted: rather than being used for computation, they were spent reading and writing intermediate data to the CPU cache.
Project Tungsten improves the efficiency of memory and CPU use, so that Spark can get closer to the limits of the hardware.