Question 41: What is the DataSet API?

Answer: The DataSet API was added in Spark 1.6. It tries to combine the benefits of RDDs (strong typing and the ability to use lambda functions) with the benefits of the SQL interface, and it uses the same underlying SQL engine. You create a DataSet from JVM objects, and once constructed you can apply functional transformations on it such as map, flatMap, and filter.

A DataSet is also a distributed collection. Remember that as of Spark 2.3.0 the DataSet API is only available for Scala and Java; it is not available for Python.
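As a minimal sketch of the idea (the Person case class and the sample data are made up for illustration), creating a DataSet from JVM objects and applying functional transformations looks like this in Scala:

    import org.apache.spark.sql.SparkSession

    // A strongly typed JVM object backing the Dataset
    case class Person(name: String, age: Int)

    object DatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DatasetExample")
          .master("local[*]") // local mode, assumed just for this sketch
          .getOrCreate()

        import spark.implicits._ // encoders needed to turn JVM objects into a Dataset

        // Create a Dataset from JVM objects, then apply map/filter transformations
        val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
        val adultNames = people.filter(_.age >= 18).map(_.name)

        adultNames.show()
        spark.stop()
      }
    }

Because Person is a case class, the compiler checks field names and types at compile time, which is the "strongly typed" benefit the answer refers to.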

Question 42: How are SparkContext, SQLContext, and HiveContext related?

Answer: SparkContext provides the entry point into the Spark system; to create a SQLContext you need a SparkContext object. HiveContext provides a superset of the functionality provided by the basic SQLContext. However, since Spark 2.0 there is a SparkSession object, which is the preferred entry point into the Spark system. SparkSession unifies all three: SparkContext, SQLContext, and HiveContext.
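A short sketch of the Spark 2.x style, assuming a local master purely for illustration, shows how the older entry points remain reachable through SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("UnifiedEntryPoint")
      .master("local[*]") // assumed for illustration
      .getOrCreate()

    // The older entry points are still reachable through SparkSession:
    val sc = spark.sparkContext          // the underlying SparkContext
    val rdd = sc.parallelize(1 to 5)     // RDD work via SparkContext

    // DataFrame/SQL work, formerly SQLContext's job
    val df = spark.range(5).toDF("id")
    df.createOrReplaceTempView("ids")
    spark.sql("SELECT id FROM ids WHERE id > 2").show()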

Question 43: Why would you prefer to use HiveContext?

Answer: HiveContext supports all the functionality provided by SQLContext, plus additional capabilities: you can write queries using HiveQL, you can use Hive UDFs, and you can read data directly from Hive tables.
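A minimal pre-2.0 sketch of this (the table name employees is hypothetical, and the spark-hive dependency is assumed to be on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(
      new SparkConf().setAppName("HiveContextExample").setMaster("local[*]"))

    // HiveContext was the pre-2.0 entry point for Hive features
    val hiveContext = new HiveContext(sc)

    // A HiveQL query against a (hypothetical) Hive table
    hiveContext.sql(
      "SELECT department, count(*) FROM employees GROUP BY department").show()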

Question 44: To use HiveContext, do you need a Hive setup?

Answer: No, you don't need a Hive setup in place. If there is no Hive setup, you can still use HiveContext just like a SQLContext. In fact, HiveContext was recommended over SQLContext.

Question 45: What are the advantages of using SparkSession?

Answer: SparkSession has built-in support for Hive queries, access to Hive UDFs, and the ability to read data from Hive tables. To use these features you don't need an existing Hive setup.
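A minimal sketch of enabling these features in Spark 2.x (local master and the demo table are assumed just for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSessionWithHive")
      .master("local[*]")  // assumed for illustration
      .enableHiveSupport() // turns on HiveQL, Hive UDFs, and Hive table access
      .getOrCreate()

    // Works even without an external Hive installation: Spark falls back to
    // an embedded metastore and a local warehouse directory by default.
    spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT)")
    spark.sql("SHOW TABLES").show()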