Question 116: What is the ZooKeeper service?

Answer: ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.

Question 117: How can you create a DataFrame object?

Answer: There are many ways to create a DataFrame; the sources below are the most commonly used (a code sketch follows the list).

  • Structured data files (CSV, TSV).
  • Hadoop Hive tables.
  • RDBMS tables and the output of SQL queries (via JDBC).
  • An existing RDD.
  • Avro and Parquet data files are also supported.
  • You can also plug in your own custom data-source format.
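As a minimal sketch, the Scala snippet below shows each of these sources in turn. The file names, table name, and JDBC connection details are hypothetical placeholders, not values from this document.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("DataFrameSources")
    .enableHiveSupport()          // needed to read Hive tables
    .getOrCreate()
  import spark.implicits._

  // 1. Structured data file (CSV with a header row)
  val fromCsv = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("people.csv")

  // 2. Hadoop Hive table
  val fromHive = spark.table("default.people")

  // 3. RDBMS table over JDBC
  val fromJdbc = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "people")
    .option("user", "spark")
    .option("password", "secret")
    .load()

  // 4. An existing RDD of tuples, with column names supplied explicitly
  val fromRdd = spark.sparkContext
    .parallelize(Seq(("Alice", 30), ("Bob", 25)))
    .toDF("name", "age")

  // 5. Parquet data file
  val fromParquet = spark.read.parquet("people.parquet")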

Question 118: Please describe the Spark DataSet API.

Answer: The DataSet API was added in Spark 1.6. A DataSet provides the benefits of both RDDs and the Spark SQL optimizer: you can create a DataSet from JVM objects and apply functional transformations to it using functions such as map and filter.

A DataSet is a collection of strongly-typed objects whose structure is defined by a user-defined case class.
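A minimal sketch in Scala, assuming a hypothetical Person case class:

  import org.apache.spark.sql.SparkSession

  // Hypothetical case class; its fields define the DataSet's structure.
  case class Person(name: String, age: Int)

  val spark = SparkSession.builder().appName("DataSetDemo").getOrCreate()
  import spark.implicits._        // brings in the encoders for case classes

  // Create a strongly-typed DataSet from in-memory objects
  val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

  // Functional transformations keep the static Person type
  val adults = people
    .filter(_.age >= 18)
    .map(p => p.copy(name = p.name.toUpperCase))

  adults.show()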

Question 119: Can you explain the difference between DataFrame and DataSet?

Answer: As we saw previously, a DataFrame can be seen as a DataSet[Row], where Row is a generic untyped object, whereas a DataSet is a collection of strongly-typed objects specified by a user-defined case class.

Because a DataFrame holds untyped Row objects, only syntax errors can be caught at compile time; a type mismatch (for example, a misspelled column name) surfaces only at run time.

Because a DataSet holds strongly-typed objects, both syntax errors and type mismatches are caught at compile time. (If you know Java generics, the concept is easy to understand.)
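A small sketch of the contrast, reusing the hypothetical Person DataSet from the previous answer (the failing lines are commented out so the snippet still compiles):

  // A DataFrame is just a DataSet[Row]
  val df = people.toDF()

  // DataFrame: the column name is a plain string, so a typo compiles
  // fine but fails at run time with an AnalysisException.
  // df.select("agee")

  // DataSet: the field is resolved by the Scala compiler, so the same
  // typo is rejected at compile time.
  // people.map(_.agee)           // does not compile

  people.map(_.age + 1)           // type-checked at compile time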

Question 120: What are case classes, and why are they used in Spark?

Answer: Case classes are the Scala way of creating Java-style POJOs. The compiler implicitly generates accessor methods for each constructor parameter (and mutators for parameters declared as var). In Spark, case classes are used to assign a schema to DataSet/DataFrame objects.
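A minimal sketch, using a hypothetical Employee case class, of how case-class fields become the schema:

  import org.apache.spark.sql.SparkSession

  // Hypothetical case class; each field maps to a column.
  case class Employee(id: Long, name: String, salary: Double)

  val spark = SparkSession.builder().appName("CaseClassSchema").getOrCreate()
  import spark.implicits._

  val ds = Seq(Employee(1L, "Alice", 90000.0)).toDS()
  ds.printSchema()
  // root
  //  |-- id: long (nullable = false)
  //  |-- name: string (nullable = true)
  //  |-- salary: double (nullable = false)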