Question 3: How can you create a DataFrame object?
Answer: There are many ways to create a DataFrame, but the following sources are the most commonly used.
- Structured data files (CSV, TSV).
- Hadoop Hive tables.
- RDBMS tables and the output of SQL queries.
- An existing RDD.
- Avro and Parquet data files.
- Custom formats, which you can add by extending Spark's data source support.
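The sources listed above can be sketched in Scala roughly as follows. This is a minimal illustration, not a definitive recipe: it assumes Spark 2.x or later (where `SparkSession` is the entry point) running in local mode, and the `Person` case class, the object name, the temporary directory and the view name `people` are all hypothetical names chosen for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CreateDataFrames {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Int)

  // From an RDD of case-class instances: column names come from the fields.
  def fromRdd(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .parallelize(Seq(Person("Ann", 31), Person("Bob", 17)))
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-dataframes")
      .master("local[*]") // assumption: local mode, for demonstration only
      .getOrCreate()

    val people = fromRdd(spark)
    val dir = Files.createTempDirectory("df-demo").toString

    // Structured text file: write a CSV copy, then read it back.
    people.write.option("header", "true").csv(s"$dir/people_csv")
    val fromCsv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(s"$dir/people_csv")

    // Parquet file: the schema is stored with the data.
    people.write.parquet(s"$dir/people_parquet")
    val fromParquet = spark.read.parquet(s"$dir/people_parquet")

    // SQL query output: here against a temporary view. With Hive support
    // enabled, a Hive table can be queried or loaded by name in the same way.
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")

    fromCsv.printSchema()
    fromParquet.show()
    adults.show()
    spark.stop()
  }
}
```

An RDBMS table would typically be loaded through `spark.read.jdbc(...)` with a JDBC URL, table name and connection properties; it is left out here because it needs a running database and a driver on the classpath.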
Question 4: Can you describe the Spark DataSet?
Answer: The DataSet API was added in Spark 1.6. A DataSet provides the benefits of both RDDs (strong typing and the ability to use lambda functions) and the Spark SQL optimizer. You can create a DataSet from JVM objects and then apply functional transformations to it using functions such as map and filter.
A DataSet is a collection of strongly-typed objects, whose type is defined using a user-defined case class (in Scala).
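As a rough sketch of the description above, the following Scala snippet builds a DataSet from case-class instances and applies `filter` and `map` to it. It assumes Spark 2.x or later in local mode; `Person`, `DatasetBasics` and `adultNames` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetBasics {
  // Hypothetical user-defined case class; its fields define the schema.
  case class Person(name: String, age: Int)

  // Builds a typed Dataset from JVM objects and applies functional transformations.
  def adultNames(spark: SparkSession): Dataset[String] = {
    import spark.implicits._ // provides encoders and the toDS() conversion
    val people: Dataset[Person] = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()
    people
      .filter((p: Person) => p.age >= 18)      // typed predicate on Person
      .map((p: Person) => p.name.toUpperCase)  // result is a Dataset[String]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-basics")
      .master("local[*]") // assumption: local mode, for demonstration only
      .getOrCreate()
    adultNames(spark).show()
    spark.stop()
  }
}
```

The lambdas operate on `Person` objects rather than on column names, which is what lets the compiler check field names and types.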
Question 5: Can you explain the difference between a DataFrame and a DataSet?
Answer: As we have seen previously, a DataFrame can be viewed as a DataSet[Row], where Row is a generic untyped object, whereas a DataSet is a collection of strongly-typed objects specified using user-defined case classes.
Because a DataFrame holds untyped objects, only syntax errors can be caught at compile time. A type mismatch, such as using a column as the wrong type or referring to a column that does not exist, can only be caught at run time.
Because a DataSet holds strongly-typed objects, both syntax errors and type mismatches can be caught at compile time. (If you know Java generics, this concept is easy to understand.)
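The contrast can be illustrated with a small Scala sketch. It assumes Spark 2.x or later in local mode, and `Person`, `TypedVsUntyped` and the deliberately misspelled column `agee` are hypothetical names used only here. With a DataFrame, the misspelled name is just a string, so the code compiles and Spark rejects it only when the program runs; with a DataSet, the equivalent mistake is a compile error, so it appears below only as a comment.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Dataset, SparkSession}

object TypedVsUntyped {
  // Hypothetical user-defined case class for the typed Dataset.
  case class Person(name: String, age: Int)

  // Untyped: "agee" is only a string, so this compiles. The error surfaces
  // when the program runs and Spark cannot resolve the column.
  def misspelledColumnFailsAtRuntime(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df: DataFrame = Seq(Person("Ann", 31)).toDS().toDF()
    try {
      df.select("agee").collect()
      false
    } catch {
      case _: AnalysisException => true
    }
  }

  // Typed: field access is checked by the Scala compiler.
  def agesNextYear(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._
    val ds: Dataset[Person] = Seq(Person("Ann", 31)).toDS()
    // ds.map((p: Person) => p.agee + 1)  // would not compile: agee is not a member of Person
    ds.map((p: Person) => p.age + 1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]") // assumption: local mode, for demonstration only
      .getOrCreate()
    println(misspelledColumnFailsAtRuntime(spark))
    agesNextYear(spark).show()
    spark.stop()
  }
}
```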
 
											