Question: I was going through the training material from HadoopExam.com and got a little confused about these three things: DataSet, DataFrame and RDD. Can you please explain the difference among these three objects?

 

Answer: The RDD is the core of the Spark framework, whether you are using Spark 1.x or Spark 2.x. Keep in mind that whether you use a DataFrame or a DataSet, it is eventually converted into RDD operations by the Spark framework (this is done by the Catalyst optimizer, which was introduced with Spark SQL in the Spark 1.x line and is central to Spark 2.x).
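
As a quick illustration of this point (assuming spark-shell, where a SparkSession named spark is already available), you can inspect both the plan Catalyst produces and the underlying RDD of a DataFrame:

// Assuming spark-shell: `spark` is the pre-created SparkSession.
val df = spark.range(0, 10).toDF("id").filter("id % 2 = 0")

df.explain()        // prints the physical plan produced by Catalyst
val rdd = df.rdd    // the RDD[Row] the DataFrame is ultimately executed as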

Now let's learn the difference between DataFrame and DataSet.

 

Do you know the basic difference between Python and Scala? Python is a dynamically typed language and Scala is a statically typed language.

 

  • Dynamic typing: the data type of a variable is worked out by the language itself, and you do not have to explicitly mention the type of the variable. This is the case in Python. 
  • Static typing: here you have to specify the type of each variable, e.g. whether it is a String object or an Integer object. This is the case in Scala (a small illustration follows below).
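
A tiny sketch of the distinction, in Scala (the Python side is described in the comments):

// Scala (static typing): the type of each value is fixed and checked at compile time.
val age: Int = 30
// val wrong: Int = "thirty"   // compile-time error: type mismatch

// In Python (dynamic typing) the equivalent assignment, age = "thirty",
// is accepted, and any type problem only shows up at run time.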

 

In the case of DataFrame and DataSet:

DataFrame = Dataset[Row]

That means a DataFrame is just a DataSet, but one whose elements are the generic Row object, i.e. it is equivalent to a collection of Row objects. 
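
In the Scala API this is literally a type alias (type DataFrame = Dataset[Row]). A minimal sketch of the relationship, assuming spark-shell (the Person case class and the sample data are hypothetical):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Hypothetical case class used only for this illustration.
case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("Amit", 30), Person("Rita", 25)).toDS()

// A DataFrame is simply a Dataset whose element type is the generic Row:
val df: DataFrame = ds.toDF()                      // Dataset[Person] -> Dataset[Row]
val typedAgain: Dataset[Person] = df.as[Person]    // and back to the typed view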

 

DataFrame vs. RDD

You can visualize a DataFrame as a table of rows and columns with additional data type and other metadata information. This metadata helps the Catalyst optimizer convert the DataFrame into RDD operations efficiently. Since Spark 2.x the guidance is clear: if your data is in a structured format, use the DataFrame and not the RDD, because DataFrames are much more efficient than RDDs.

DataFrame = Data (In Row and Column Format) + Data Types + MetaData

An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized in the same way, because the operations that can be performed against it are not as constrained.

We can convert an RDD to a DataFrame, or vice versa, if required, as shown in the sketch below. 
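
A small sketch of both directions (assuming spark-shell, where spark and sc are predefined; the sample data is made up):

import spark.implicits._

// RDD -> DataFrame: once column names and types are attached, Catalyst can optimize.
val rdd = sc.parallelize(Seq(("Amit", 30), ("Rita", 25)))
val df  = rdd.toDF("name", "age")
df.printSchema()      // the schema/metadata that the plain RDD does not carry

// DataFrame -> RDD: back to a schema-less collection of Row objects.
val backToRdd = df.rdd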

Apache's definition of an RDD: RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
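
The three properties in that definition map directly onto the RDD API. A rough sketch (assuming spark-shell, where sc is the predefined SparkContext):

import org.apache.spark.storage.StorageLevel

// Explicit partitioning: control how the data is split across the cluster.
val nums = sc.parallelize(1 to 1000000, numSlices = 8)

// Rich operator set: map, filter, reduce, join, ...
val squares = nums.map(n => n.toLong * n)

// Explicit persistence of an intermediate result in memory.
squares.persist(StorageLevel.MEMORY_ONLY)

val evenSum = squares.filter(_ % 2 == 0).reduce(_ + _)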

To learn the RDD API and for hands-on sessions, we recommend the Spark Core training on http://HadoopExam.com

 

DataFrame:

Similar to an RDD, it is also a distributed and immutable collection of data. You can imagine a DataFrame as an RDBMS table with column names and rows, but DataFrame rows are divided and saved across the various machines in the Spark cluster, as shown in the image below.

Figure 19: Partitioned DataFrame object across cluster nodes

  • DataFrame helps in writing SparkSQL code using a simpler API, and it is very similar to Python and R DataFrames.
  • DataFrame is a higher-level abstraction over RDD.
  • DataFrame represents a Dataset of the generic Row object, so Dataset and DataFrame are related as follows.

DataFrame == Dataset[Row]

 

Here Row is a generic object and does not have type information attached to it.

Whenever you work with a Dataset or a DataFrame, you are working with rows of data. In the case of a DataFrame they are generic Row objects, and in the case of a Dataset they are typed objects.
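
To make the Row vs. typed-object distinction concrete, here is a small sketch (Person is again a hypothetical case class; spark is assumed from spark-shell):

import org.apache.spark.sql.Row
import spark.implicits._

case class Person(name: String, age: Int)

// DataFrame: rows are generic Row objects, fields are looked up at run time.
val df = Seq(Person("Amit", 30), Person("Rita", 25)).toDF()
val firstRow: Row = df.head()
val nameFromRow = firstRow.getAs[String]("name")

// Dataset[Person]: rows are typed objects, fields are checked at compile time.
val ds = df.as[Person]
val firstPerson: Person = ds.head()
val nameFromPerson = firstPerson.name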

You can apply schema information to a DataFrame object as well. To work with a DataFrame you have the following two approaches.

  • SQL queries
  • Query DSL (its syntax can be checked at compile time; both approaches are illustrated below)
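
A brief sketch of both approaches against the same hypothetical DataFrame (assuming spark-shell):

import spark.implicits._

val df = Seq(("Amit", 30), ("Rita", 25)).toDF("name", "age")

// 1. SQL queries: the query is a plain string, so errors in it surface only at run time.
df.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name FROM people WHERE age > 26")

// 2. Query DSL: expressed as method calls, so the syntax is checked at compile time.
val viaDsl = df.select($"name").where($"age" > 26)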

Programmatically assigning a schema: you will use this approach when

  • the schema needs to be created dynamically based on some condition or requirement, or
  • the total number of fields is more than 22 (the case class field limit in older Scala versions, which rules out schema inference through case classes).

For creating a schema programmatically, we have to use the following Spark classes, specific to schema definition:

  • StructType
  • StructField

where a StructType is a sequence of StructFields. It can be done as below:

val heDF = spark.read.format("csv").schema(customSchema).load("csv file path").toDF("columnName1", "columnName2")
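
For completeness, here is a sketch of how such a customSchema could be built with StructType and StructField; the field names and types are hypothetical and should be adjusted to the actual CSV file:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical two-column schema; adjust names and types to match your file.
val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

The resulting StructType is then passed to .schema(customSchema) as in the line above; the trailing .toDF(...) is only needed if you want to rename the columns afterwards.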