Question 26: How do you save data from an RDD to a text file?

Answer: Use the RDD method saveAsTextFile("destination_path"). Similarly, various other methods are available for other file formats.
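
For example, a minimal sketch in Scala, assuming an existing SparkContext named sc; "/tmp/he_output" is a hypothetical output directory and must not already exist:

// A minimal sketch: save a small RDD as plain text.
val rdd = sc.parallelize(Seq("line1", "line2", "line3"))
rdd.saveAsTextFile("/tmp/he_output")   // one part file is written per partition

// Related methods exist for other formats, e.g. saveAsObjectFile,
// or saveAsSequenceFile for key-value RDDs.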

Question 27: What is Spark DataFrame and what are its basic properties?

Answer: A Spark DataFrame can be visualized as a table in a relational database. It has the following features:

  • It is distributed across the Spark cluster nodes.
  • Its data is organized into columns.
  • It is immutable (to modify it, you create a new DataFrame).
  • It is processed in-memory.
  • You can apply a schema to the data.
  • It offers a Domain Specific Language (DSL) for transformations.
  • It is evaluated lazily.

In one line: a DataFrame is an immutable, distributed collection of data organized into named columns. DataFrames help you avoid much of the RDD API's complexity.
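
For illustration, a minimal Scala sketch with hypothetical sample data, assuming spark is an existing SparkSession, showing the schema, immutability, and lazy evaluation in action:

// A minimal sketch, assuming an existing SparkSession named spark.
import spark.implicits._

val df = Seq(("Amit", 30), ("Neha", 25)).toDF("name", "age")

df.printSchema()                      // the schema applied to the columnar data
val adults = df.filter($"age" >= 28)  // returns a new DataFrame; df itself is unchanged
adults.show()                         // transformations are lazy; show() triggers evaluation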

Question 28: What are the main differences between DataSet and DataFrame?

Answer: The DataSet API was introduced in Spark 1.6 as an API separate from DataFrame, and the two were unified in Spark 2.0 (a DataFrame is now simply a Dataset[Row]). The DataSet API is type-safe and works with strongly typed objects, so it can operate on compiled lambda functions. A DataFrame holds untyped Row objects: syntax errors are caught at compile time, but a type mismatch (for example, referencing a wrong column) is caught only at run time.

A DataSet, because it works with strongly typed objects, catches both syntax errors and type mismatches at compile time. (If you know Java Generics, the concept is easy to understand.)
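
A small Scala sketch of the difference, assuming an existing SparkSession named spark and a hypothetical Person case class:

// A minimal sketch, assuming an existing SparkSession named spark.
import spark.implicits._

case class Person(name: String, age: Int)

// DataFrame: untyped Row objects; a wrong column name or type
// in an expression fails only at run time.
val df = Seq(Person("Amit", 30), Person("Neha", 25)).toDF()
df.filter($"age" > 28).show()

// DataSet: strongly typed; the lambda works directly on Person objects,
// so a wrong field name or a type mismatch is caught at compile time.
val ds = Seq(Person("Amit", 30), Person("Neha", 25)).toDS()
ds.filter(p => p.age > 28).show()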

Question 29: How do you read a JSON file in Spark?

Answer: JSON is semi-structured data, and Spark provides an easy way to read it:

spark.read.json("hadoopexam.json")
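
For a bit more context, a minimal sketch assuming a SparkSession named spark and that hadoopexam.json contains one JSON object per line (Spark's default expectation):

// A minimal sketch, assuming an existing SparkSession named spark.
val jsonDF = spark.read.json("hadoopexam.json")

jsonDF.printSchema()   // Spark infers the schema from the JSON data
jsonDF.show()

// If the file is a single multi-line JSON document, enable the multiLine
// option (available in newer Spark versions):
val multiDF = spark.read.option("multiLine", true).json("hadoopexam.json")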

 

Question 30: What is a SequenceFile?

Answer: The best place to learn all the Hadoop file formats is the http://hadoopexam.com big data on-demand training; go and subscribe now.

SequenceFiles also store key-value pairs, but with the keys and values in a binary format, and they are one of the most widely used Hadoop file formats. Spark provides an API to use this format conveniently. A SequenceFile contains a header and sync markers; a sync marker helps a reader synchronize to a record boundary from any position in the file. You can enable two types of compression on a SequenceFile: record-level or block-level. More recently, the Parquet file format has become more popular.
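
A minimal Scala sketch of writing and reading a SequenceFile with Spark, assuming an existing SparkContext named sc and a hypothetical output path:

// A minimal sketch, assuming an existing SparkContext named sc.
// "/tmp/he_seq" is a hypothetical output directory.
val pairs = sc.parallelize(Seq(("course1", 100), ("course2", 200)))

// Keys and values are stored in binary (Writable) form.
pairs.saveAsSequenceFile("/tmp/he_seq")

// Read it back; the type parameters tell Spark the key and value types.
val readBack = sc.sequenceFile[String, Int]("/tmp/he_seq")
readBack.collect().foreach(println)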