Question 46: How are DataFrame and Dataset related?

Answer: You can think of a DataFrame as a Dataset organized into named columns. In Scala and Java, a DataFrame is represented as a Dataset of generic Row objects, where Row is an untyped object.

Scala: Dataset[Row]

Java: Dataset<Row>
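
You can see this relationship directly in the Scala API, where DataFrame is defined as a type alias for Dataset[Row]. A minimal sketch, assuming a SparkSession named spark and a hypothetical people.json input file:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Read structured data into a DataFrame (people.json is a hypothetical file)
val df: DataFrame = spark.read.json("people.json")

// This assignment compiles because DataFrame is a type alias for Dataset[Row]
val ds: Dataset[Row] = df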

 

Question 47: For which use cases is Apache Spark a good fit?

Answer: Spark is a good fit for both interactive analysis and batch processing.

Question 48: Can you describe which Spark projects you have used?

Answer: Besides Spark Core, Spark includes several other projects:

  • Spark SQL: This project helps you work with structured data; you can mix SQL queries with the Spark programming API to get your expected results (see the sketch after this list).
  • Spark Streaming: It is good for processing streaming data and helps you create fault-tolerant streaming applications.
  • MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write MLlib applications.
  • GraphX: An API for graphs and graph-parallel computation.
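
For example, here is a minimal Spark SQL sketch, assuming a SparkSession named spark and a hypothetical people.json file, that mixes a SQL query with the DataFrame API:

// Load structured data and register it as a temporary view
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Run a SQL query, then continue with the programmatic DataFrame API
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.groupBy("age").count().show()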

Question 49: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?

Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers.

Similarly, when you run on Spark standalone, the application processes are managed by the Spark Master and Worker nodes. In both cases the application code is the same; only the master URL differs, as shown below.
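
A minimal Scala sketch of the difference, with a hypothetical application name and host name; the processing logic would be identical in both cases:

import org.apache.spark.{SparkConf, SparkContext}

// On YARN, executors run in containers managed by the
// ResourceManager and NodeManagers
val yarnConf = new SparkConf().setAppName("MyApp").setMaster("yarn")

// On standalone, the Spark Master and Workers manage the processes
// (spark://master-host:7077 is a hypothetical Master URL)
val standaloneConf = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")

val sc = new SparkContext(yarnConf)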

Question 50: Can you write a simple word count application using Apache Spark, in either Scala or Python, based on your hands-on experience?

Answer: Interviewers sometimes ask you to write a very simple Spark application to check whether a candidate has actually worked with Spark. Writing perfectly correct syntax is not mandatory.

Example in Scala

// Read the input text file from HDFS
val hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

// Split each line into words, pair each word with 1, and sum the counts per word
val counts = hadoopExamData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Write the word counts back to HDFS
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")

 

Example in Python

# Read the input text file from HDFS
hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

# Split each line into words, pair each word with 1, and sum the counts per word
counts = hadoopExamData.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

# Write the word counts back to HDFS
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")