Question 46: How are DataFrame and Dataset related?

Answer: You can think of a DataFrame as a Dataset organized into named columns. In Scala and Java, a DataFrame is represented as a Dataset of generic Row objects, where Row is an untyped object.

Scala: Dataset[Row]

Java: Dataset<Row>
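
You can see this relationship directly in the Scala API, where DataFrame is defined as a type alias for Dataset[Row]. A minimal sketch, assuming a SparkSession named spark and a hypothetical people.json input file:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Read structured data into a DataFrame (people.json is a hypothetical file)
val df: DataFrame = spark.read.json("people.json")

// This assignment compiles because DataFrame is a type alias for Dataset[Row]
val ds: Dataset[Row] = df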

 

Question 47: For which use cases is Apache Spark a good fit?

Answer: Spark is a good fit for both interactive analysis and batch processing.

Question 48: Can you describe which Spark projects you have used?

Answer: Besides Spark Core, Spark includes several other projects:

  • Spark SQL: This project helps you work with structured data; you can mix SQL queries with the Spark programming API to get your expected results (see the sketch after this list).
  • Spark Streaming: It is good for processing streaming data and helps you create fault-tolerant streaming applications.
  • MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write MLlib applications.
  • GraphX: An API for graphs and graph-parallel computation.
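
For example, here is a minimal Spark SQL sketch, assuming a SparkSession named spark and a hypothetical people.json file, that mixes a SQL query with the DataFrame API:

// Load structured data and register it as a temporary view
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Run a SQL query, then continue with the programmatic DataFrame API
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.groupBy("age").count().show()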

Question 49: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?

Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers.

Similarly, when you run on Spark standalone, the application processes are managed by the Spark Master and Worker nodes. In both cases the application code is the same; only the master URL differs, as shown below.
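
A minimal Scala sketch of the difference, with a hypothetical application name and host name; the processing logic would be identical in both cases:

import org.apache.spark.{SparkConf, SparkContext}

// On YARN, executors run in containers managed by the
// ResourceManager and NodeManagers
val yarnConf = new SparkConf().setAppName("MyApp").setMaster("yarn")

// On standalone, the Spark Master and Workers manage the processes
// (spark://master-host:7077 is a hypothetical Master URL)
val standaloneConf = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")

val sc = new SparkContext(yarnConf)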

Question 50: Can you write a simple word count application using Apache Spark, in either Scala or Python, based on your hands-on experience?

Answer: Interviewers sometimes ask you to write a very simple Spark application to check whether a candidate has actually worked with Spark. Writing perfectly correct syntax is not mandatory.

Example in Scala

// Read the input text file from HDFS
val hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

// Split each line into words, pair each word with 1, and sum the counts per word
val counts = hadoopExamData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Write the word counts back to HDFS
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")

 

Example in Python

# Read the input text file from HDFS
hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

# Split each line into words, pair each word with 1, and sum the counts per word
counts = hadoopExamData.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

# Write the word counts back to HDFS
counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")