Question 46: Can you please explain the Dataset in more detail?

Answer: It is best to understand this with an example, because it is an important concept you need to learn.

I will use a Scala example to explain it.

Data in the HadoopExam.json file:

 

{"course_id": 100, "course_name": "Spark Training", "Fee": "6000", "Location": "New York"}

{"course_id": 101, "course_name": "Hadoop Training", "Fee": "7000", "Location": "Mumbai"}

{"course_id": 102, "course_name": "NiFi Training", "Fee": "8000", "Location": "Pune"}

 

Create a case class that represents each row of the above data:

case class HadoopExam(course_id: Long, course_name: String, Fee: String, Location: String)

 

Read the JSON file and create the Dataset, using the case class HadoopExam.

The Dataset is now a collection of HadoopExam Scala JVM objects:

import spark.implicits._

val dataset = spark.read.json("HadoopExam.json").as[HadoopExam]

 

Under the hood, three things happen in the code above:

  • Spark reads the JSON file, infers the schema, and creates a DataFrame.
  • At this point the data is a DataFrame = Dataset[Row], a collection of generic Row objects, since Spark does not yet know the exact type.
  • Spark then converts the Dataset[Row] into a Dataset[HadoopExam] of type-specific Scala JVM objects, as dictated by the case class HadoopExam.

With a Dataset as a collection of typed Dataset[ElementType] objects, you seamlessly get both compile-time type safety and a custom view of strongly-typed JVM objects. The resulting strongly-typed Dataset[T] from the code above can easily be displayed or processed with high-level methods, as in the sketch below.
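For example, here is a minimal sketch of working with the typed Dataset, assuming the dataset value created above (the filter condition and projected field are only illustrations):

// Typed filter: fields of the HadoopExam case class are checked at compile time
val sparkCourses = dataset.filter(course => course.course_name.contains("Spark"))

sparkCourses.show()

// Typed map: project a single field into a Dataset[String]
val courseNames = dataset.map(course => course.course_name)

courseNames.show()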

Question 47: In which use cases does Apache Spark fit well?

Answer: Spark fits well for both interactive and batch processing.

Question 48: Can you describe which Spark projects you have used?

Answer: Spark has several projects other than Spark Core, as described below:

  • Spark SQL: This project helps you work with structured data; you can mix SQL queries and the Spark programming API to get your expected results (a small mixed example is sketched after this list).
  • Spark Streaming: It is good for processing streaming data and helps you create fault-tolerant streaming applications. Spark Streaming has since been improved, and the newer Structured Streaming, which uses the Spark SQL engine, was created; you will find more detail in a later question.
  • MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write Spark machine learning programs.
  • GraphX: An API for graphs and graph-parallel computations.
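As referenced in the Spark SQL item above, here is a minimal sketch of mixing a SQL query with the DataFrame API; the file name, view name, and column names are placeholders used only for illustration:

// Read structured data with the programming API (placeholder file name)
val courses = spark.read.json("courses.json")

// Register a temporary view so it can be queried with SQL (placeholder view name)
courses.createOrReplaceTempView("courses")

// Mix a SQL query with further DataFrame API calls
val expensive = spark.sql("SELECT course_name, Fee FROM courses WHERE Fee > 6000")

expensive.filter("course_name IS NOT NULL").show()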

Question 49: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?

Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers.

Similarly, when you run on the Spark standalone cluster manager, the application processes are managed by the Spark Master and Worker nodes.
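The difference shows up mainly in how you submit the application; here is a minimal sketch of the two spark-submit invocations, where the host name, class name, and jar name are placeholders for illustration:

# Submit to YARN (processes managed by the YARN ResourceManager and NodeManagers)
spark-submit --master yarn --deploy-mode cluster --class com.example.WordCount myapp.jar

# Submit to a standalone cluster (processes managed by the Spark Master and Workers)
spark-submit --master spark://master-host:7077 --class com.example.WordCount myapp.jar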

Question 50: Can you write a simple word count application using Apache Spark in either Scala or Python, since you have hands-on experience?

Answer: You are sometimes asked to write a very simple Spark application to check whether you have actually worked with Spark. It is not mandatory that the syntax be perfectly correct.

Example in Scala:

val hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

val counts = hadoopExamData.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")

 

Example in Python:

hadoopExamData = sc.textFile("hdfs://hadoopexam:8020/quicktechie.txt")

counts = hadoopExamData.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

counts.saveAsTextFile("hdfs://hadoopexam:8020/output.txt")