Question 1: What data sources can Spark process?

Answer:

  • Hadoop Distributed File System (HDFS)
  • Cassandra (NoSQL database)
  • HBase (NoSQL database)
  • Amazon S3 (AWS cloud storage)
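
All of these are exposed through Spark's unified DataFrame reader API. Below is a minimal sketch in Scala, assuming hypothetical HDFS and S3 paths (and, for S3, the hadoop-aws connector on the classpath):

    import org.apache.spark.sql.SparkSession

    object DataSourcesExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DataSourcesExample")
          .getOrCreate()

        // HDFS: namenode host/port and path are placeholders
        val hdfsDf = spark.read.text("hdfs://namenode:8020/data/events.log")

        // Amazon S3: bucket name is a placeholder; needs the hadoop-aws connector
        val s3Df = spark.read.json("s3a://my-bucket/events/")

        hdfsDf.show(5)
        s3Df.printSchema()
        spark.stop()
      }
    }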

 

Question 2: What is the Apache Parquet format?

Answer: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Because data is stored column by column, it compresses well and lets queries read only the columns they need.
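
A minimal sketch of writing and reading Parquet from Spark; the path and column names are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    object ParquetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ParquetExample").getOrCreate()
        import spark.implicits._

        val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")

        // Parquet stores data column by column, so queries that touch only
        // a few columns scan far less data than with row-oriented formats
        df.write.mode("overwrite").parquet("/tmp/people.parquet")

        val people = spark.read.parquet("/tmp/people.parquet")
        people.select("name").show() // only the 'name' column pages are read

        spark.stop()
      }
    }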

 

Question 3: What is Apache Spark Streaming?

Answer: Spark Streaming processes live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
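
A minimal sketch of the classic streaming word count, assuming a text source on a local TCP socket (host and port are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingExample")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

        // Ingest lines from a socket; Kafka, Flume, Kinesis, etc. plug in similarly
        val lines = ssc.socketTextStream("localhost", 9999)

        // High-level functions: map to (word, 1) pairs, then reduce by key
        val counts = lines.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }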

 

 


Question 4: If your existing queries were created using Hive, what changes do you need to make to run them on Spark?

Answer: The Spark SQL module can run queries on Spark, and it can also use the Hive metastore, so no changes are needed to run Hive queries through Spark SQL; it can even use UDFs defined in Hive.
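
A minimal sketch of running an existing Hive query through Spark SQL; the table name is an assumption, and Hive support must be enabled on the session:

    import org.apache.spark.sql.SparkSession

    object HiveExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveExample")
          .enableHiveSupport() // connect to the Hive metastore; Hive UDFs become usable
          .getOrCreate()

        // The same HiveQL query, now executed by Spark SQL
        spark.sql("SELECT department, COUNT(*) FROM employees GROUP BY department").show()

        spark.stop()
      }
    }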

Question 5: What is the Driver Program in Spark?

Answer: The driver program is the main program of a Spark application (in the case of the REPL, the Spark shell itself is the driver program). The driver program creates the SparkSession or SparkContext instance, and it is the driver's responsibility to communicate with the cluster manager to distribute tasks to the cluster's worker nodes.
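
A minimal sketch of a driver program: the main method below runs in the driver, which creates the SparkSession and hands tasks to the cluster, while the reduce work itself executes on the worker nodes:

    import org.apache.spark.sql.SparkSession

    object DriverExample {
      def main(args: Array[String]): Unit = { // this method runs in the driver
        val spark = SparkSession.builder().appName("DriverExample").getOrCreate()
        val sc = spark.sparkContext // the SparkContext created by the driver

        // The driver builds the job; the cluster manager schedules its tasks
        // on worker nodes, where this reduce actually runs
        val sum = sc.parallelize(1 to 100).reduce(_ + _)
        println(s"Sum computed on the cluster: $sum")

        spark.stop()
      }
    }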

Question 6: What is the role of the cluster manager, and which cluster managers does Spark support?

Answer: The cluster manager manages the cluster's worker nodes and communicates with the driver node. Spark supports the three cluster managers below; a submit-time sketch follows the list.

  • YARN (Yet Another Resource Negotiator, part of the Hadoop ecosystem)
  • Mesos
  • Standalone cluster manager: ships with Spark itself and is well suited to testing and POCs.
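
A minimal sketch of how the cluster manager is chosen via the --master URL at submit time (hosts and ports are placeholders); for local testing the master can also be set in code:

    import org.apache.spark.sql.SparkSession

    object ClusterManagerExample {
      def main(args: Array[String]): Unit = {
        // Selected at submit time:
        //   spark-submit --master yarn app.jar                      (YARN)
        //   spark-submit --master mesos://mesos-host:5050 app.jar   (Mesos)
        //   spark-submit --master spark://master-host:7077 app.jar  (standalone)
        // For testing and POCs, local mode can be set directly in code:
        val spark = SparkSession.builder()
          .appName("ClusterManagerExample")
          .master("local[*]") // run locally with all available cores
          .getOrCreate()

        println(spark.sparkContext.master)
        spark.stop()
      }
    }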