Question 1: How do you differentiate between Hadoop and Spark?
Answer: Hadoop is more of an ecosystem: it provides a platform for storage (in different data formats), a compute engine, cluster management, and a distributed file system (HDFS). Spark, in contrast, is primarily a compute engine, and it integrates well with Hadoop; you can even run Spark as one of the compute engines for Hadoop.
Spark does not have its own storage engine, but it can connect to various storage systems such as HDFS, the local file system, and RDBMSs.
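As a rough illustration of that storage flexibility, the sketch below reads data through the same DataFrame API from HDFS, the local file system, and an RDBMS over JDBC (the paths, host names, and table are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-example") // hypothetical application name
  .getOrCreate()

// Same read API, different storage backends: only the URI scheme changes.
val fromHdfs  = spark.read.text("hdfs:///data/events.log")  // HDFS (hypothetical path)
val fromLocal = spark.read.text("file:///tmp/events.log")   // local file system
val fromRdbms = spark.read.format("jdbc")                   // RDBMS over JDBC
  .option("url", "jdbc:postgresql://dbhost:5432/appdb")     // hypothetical connection URL
  .option("dbtable", "events")                              // hypothetical table
  .load()                                                   // needs the JDBC driver on the classpath
```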
Question 2: What are the basic functionalities of Spark Core?
Answer: Spark Core is the heart of the entire Spark engine, and it provides functionality such as:
- Managing the memory pool
- Scheduling tasks on the cluster
- Recovering from failed jobs
- Integrating with various storage systems such as RDBMS, HDFS, and AWS S3
- Providing the RDD API, which is the basis for the higher-level APIs (a minimal RDD example is sketched after this list)
Spark Core abstracts the native APIs and lower-level technicalities away from the end user.
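As a minimal sketch of the RDD API that Spark Core exposes (the input path is hypothetical), a classic word count looks like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-example").getOrCreate()
val sc = spark.sparkContext

// Spark Core schedules these tasks across the cluster, manages the memory
// pool, and recomputes lost partitions from the RDD lineage on failure.
val counts = sc.textFile("hdfs:///data/words.txt") // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```

The higher-level DataFrame and Dataset APIs ultimately run on this same Spark Core machinery.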
Question 3: If you have existing queries written in Hive, what changes do you need to make to run them on Spark?
Answer: The Spark SQL module can run queries in Spark, and it can also use the Hive metastore, so there is no need to change anything to run Hive queries in Spark SQL; it can even use UDFs defined in Hive.
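A minimal sketch of this, assuming a table named sales (hypothetical, along with its columns) is already registered in the Hive metastore:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() connects Spark SQL to the Hive metastore, so
// existing Hive tables and registered UDFs are visible without changes.
val spark = SparkSession.builder()
  .appName("hive-example")
  .enableHiveSupport()
  .getOrCreate()

// The same HiveQL query runs unchanged on Spark SQL.
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
```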
Question 4: What is the Driver Program in Spark?
Answer: This is the main program of a Spark application; in the case of the REPL, the spark-shell itself acts as the driver program. The driver program creates the instance of SparkSession (Spark 2.0+) or SparkContext. It is the driver program's responsibility to communicate with the cluster manager to distribute tasks to the cluster's worker nodes.
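A skeletal driver program might look like the sketch below (the object name is hypothetical); the JVM running main() is the driver:

```scala
import org.apache.spark.sql.SparkSession

object DriverExample { // hypothetical application object
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkSession (which wraps the SparkContext).
    val spark = SparkSession.builder()
      .appName("driver-example")
      .getOrCreate()

    // Transformations are planned here on the driver; the action below is
    // split into tasks that the cluster manager assigns to worker nodes.
    val total = spark.range(1, 1000000).count()
    println(s"count = $total")

    spark.stop()
  }
}
```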
Question 5: What is the role of the cluster manager, and which cluster managers are supported by Spark?
Answer: The cluster manager manages the cluster's worker nodes and communicates with the driver node as well. Spark supports the three cluster managers below; the choice is made through the master URL, as sketched after this list.
- YARN: Yet Another Resource Negotiator (part of the Hadoop ecosystem)
- Mesos
- Standalone cluster manager: ships with Spark itself and is good for testing and POCs
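The master URL can be passed via the --master flag of spark-submit or set in code. A sketch with the standard master URL formats (host names and ports are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// The master URL decides which cluster manager the application uses.
val spark = SparkSession.builder()
  .appName("master-example")
  // .master("yarn")                         // YARN (resolved from the Hadoop config)
  // .master("mesos://mesos-master:5050")    // Mesos (placeholder host)
  .master("spark://standalone-master:7077")  // Standalone (placeholder host)
  .getOrCreate()
```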