Question 1: Why Spark, when Hadoop already exists?
Answer: Below are a few reasons.
- Iterative Algorithms: MapReduce is generally not well suited to iterative algorithms such as Machine Learning and graph processing. These algorithms are iterative by nature and need to run the same steps over the data again and again; keeping the data in memory, with fewer saves to disk and fewer transfers over the network, gives much better performance.
- In-Memory Processing: MapReduce stores intermediate data on disk and reads it back from disk, which is not good for fast processing. Spark keeps data in memory (configurable), which saves a lot of time by not reading and writing data to disk as happens in the case of Hadoop (see the caching sketch after this list).
- Near real-time data processing (Refer: Module-23): Spark also supports near real-time streaming workloads via the Spark Streaming framework.
- Rich and Simple API: Many improvements were made from Spark 1.x to Spark 2.x, mostly around building a rich API over Dataset and DataFrame. SQL support has also been improved.
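The snippet below is a minimal sketch of the in-memory advantage, assuming local mode and made-up data (the object name, app name, and numbers are illustrative only): cache() keeps an RDD in memory so that repeated passes, as in iterative algorithms, avoid recomputation and disk I/O.

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheDemo")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(1 to 1000000)

    // cache() keeps the RDD in memory, so repeated passes over it
    // (as in iterative algorithms) skip recomputation and disk I/O
    val squares = nums.map(n => n.toLong * n).cache()

    // Two "iterations" over the same data; the second is served from memory
    println(squares.sum())
    println(squares.max())

    spark.stop()
  }
}
```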
Question 2: Why are both Spark and Hadoop needed?
Answer: Spark is often called a cluster computing engine or simply an execution engine. Spark uses many concepts from Hadoop MapReduce, and the two work well together. Spark with HDFS and YARN gives better performance and also simplifies the work distribution on the cluster: HDFS is the storage engine for storing huge volumes of data, and Spark is the processing engine (in-memory as well as more efficient data processing).
- HDFS (Refer: Module-2): It is used as the storage engine for Spark as well as Hadoop.
- YARN (Refer: Module-5): It is a framework to manage the cluster using a pluggable scheduler.
- Run more than MapReduce: With Spark you can run MapReduce-style algorithms as well as higher-level operators, for instance map(), filter(), reduceByKey(), groupByKey(), etc. (see the word-count sketch after this list).
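Below is a minimal word-count sketch using those higher-level operators, assuming local mode and inline sample lines (the object name and sample strings are illustrative; in practice the input would typically come from HDFS via sc.textFile).

```scala
import org.apache.spark.sql.SparkSession

object OperatorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OperatorDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Inline sample lines; on a real cluster this could be
    // sc.textFile("hdfs:///path/to/input")
    val lines = sc.parallelize(Seq("spark and hadoop", "spark with yarn"))

    // Classic word count expressed with higher-level operators
    val counts = lines
      .flatMap(_.split("\\s+"))  // split lines into words
      .map(word => (word, 1))    // pair each word with a count of 1
      .reduceByKey(_ + _)        // sum the counts per word

    counts.collect().foreach(println)
    spark.stop()
  }
}
```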
Question 3: How do you differentiate between Hadoop and Spark?
Answer: Hadoop is more of an ecosystem, which provides a platform for storage (of different data formats), a compute engine, cluster management, and a distributed file system (HDFS). Spark, on the other hand, is more of a compute engine only, which can be well integrated with Hadoop. You can even say Spark works as one of the compute engines for Hadoop.
Spark does not have its own storage engine, but it can connect to various other storage systems like HDFS, the local file system, RDBMSs, etc.
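As a minimal sketch of that storage flexibility, the snippet below reads from a few different sources; all paths, hostnames, and credentials here are hypothetical placeholders, and the JDBC read additionally requires the matching driver jar on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object StorageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StorageDemo")
      .master("local[*]")
      .getOrCreate()

    // Local file system (hypothetical path)
    val localDs = spark.read.textFile("file:///tmp/input.txt")

    // HDFS (hypothetical path; requires a reachable HDFS)
    val hdfsDs = spark.read.textFile("hdfs:///user/demo/input.txt")

    // RDBMS over JDBC (hypothetical connection details)
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/demo")
      .option("dbtable", "public.events")
      .option("user", "demo")
      .load()

    println(localDs.count())
    spark.stop()
  }
}
```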
Question 4: What are the basic functionalities of Spark Core?
Answer: Spark Core is the heart of the entire Spark engine, and it provides various functionality as below.
- Managing the memory pool
- Scheduling tasks on the cluster
- Recovering from failed jobs
- Integrating with various storage systems like RDBMS, HDFS, AWS S3, etc.
- Providing the RDD API, which is the basis for the higher-level APIs
Spark Core abstracts the native APIs and lower-level technicalities away from the end user.
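The sketch below touches the RDD API directly and prints the lineage of a small pipeline, assuming local mode and made-up numbers; the lineage (the chain of transformations) is what lets Spark Core recompute lost partitions when recovering from failures.

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LineageDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A small RDD pipeline built directly on the Spark Core API
    val rdd = sc.parallelize(1 to 10)
      .filter(_ % 2 == 0)
      .map(_ * 10)

    // toDebugString shows the lineage that Spark uses to
    // recompute lost partitions after a failure
    println(rdd.toDebugString)

    spark.stop()
  }
}
```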
Question 5: What are the main improvements in Spark 2.0 w.r.t. Spark 1.0.0?
Answer: The major improvements in Spark 2.0 are around its API, SQL:2003 support, and, in streaming, the addition of Structured Streaming. Support for UDFs in the R language was also added.
- API: The Dataset and DataFrame APIs were merged (in Scala, DataFrame is now simply an alias for Dataset[Row]); a short sketch follows at the end of this answer.
However, the overall architecture of Spark 2.0 is the same as Spark 1.0. Core Spark internally still works on Directed Acyclic Graphs (DAGs) and RDDs.
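Below is a minimal sketch of the unified API in Spark 2.x, assuming local mode and made-up rows (the Person case class and sample values are illustrative only): the same data can be worked with as a typed Dataset, as a DataFrame, and through SQL.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object UnifiedApiDemo {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnifiedApiDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed Dataset of case-class rows
    val people: Dataset[Person] = Seq(Person("a", 30), Person("b", 40)).toDS()

    // In Spark 2.x (Scala), a DataFrame is just Dataset[Row]
    val df: DataFrame = people.toDF()

    // The same data queried through the improved SQL support
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 35").show()

    spark.stop()
  }
}
```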