1. What is Data locality / placement?

Ans: Spark relies on data locality or data placement or proximity to data source, that makes Spark jobs sensitive to where the data is located. It is therefore important to have Spark running on Hadoop YARN cluster if the data comes from HDFS.

 

With HDFS the Spark driver contacts NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits ), and then schedules the work to the SparkWorkers. Spark’s compute nodes / workers should be running on storage nodes.

 

  1. What is master URL in local mode?

Ans: You can run Spark in local mode using local , local[n] or the most general local[*].

The URL says how many threads can be used in total:

  • local uses 1 thread only.
  • local[n] uses n threads.
  • local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

 

  1. Define components of YARN?

Ans: YARN components are below

ResourceManager: runs as a master daemon and manages ApplicationMasters and NodeManagers.

ApplicationMaster:  is a lightweight process that coordinates the execution of tasks of an application and asks the ResourceManager for resource containers for tasks. It monitors tasks, restarts failed ones, etc. It can run any type of tasks, be them MapReduce tasks or Giraph tasks, or Spark tasks.

NodeManager offers resources (memory and CPU) as resource containers.

NameNode

Container:  can run tasks, including ApplicationMasters.

 

 

  1. What is a Broadcast Variable?

Ans: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

 

  1. How can you define Spark Accumulators?

Ans: This are similar to counters in Hadoop MapReduce framework, which gives information regarding completion of tasks, or how much data is processed etc.