61. Which cluster managers can be used with Spark?
Ans: Apache Mesos, Hadoop YARN, and Spark standalone.
Spark can also run in local mode, on a single node in a single JVM. The driver and executors run in the same JVM, so the same node is used for execution.
62. What is a BlockManager?
Ans: BlockManager is a key-value store for blocks that acts as a cache. It runs on every node in a Spark runtime environment, i.e. on the driver and on the executors. It provides interfaces for putting and retrieving blocks, both locally and remotely, in various stores, i.e. memory, disk, and off-heap.
A BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent cached RDD partitions, intermediate shuffle data, and broadcast data.
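To make the "key-value store for blocks" idea concrete, here is a minimal toy sketch in plain Python. It is not Spark's BlockManager: the class name ToyBlockStore, the memory_capacity parameter, the ".blk" file suffix, and the block ids are all hypothetical, and remote fetching and off-heap storage are left out. It only shows blocks being put into memory first and spilled to disk when memory is full.

```python
import os
import pickle
import tempfile


class ToyBlockStore:
    """Hypothetical two-tier block store: memory first, then local disk."""

    def __init__(self, memory_capacity, disk_dir):
        self._memory = {}                  # block_id -> data held in memory
        self._capacity = memory_capacity   # max number of in-memory blocks
        self._disk_dir = disk_dir

    def _path(self, block_id):
        return os.path.join(self._disk_dir, f"{block_id}.blk")

    def put(self, block_id, data):
        """Store a block and report which tier received it."""
        if block_id in self._memory or len(self._memory) < self._capacity:
            self._memory[block_id] = data
            return "memory"
        with open(self._path(block_id), "wb") as f:
            pickle.dump(data, f)
        return "disk"

    def get(self, block_id):
        """Return the block's data, or None if the block is unknown."""
        if block_id in self._memory:
            return self._memory[block_id]
        path = self._path(block_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = ToyBlockStore(memory_capacity=1, disk_dir=tmp)
    first = store.put("rdd_0_0", [1, 2, 3])    # fits in memory
    second = store.put("rdd_0_1", [4, 5, 6])   # memory full, goes to disk
    fetched = store.get("rdd_0_1")
    missing = store.get("rdd_9_9")
```

The caller only deals with block ids; which tier holds the data is an internal detail of the store, which is the property the answer above describes.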
63. What is Data locality / placement?
Ans: Spark relies on data locality (also called data placement, or proximity to the data source), which makes Spark jobs sensitive to where the data is located. It is therefore important to run Spark on the Hadoop YARN cluster if the data comes from HDFS.
With HDFS, the Spark driver asks the NameNode which DataNodes (ideally local) hold the blocks of a file or directory, along with their locations (represented as InputSplits), and then schedules the work on the Spark workers. Spark’s compute nodes / workers should therefore run on the storage nodes.
64. What is the master URL in local mode?
Ans: You can run Spark in local mode using local, local[n], or the most general local[*].
The URL says how many threads can be used in total:
· local uses 1 thread only.
· local[n] uses n threads.
· local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
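The three forms listed above can be sketched as a small parser. This is an illustration in plain Python, not Spark's own parsing code: the function name local_thread_count is hypothetical, and os.cpu_count() stands in for the JVM's Runtime.getRuntime.availableProcessors().

```python
import os
import re


def local_thread_count(master_url: str) -> int:
    """Return the number of threads implied by a local master URL."""
    if master_url == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master_url)
    if match is None:
        raise ValueError(f"not a local master URL: {master_url!r}")
    value = match.group(1)
    if value == "*":
        # Python stand-in for Runtime.getRuntime.availableProcessors()
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for url in ("local", "local[4]", "local[*]"):
        print(url, "->", local_thread_count(url))
```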
65. What are the components of YARN?
Ans: The YARN components are:
ResourceManager: runs as a master daemon and manages ApplicationMasters and NodeManagers.
ApplicationMaster: a lightweight process that coordinates the execution of an application's tasks and asks the ResourceManager for resource containers for them. It monitors tasks, restarts failed ones, etc. It can run any type of task, be it MapReduce, Giraph, or Spark.
NodeManager: offers resources (memory and CPU) as resource containers.
Container: can run tasks, including ApplicationMasters.
66. What is a Broadcast Variable?
Ans: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
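The difference between "cached on each machine" and "shipped with every task" can be shown with a toy simulation in plain Python. This is not Spark's API (in PySpark the real entry point is SparkContext.broadcast, read through the .value attribute). The class ToyBroadcast, the helper run_tasks, the machine names, and the round-robin task placement are all assumptions made for illustration.

```python
class ToyBroadcast:
    """Hypothetical broadcast wrapper: one fetch per machine, then cached."""

    def __init__(self, value):
        self._value = value
        self._machine_cache = {}   # machine name -> cached read-only value
        self.fetch_count = 0       # how many times the value was "shipped"

    def value_on(self, machine):
        if machine not in self._machine_cache:
            self.fetch_count += 1
            self._machine_cache[machine] = self._value
        return self._machine_cache[machine]


def run_tasks(broadcast, records, machines):
    """Run one task per record, assigning tasks to machines round-robin."""
    results = []
    for task_id, record in enumerate(records):
        machine = machines[task_id % len(machines)]
        table = broadcast.value_on(machine)   # cached after the first task
        results.append(table.get(record, "unknown"))
    return results


country_names = {"DE": "Germany", "FR": "France", "IN": "India"}
bc = ToyBroadcast(country_names)
records = ["DE", "IN", "FR", "DE", "XX", "IN"] * 2   # 12 tasks
names = run_tasks(bc, records, machines=["node-1", "node-2", "node-3"])
```

With 12 tasks on 3 machines the lookup table is fetched 3 times rather than 12, which is the saving a broadcast variable is meant to provide.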
67. How can you define Spark Accumulators?
Ans: Accumulators are similar to counters in the Hadoop MapReduce framework. They give information such as task completion or how much data has been processed.
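A counter of this kind can be sketched in plain Python. This is a toy, not Spark's implementation (PySpark exposes the real thing through SparkContext.accumulator, with add and value). The class ToyAccumulator, the helper parse_partition, and the sample partitions are hypothetical. Tasks only add to the accumulator, and the driver reads the total afterwards.

```python
class ToyAccumulator:
    """Hypothetical add-only counter that the driver reads after the job."""

    def __init__(self, zero=0):
        self._value = zero

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


def parse_partition(lines, bad_records):
    """Parse integers from one partition, counting lines that fail."""
    parsed = []
    for line in lines:
        try:
            parsed.append(int(line))
        except ValueError:
            bad_records.add(1)   # side channel: does not affect the result
    return parsed


partitions = [["1", "2", "x"], ["3", "oops", "4"], ["5"]]
bad_records = ToyAccumulator()
parsed = [n for part in partitions for n in parse_partition(part, bad_records)]
```

The job's result (parsed) and the bookkeeping (bad_records.value) stay separate, which is how such counters are typically used: to report on processing without changing its output.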
68. Which data sources can Spark process?
Ans: Among others:
· Hadoop Distributed File System (HDFS)
· Cassandra (NoSQL database)
· HBase (NoSQL database)
· S3 (Amazon Web Services storage, AWS cloud)
69. What is Apache Parquet format?
Ans: Apache Parquet is a columnar storage format, i.e. the values of each column are stored together rather than row by row.
70. What is Apache Spark Streaming?
Ans: Spark Streaming processes live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
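The window function mentioned above can be illustrated with a toy word count over micro-batches in plain Python. This is not the Spark Streaming API: the function windowed_word_counts is hypothetical, each inner list stands for one micro-batch of text lines, and the window is assumed to slide by one batch at a time.

```python
from collections import Counter, deque


def windowed_word_counts(batches, window_length):
    """Yield word counts over the last `window_length` micro-batches."""
    window = deque(maxlen=window_length)   # oldest batch drops off the left
    for batch in batches:
        # map step: split the lines of this batch into per-word counts
        window.append(Counter(word for line in batch for word in line.split()))
        # reduce step: merge the counts of every batch still in the window
        total = Counter()
        for counts in window:
            total.update(counts)
        yield dict(total)


batches = [["spark streaming"], ["spark kafka"], ["kafka kafka"]]
results = list(windowed_word_counts(batches, window_length=2))
```

Each result covers only the two most recent batches, so words from the first batch no longer appear in the third result.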