Question-41: What do you mean by TaskInstance?

Answer: Task instances are the individual map and reduce tasks that run on each slave node. The TaskTracker starts a separate JVM process for each task (called a Task Instance) so that a failure in one task does not take down the entire TaskTracker. Several task instance processes can run on a slave node at the same time; how many depends on the number of slots configured on the TaskTracker. By default, a new JVM process is spawned for every task.

Question-42: How many daemon processes run on a Hadoop cluster?

Answer: Hadoop consists of five separate daemons, each of which runs in its own JVM. The following three daemons run on master nodes:

  • NameNode - This daemon stores and maintains the metadata for HDFS.
  • Secondary NameNode - Performs housekeeping functions for the NameNode.
  • JobTracker - Manages MapReduce jobs and distributes individual tasks to the machines running a TaskTracker.

The following two daemons run on each slave node:

  • DataNode - Stores the actual HDFS data blocks.
  • TaskTracker - Responsible for instantiating and monitoring individual map and reduce tasks.

Question-43: What is the maximum number of JVMs that can run on a slave node?

Answer: One or more task instances can run on each slave node, and each task instance runs as a separate JVM process. The number of task instances is controlled by configuration (the number of map and reduce slots on the TaskTracker). Typically, a high-end machine is configured to run more task instances.
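As a rough illustration of the slot-based limit, here is a minimal Java sketch that adds up the map and reduce slots of one TaskTracker. It assumes the classic (Hadoop 1.x) property names mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum with a default of 2 each; the class and method names (SlotCount, maxTaskJvms) are made up for this example and are not part of Hadoop. The result counts task JVMs only, not the DataNode and TaskTracker daemons themselves.

```java
import java.util.Properties;

// Hypothetical helper, not a Hadoop class: estimates how many task JVMs
// a single TaskTracker may run at the same time.
final class SlotCount {
    // Hadoop 1.x property names (assumed here; check your version's mapred-default.xml).
    static final String MAP_SLOTS = "mapred.tasktracker.map.tasks.maximum";
    static final String REDUCE_SLOTS = "mapred.tasktracker.reduce.tasks.maximum";
    static final String DEFAULT_SLOTS = "2";

    // Upper bound on concurrent task JVMs: one per map slot plus one per reduce slot.
    static int maxTaskJvms(Properties conf) {
        int mapSlots = Integer.parseInt(conf.getProperty(MAP_SLOTS, DEFAULT_SLOTS));
        int reduceSlots = Integer.parseInt(conf.getProperty(REDUCE_SLOTS, DEFAULT_SLOTS));
        return mapSlots + reduceSlots;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAP_SLOTS, "8");     // example values for a larger machine
        conf.setProperty(REDUCE_SLOTS, "4");
        System.out.println(maxTaskJvms(conf)); // prints 12
    }
}
```

With no properties set, the sketch falls back to 2 + 2 = 4 task JVMs.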

Question-44: What is NAS?

Answer: NAS (network-attached storage) is a kind of file storage where the data resides on one centralized machine, and all cluster members read and write data on that shared storage over the network. For large-scale data processing this is not as efficient as HDFS.

Question-45: How does HDFS differ from NAS?

  • In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS the data is stored on dedicated hardware.
  • HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce because the data is stored separately from the computation.
  • HDFS runs on a cluster of machines and provides redundancy by replicating blocks across them, whereas NAS is served by a single machine and therefore does not provide redundancy across machines.
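To make the block distribution and replication points concrete, here is a small Java sketch of the arithmetic. It assumes the Hadoop 1.x defaults of a 64 MB block size and a replication factor of 3; the class and method names (HdfsBlockMath, blockCount, replicaCount) are hypothetical and only illustrate the calculation, not how HDFS is implemented.

```java
// Hypothetical helper, not a Hadoop class: shows how a file maps to
// blocks and block replicas under assumed default settings.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file (the last block may be partly filled).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total block replicas stored across the cluster's DataNodes.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long fileBytes = 1024 * MB;  // a 1 GB file
        long blockBytes = 64 * MB;   // assumed default block size
        int replication = 3;         // assumed default replication factor
        System.out.println(blockCount(fileBytes, blockBytes));                // prints 16
        System.out.println(replicaCount(fileBytes, blockBytes, replication)); // prints 48
    }
}
```

Under these assumptions a 1 GB file becomes 16 blocks and 48 block replicas spread over the DataNodes, so losing one machine still leaves other copies of each block available.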