Question-36: What is the JobTracker, and what does it do in a Hadoop cluster?

Answer: The JobTracker is a daemon service that submits MapReduce tasks to the Hadoop cluster and tracks their progress. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the location of the JobTracker node.

The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions:

  1. Client applications submit jobs to the JobTracker (see the code sketch after this list).
  2. The JobTracker talks to the NameNode to determine the location of the data.
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, or it may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.
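
As a minimal illustration of steps 1 and 8, here is a sketch of the classic word-count job written against the Hadoop 1.x (org.apache.hadoop.mapred) API. JobClient.runJob() submits the job to the JobTracker and then polls it for progress until completion; input and output paths come from the command line.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCountJob {

        // Map: emit (word, 1) for every token in the input line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountJob.class);
            conf.setJobName("wordcount");          // shows up in the JobTracker UI
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Steps 1 and 8: submit to the JobTracker and poll until done.
            JobClient.runJob(conf);
        }
    }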

Question-37: How does the JobTracker schedule a task?

Answer: The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also report the number of currently available slots, so the JobTracker stays up to date on where in the cluster new work can be delegated. When the JobTracker needs to schedule a task, it first looks for an empty slot on the same server that hosts the DataNode containing the task's input data; failing that, it looks for an empty slot on a machine in the same rack.
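
The following is a simplified, hypothetical Java sketch of that locality preference; TaskTrackerInfo and LocalityScheduler are illustrative names, not Hadoop classes, and the real JobTracker scheduler is considerably more involved:

    import java.util.List;

    // Per-tracker state, as reported in the latest heartbeat.
    class TaskTrackerInfo {
        final String host;
        final String rack;
        int freeSlots;

        TaskTrackerInfo(String host, String rack, int freeSlots) {
            this.host = host;
            this.rack = rack;
            this.freeSlots = freeSlots;
        }
    }

    class LocalityScheduler {
        // Picks a TaskTracker for a task whose input block lives on splitHost/splitRack:
        // prefer a data-local slot, then a rack-local slot, then any free slot.
        static TaskTrackerInfo pickTracker(List<TaskTrackerInfo> trackers,
                                           String splitHost, String splitRack) {
            TaskTrackerInfo rackLocal = null;
            TaskTrackerInfo anyFree = null;
            for (TaskTrackerInfo t : trackers) {
                if (t.freeSlots == 0) continue;          // no capacity on this node
                if (t.host.equals(splitHost)) return t;  // 1) data-local slot
                if (rackLocal == null && t.rack.equals(splitRack)) rackLocal = t;
                if (anyFree == null) anyFree = t;
            }
            return rackLocal != null ? rackLocal : anyFree; // 2) rack-local, 3) anywhere
        }
    }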

Question-38: How many instances of the TaskTracker run on a Hadoop cluster?

Answer: There is one TaskTracker daemon process on each slave node in the Hadoop cluster.
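
For reference, in Hadoop 1.x the slave nodes are listed one per line in the conf/slaves file, and the start scripts (start-all.sh / start-mapred.sh) launch one TaskTracker daemon on each host listed there. The hostnames below are placeholders:

    # conf/slaves -- one worker hostname per line
    slave01.example.com
    slave02.example.com
    slave03.example.com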

Question-39: What are the two main parts of the Hadoop framework?

Answer: Hadoop consists of two main parts:

  1. The Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to data.
  2. Hadoop MapReduce, a software framework for processing large data sets in parallel.
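
As a small illustration of the HDFS side, the sketch below reads the first line of a file through the org.apache.hadoop.fs.FileSystem API; the path /data/input.txt is hypothetical, and the cluster configuration is picked up from the usual core-site.xml on the classpath:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // loads core-site.xml, hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) file system
            try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                System.out.println(reader.readLine());  // print the first line
            }
        }
    }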

Question-40: Explain the use of the TaskTracker in the Hadoop cluster.

Answer: A TaskTracker is a daemon that runs on each slave node in the cluster and accepts tasks from the JobTracker: Map, Reduce, or Shuffle operations. The TaskTracker also runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process for each unit of work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker itself.

The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker.
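
The separate child JVM is the isolation mechanism at work here. The hypothetical sketch below shows the general idea with a plain ProcessBuilder: the child process can crash or abort without taking the parent daemon down. org.example.TaskRunner is a made-up entry point; mapred.child.java.opts is the real Hadoop 1.x property that supplies the child's JVM flags:

    import java.io.IOException;

    public class ChildJvmLauncher {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Launch the task in its own JVM (cf. mapred.child.java.opts for heap flags).
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-Xmx200m", "org.example.TaskRunner"); // hypothetical entry point
            pb.inheritIO();                     // forward the child's output to ours
            Process child = pb.start();
            int exitCode = child.waitFor();     // block until the task instance exits
            // Any child failure shows up here as an exit code; the parent survives.
            System.out.println("Task instance exited with code " + exitCode);
        }
    }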

The TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.