Question 111: Describe LZO compression?

Answer: A free, open-source compression library. LZO compression provides a good balance between compressed data size and compression speed. The LZO algorithm is the most CPU-efficient of the common codecs, using very little CPU. Its compression ratios are not as good as those of other codecs, but its compression is still significant compared with the uncompressed file sizes. Unlike some other formats, LZO-compressed files are splittable (once an index has been built for them with the hadoop-lzo indexer), enabling MapReduce to process the splits in parallel.
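
As a sketch of how this is typically wired up (assuming the third-party hadoop-lzo library, which provides the LzoCodec and LzopCodec classes, is installed on the cluster; the class name below is illustrative), a MapReduce job can enable LZO through standard Hadoop configuration properties:

```java
// Minimal sketch: enabling LZO compression for intermediate and final
// MapReduce output. Assumes the third-party hadoop-lzo library is installed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LzoJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output with LZO to reduce shuffle I/O.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");

        // Compress the final job output as .lzo files; run the hadoop-lzo
        // indexer on them afterwards so later jobs can split them.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.set("mapreduce.output.fileoutputformat.compress.codec",
                 "com.hadoop.compression.lzo.LzopCodec");

        return Job.getInstance(conf, "lzo-compression-example");
    }
}
```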

Question 112: Define the MapReduce algorithm?

Answer: A distributed processing framework for processing and generating large data sets, with an implementation that runs on large clusters of machines.

The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.

The implementation provides an API for configuring and submitting jobs, together with job scheduling and management services; a library of search, sort, index, inverted index, and word co-occurrence algorithms; and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
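
The classic illustration of the model is word count; the sketch below uses the standard Hadoop Java API, with the map function emitting (word, 1) pairs and the reduce function summing the counts for each word:

```java
// Word count: the canonical MapReduce example.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```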

Question 113: Define a stage in Spark?

Answer: In Spark, a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling the data.
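
A minimal sketch using the Spark 2+ Java API shows where a stage boundary falls: the narrow transformations run within one stage, and reduceByKey introduces a shuffle that starts a new one (the class name and input path are illustrative):

```java
// Sketch of stage boundaries: flatMap and mapToPair are narrow transformations
// and stay in one stage; reduceByKey requires a shuffle, so the job splits
// into two stages.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class StageExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("stage-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);   // input path passed by the caller

            // Stage 1: narrow transformations, no data movement between partitions.
            JavaPairRDD<String, Integer> pairs = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1));

            // Stage 2: reduceByKey shuffles data, ending the previous stage.
            JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1() + "\t" + t._2()));
        }
    }
}
```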

Question 114: Define a task in an Apache Spark application?

Answer: A unit of work on a partition of an RDD.
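
A small sketch of the relationship between partitions and tasks (the class name is illustrative): the stage that computes an RDD launches one task per partition, so an RDD with 8 partitions is processed by 8 tasks:

```java
// Sketch: the number of tasks Spark launches for a stage equals the number
// of partitions of the RDD being computed.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TaskExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("task-example").setMaster("local[4]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // 8 partitions -> the stage that computes this RDD runs 8 tasks,
            // one per partition.
            JavaRDD<Integer> numbers =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 8);
            System.out.println("partitions (and tasks per stage): "
                    + numbers.getNumPartitions());
            System.out.println("sum: " + numbers.reduce(Integer::sum));
        }
    }
}
```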

Question 115: What are the main components of the YARN architecture?

Answer: A general architecture for running distributed applications. YARN specifies the following components:

  • ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
  • ApplicationMaster - A supervisory task that requests the resources needed for the application's tasks. Each application has its own ApplicationMaster, which runs in a container on one of the NodeManagers. The ApplicationMaster requests containers, which are sized according to the resources a task requires to run.
  • NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.
  • JobHistory Server - Keeps track of completed applications.

The ApplicationMaster negotiates with the ResourceManager for cluster resources—described in terms of a number of containers, each with a certain memory limit—and then runs application-specific processes in those containers. The containers are overseen by NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
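
As a sketch of how an application expresses the container requests described above (using the standard Hadoop 2+ property names; the class name is illustrative), a MapReduce job can size its ApplicationMaster and task containers through configuration:

```java
// Sketch: sizing the ApplicationMaster and task containers that YARN
// allocates for a MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Memory requested for the ApplicationMaster's own container (MB).
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 1536);

        // Memory requested for each map and reduce task container (MB);
        // NodeManagers kill containers that exceed their allocation.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 2048);

        return Job.getInstance(conf, "yarn-resource-example");
    }
}
```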