Question 101: Which two resources are used by a Spark application but cannot be managed by either YARN or Spark?

Answer: The two main resources that Spark and YARN manage are CPU and memory. Disk and network I/O also affect Spark performance, but neither Spark nor YARN actively manages them.

Question 102: When you deploy Spark on the YARN cluster manager, how does ApplicationMaster memory come into the picture?

Answer: The ApplicationMaster, a non-executor container that requests containers from YARN, needs memory and CPU that must be accounted for. In client deployment mode, these default to 1024 MB and one core. In cluster deployment mode, the ApplicationMaster runs the Spark application driver, so consider increasing its resources with the --driver-memory and --driver-cores flags.
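A minimal sketch of how this plays out on the spark-submit command line (the jar name and resource sizes are illustrative, not from the original text):

```shell
# Cluster mode: the driver runs inside the ApplicationMaster,
# so size the AM via --driver-memory / --driver-cores.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --driver-cores 2 \
  my-app.jar

# Client mode: the driver runs on the submitting machine; the
# (non-driver) ApplicationMaster is sized separately through
# the spark.yarn.am.memory / spark.yarn.am.cores properties.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.am.memory=1g \
  --conf spark.yarn.am.cores=1 \
  my-app.jar
```

Note that in client mode the --driver-memory flag only affects the local driver process, not the ApplicationMaster container on the cluster.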

Question 103: What is Parquet file?

Answer: Parquet is an open source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes (Spark 2.0 also started using it extensively). Parquet allows a compression scheme to be specified at the per-column level, and allows new encodings to be added as they are invented and implemented. Encoding and compression are separated, so Parquet consumers can implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.

Question 104: When you submit your application to the Spark cluster, do you need to bundle the Hadoop and Spark jars with it?

Answer: No, you don’t have to include the Spark and Hadoop jars with the application jar, because they are already available on the cluster at runtime.

Question 105: What is the Beeswax application?

Answer: Beeswax is a Hue application for working with Hive: with it you can create tables, load data, and run and manage Hive queries.