Question 56: How are Spark Streaming applications affected by Dynamic Allocation?

Answer: When Dynamic Allocation is enabled in Spark, executors are removed when they are idle. However, Dynamic Allocation is not effective for Spark Streaming. In Spark Streaming, data arrives in every batch, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed. If the executor idle timeout is greater than the batch duration, executors are never removed. Hence, it is recommended that you disable Dynamic Allocation for Spark Streaming by setting “spark.dynamicAllocation.enabled” to false.
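
The two cases in this answer can be illustrated with a toy model. The sketch below is plain Python, not Spark code, and it does not reproduce Spark's actual allocation logic; the function names and the numbers are made up for illustration. It assumes a single executor that is idle from the moment it finishes one batch until the next batch arrives.

```python
def idle_gap_seconds(batch_interval_s, processing_time_s):
    """Time an executor sits idle between finishing one batch and the next arriving."""
    return max(0.0, batch_interval_s - processing_time_s)


def executor_removals(num_batches, batch_interval_s, processing_time_s, idle_timeout_s):
    """Count how often a single executor would be released over num_batches batches."""
    gap = idle_gap_seconds(batch_interval_s, processing_time_s)
    # The executor is released only if it stays idle longer than the timeout.
    removed_each_cycle = gap > idle_timeout_s
    return num_batches if removed_each_cycle else 0


# Timeout shorter than the batch interval: released and re-requested every batch.
churn = executor_removals(
    num_batches=10, batch_interval_s=60, processing_time_s=5, idle_timeout_s=30
)

# Timeout longer than the batch interval: the executor is never released.
pinned = executor_removals(
    num_batches=10, batch_interval_s=10, processing_time_s=5, idle_timeout_s=60
)
```

In the first case the executor is dropped after every batch, and in the second it is never dropped, so in neither case does Dynamic Allocation save resources in a useful way.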

Question 57: When you submit a Spark Streaming application in local mode (not on Hadoop YARN), it must have at least two threads. Why?

Answer: As discussed previously, a running Spark Streaming application requires at least two threads: one to receive data and one to process that data. With only one thread, the receiver occupies it and no thread is left to process the received data.
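
The effect can be reproduced with an analogy in plain Python. This is not Spark code: the thread pool stands in for the local-mode threads (for example a master of `local[1]` versus `local[2]`), and the function and variable names are made up for illustration. The receiver is a long-running task that holds its thread, so with a pool of one thread the processing task never gets to start.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_pipeline(num_threads, num_records=5, timeout_s=0.5):
    """Return the records processed while the receiver is still holding its thread."""
    records = queue.Queue()
    processed = []
    stop = threading.Event()

    def receiver():
        # Long-running task: keeps its worker thread until told to stop.
        for i in range(num_records):
            records.put(i)
        stop.wait()

    def processor():
        for _ in range(num_records):
            processed.append(records.get() * 2)

    pool = ThreadPoolExecutor(max_workers=num_threads)
    pool.submit(receiver)
    work = pool.submit(processor)
    try:
        work.result(timeout=timeout_s)
    except FutureTimeout:
        pass  # the processor never got a thread in time
    snapshot = list(processed)  # what was processed while the receiver was running
    stop.set()
    pool.shutdown(wait=True)
    return snapshot


starved = run_pipeline(num_threads=1)
healthy = run_pipeline(num_threads=2)
```

With one thread nothing is processed while the receiver runs; with two threads every record is processed.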

Question 58: How do you enable fault-tolerant data processing in Spark Streaming?

Answer: If the driver host for a Spark Streaming application fails, the application can lose data that had been received but not yet processed. To ensure that no data is lost, you can use Spark Streaming recovery. Spark writes incoming data to HDFS as it is received and uses this data to recover state if a failure occurs.
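
A minimal PySpark sketch of one way to set this up with the DStream API is shown below, assuming a receiver-based source. The checkpoint path, application name, and batch interval are hypothetical, and the input streams and output operations are left out, so `main()` will only run once those are defined inside `create_context()`.

```python
# Hypothetical checkpoint location on the cluster's HDFS.
CHECKPOINT_DIR = "hdfs:///user/example/streaming-checkpoints"

RECOVERY_CONF = {
    # Persist received data to a write-ahead log before it is processed.
    "spark.streaming.receiver.writeAheadLog.enable": "true",
}


def create_context():
    """Build a fresh StreamingContext; used only when no checkpoint exists yet."""
    # Imported here so the definitions in this file load without PySpark installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = SparkConf().setAppName("recoverable-stream")  # hypothetical name
    for key, value in RECOVERY_CONF.items():
        conf.set(key, value)

    ssc = StreamingContext(SparkContext(conf=conf), 10)  # 10-second batches (assumed)
    ssc.checkpoint(CHECKPOINT_DIR)
    # Define input streams and at least one output operation here.
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Restore from the checkpoint if one exists, otherwise call create_context().
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Building the context inside a function passed to `StreamingContext.getOrCreate` is what lets a restarted driver pick up from the checkpoint instead of starting from scratch.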

Question 59: When you use Spark Streaming with AWS S3 (or another cloud store), which storage is recommended for checkpoints?

Answer: When a Spark Streaming application uses a cloud service as the underlying storage layer, use ephemeral HDFS on the cluster to store checkpoints instead of a cloud store such as Amazon S3 or Microsoft ADLS.
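
One way to enforce this recommendation is a small guard in the job's own setup code. The sketch below is plain Python, not part of Spark; the helper name, the example paths, and the list of URI schemes treated as cloud stores are assumptions for illustration, and the list is not exhaustive.

```python
from urllib.parse import urlparse

# URI schemes treated as cloud object stores in this sketch (assumed, partial list).
CLOUD_SCHEMES = {"s3", "s3a", "s3n", "adl", "abfs", "abfss", "wasb", "wasbs", "gs"}


def check_checkpoint_dir(checkpoint_uri):
    """Return the URI unchanged, or raise if it points at a cloud object store."""
    scheme = urlparse(checkpoint_uri).scheme
    if scheme in CLOUD_SCHEMES:
        raise ValueError(
            f"checkpoint directory uses cloud scheme '{scheme}'; "
            "use HDFS on the cluster instead"
        )
    return checkpoint_uri


# Hypothetical paths: the first is accepted, the second would raise ValueError.
accepted = check_checkpoint_dir("hdfs:///checkpoints/my-app")
```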

Question 60: If you have structured data, which Spark components can you use?

Answer: To work with structured data, use Spark SQL or the DataFrame API.
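
As a short PySpark sketch, the same aggregation can be written with either component. The column names, sample rows, view name, and application name below are made up for illustration.

```python
QUERY = "SELECT city, COUNT(*) AS n FROM people GROUP BY city"


def count_by_city(spark, rows):
    """Return per-city counts computed with the DataFrame API and with Spark SQL."""
    df = spark.createDataFrame(rows, ["name", "city"])

    # DataFrame API
    via_api = df.groupBy("city").count()

    # Spark SQL over the same data, exposed as a temporary view
    df.createOrReplaceTempView("people")
    via_sql = spark.sql(QUERY)
    return via_api, via_sql


def main():
    # Imported here so the definitions above load without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("structured-demo")  # hypothetical name
        .getOrCreate()
    )
    rows = [("Ann", "Pune"), ("Raj", "Pune"), ("Lee", "Oslo")]
    via_api, via_sql = count_by_city(spark, rows)
    via_api.show()
    via_sql.show()
    spark.stop()
```

Calling `main()` on a machine with PySpark installed prints both result tables; they contain the same counts, since both forms run on the same engine.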