Question 11: Do you see any pain points with regard to Spark DStream?

Answer: There are some common issues with Spark DStream:

  • Timestamp: It considers the time at which an event entered the Spark system, rather than the timestamp attached to the event itself (processing time instead of event time).
  • API: You have to write different code for batch and stream processing.
  • Failure conditions: The developer has to manage various failure conditions manually.

Question 12: Please list some advantages of Structured Streaming compared to DStream.

Answer: As we saw in the previous question, DStream has various issues; that is why Structured Streaming was introduced in Apache Spark:

  • It is fast.
  • It is fault-tolerant.
  • It provides an exactly-once stream processing approach.
  • Input data can be thought of as an append-only table that grows continuously.
  • Trigger: You can specify a trigger, which checks for new input data at a defined time interval.
  • API: The high-level API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame and Dataset APIs.
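The append-only-table and trigger model above can be sketched in plain Python. This is a toy illustration of the concept, not the Spark API; names such as `AppendOnlyTable` and `run_trigger` are invented for the example:

```python
class AppendOnlyTable:
    """Toy model of Structured Streaming's input: a table that only grows."""
    def __init__(self):
        self.rows = []

    def append(self, row):
        self.rows.append(row)

def run_trigger(table, state, last_seen):
    """A 'trigger' checks for rows appended since the last check and
    incrementally updates the result (here: a running sum)."""
    new_rows = table.rows[last_seen:]          # only the newly arrived rows
    state["sum"] = state.get("sum", 0) + sum(new_rows)
    return len(table.rows)                     # new high-water mark

table = AppendOnlyTable()
state, seen = {}, 0

for micro_batch in ([1, 2], [3], [4, 5]):      # data arriving over time
    for row in micro_batch:
        table.append(row)
    seen = run_trigger(table, state, seen)     # trigger fires each interval

print(state["sum"])                            # running total over all rows: 15
```

Each trigger firing only touches the newly appended rows, which mirrors how the engine incrementally updates the result rather than recomputing over the whole table.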

Question 13: What is the use of a write-ahead log (WAL)?

Answer: A write-ahead log durably records received data to fault-tolerant storage before it is processed. If a failure occurs, the data can be replayed from the log, so nothing received is lost. Together with checkpointing, the WAL is what lets Structured Streaming provide its end-to-end exactly-once fault-tolerance guarantee.
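The core idea of a write-ahead log (append records durably to a log before applying them, then replay the log after a crash to recover) can be shown with a minimal file-based sketch in plain Python. The class and helper names are invented for this illustration:

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Toy WAL: every record is persisted to the log file before it is applied."""
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())       # make the write durable before processing

    def replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)

# 1. Log first, then apply: if the process dies right after append(),
#    the record is already safe on disk.
state = 0
for value in (10, 20, 12):
    wal.append({"value": value})
    state += value

# 2. After a "crash", the state is rebuilt by replaying the log.
recovered = sum(rec["value"] for rec in wal.replay())
print(recovered)   # same total as before the crash: 42
```

The ordering is the whole point: durability happens before processing, so recovery can never miss a record that was acknowledged.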

Question 15: Can you give some examples of special-purpose engines that can work as data sources for Spark, and what kind of functionality do they provide?

Answer: Spark supports special-purpose engines such as the following:

  • ElasticSearch: good for search
  • Kafka: good as a messaging system
  • Redis: good for caching

Question 16: You need to select a small dataset from stored data; however, it is not recommended that you use Spark for that. Why?

Answer: You should not use Spark for such use cases because Spark has to scan through all the stored files and then find your result in them. You should instead consider using an RDBMS or some other storage system that indexes particular columns of the data; data retrieval will be much faster in that case.
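The point about indexed storage can be demonstrated with SQLite from the Python standard library: a small point lookup hits a B-tree index instead of scanning every row, which is exactly what Spark's scan-the-files model lacks. The table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, f"user{i % 100}", i * 0.5) for i in range(10_000)],
)
conn.execute("CREATE INDEX idx_user ON events (user)")  # index the lookup column

# A small point lookup: the index jumps to matching rows
# instead of scanning all 10,000 records.
count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE user = ?", ("user7",)
).fetchone()[0]
print(count)   # 100 rows match user7

# EXPLAIN QUERY PLAN confirms the index is used rather than a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user = ?", ("user7",)
).fetchall()
print(plan[0][-1])
```

For a full aggregation over all rows Spark is a fine fit; it is the small, selective lookup where an indexed store wins.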

Question 17: You have created a DataFrame from multiple input files and want to save it into a single output file. What do you use?

Answer: Select the data you need (e.g. `select *`) and call `coalesce(1)` on the DataFrame before writing it out, so that all partitions are merged and written as a single output file.

Question 6: What is structured streaming?

Answer: Previously, streaming data was processed using DStream and its API. With Structured Streaming, stream data is instead processed using the built-in Spark SQL engine.

Question 7: What are the benefits of using Structured Streaming with Spark SQL?

Answer: In previous versions of Spark, you needed to learn a different API for processing streaming data, which made things harder for developers. With Structured Streaming:

  • Computation: You write a streaming computation the same way as a batch computation.
  • Continuous data handling: The Spark SQL engine takes care of running the streaming computation and updating the final result as data arrives continuously.
  • DataFrame/Dataset API: You can use the same DataFrame/Dataset API for streaming aggregations and for joins between stream data and batch data.
  • Fault tolerance: The system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and the write-ahead log (WAL).
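The "write a streaming computation the same way as a batch computation" point can be illustrated in plain Python (a toy, not the Spark API): the same aggregation function is applied once over a full batch and then incrementally over micro-batches, and the engine's job (here, a simple loop) is just to keep merging results as data arrives:

```python
from collections import Counter

def word_count(lines):
    """The same word-count logic, written once, used for both modes."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

batch_input = ["spark streams data", "spark sql"]

# Batch: process everything at once.
batch_result = word_count(batch_input)

# "Streaming": the same function runs on each micro-batch, and the
# final result is updated incrementally as data arrives.
stream_result = Counter()
for micro_batch in (["spark streams data"], ["spark sql"]):
    stream_result.update(word_count(micro_batch))

print(batch_result == stream_result)   # True: same code, same final answer
```

The developer writes `word_count` once; whether it runs over a static dataset or an unbounded stream is the engine's concern, which is the core promise of the unified API.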