Question 8: How is Structured Streaming handled internally by Spark?

Answer: As mentioned previously, Structured Streaming queries are processed by the Spark SQL engine, which by default treats a data stream as a series of small batch jobs. This micro-batch approach achieves end-to-end latencies as low as 100 milliseconds and guarantees that data is processed exactly once (very important).
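
To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not Spark code and does not reflect Spark's actual scheduler; the names `micro_batches` and `interval_ms` and the `(arrival time, payload)` record shape are assumptions made for illustration. It only shows how a continuous sequence of records can be cut into small batches, one per trigger interval, each of which is then handled like an ordinary batch job.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # (arrival time in ms, payload); hypothetical shape


def micro_batches(records: Iterable[Record], interval_ms: int) -> Iterator[List[Record]]:
    """Group records (assumed sorted by arrival time) into one batch per interval."""
    batch: List[Record] = []
    batch_end = None
    for arrival_ms, payload in records:
        if batch_end is None:
            # End of the interval that contains the first record.
            batch_end = arrival_ms - arrival_ms % interval_ms + interval_ms
        while arrival_ms >= batch_end:
            # The current interval is over: emit what was collected, if anything.
            if batch:
                yield batch
                batch = []
            batch_end += interval_ms
        batch.append((arrival_ms, payload))
    if batch:
        yield batch


stream = [(10, "a"), (40, "b"), (120, "c"), (130, "d"), (310, "e")]
batches = list(micro_batches(stream, interval_ms=100))

# Each inner list would be processed as one small batch job.
for number, batch in enumerate(batches):
    print(number, [payload for _, payload in batch])
```

With a 100 ms interval, the five records above fall into three batches; the empty interval between 200 ms and 300 ms produces no batch in this sketch.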

 

Question 9: Is there any option available so that streaming data can be processed with latencies below 100 milliseconds?

Answer: Spark 2.3 introduced a new low-latency processing mode called “Continuous Processing”, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. The good point about this feature is that you don’t have to change your Dataset/DataFrame code; you only change the mode to get low latency, if your application needs it.
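
As a rough sketch of what "just change the mode" can look like in PySpark: the mode is chosen through the trigger on the stream writer, using the `processingTime` keyword for micro-batches or the `continuous` keyword for continuous processing. The helper names `trigger_options` and `start_console_query`, the console sink, and the "1 second" interval are assumptions for illustration, and `df` is assumed to be an existing streaming DataFrame. Note that continuous processing supports only a subset of operations and sinks, so treat this as a starting point rather than a drop-in switch.

```python
from typing import Dict


def trigger_options(low_latency: bool, interval: str = "1 second") -> Dict[str, str]:
    """Build keyword arguments for DataStreamWriter.trigger().

    processingTime -> default micro-batch mode (value is the batch interval)
    continuous     -> continuous processing (value is the checkpoint interval)
    """
    key = "continuous" if low_latency else "processingTime"
    return {key: interval}


def start_console_query(df, low_latency: bool = False):
    """Start a query on an existing streaming DataFrame `df` (hypothetical helper).

    The DataFrame logic is untouched; only the trigger differs between modes.
    """
    return (
        df.writeStream
        .format("console")
        .trigger(**trigger_options(low_latency))
        .start()
    )


print(trigger_options(low_latency=False))
print(trigger_options(low_latency=True))
```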

Question 10: What do you think about the internal implementation of Structured Streaming that allows it to keep the same API?

Answer: The key point here is how the programming model is implemented in Spark Structured Streaming. Structured Streaming treats a live data stream as a table that is continuously appended to, so the code you write looks just like a batch query. Spark runs that query against the table, but executes it incrementally over this unbounded table. Each query run is like running a query on static data, and you can think of every new piece of data as new rows appended to the existing unbounded table.
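
The unbounded-table model can be illustrated with a small plain-Python sketch. This is not how Spark is implemented; the class `RunningWordCount` and the function `batch_word_count` are hypothetical names. The sketch only shows the idea that an incremental query, which sees just the newly appended rows and updates its previous result, gives the same answer as re-running a batch query over the whole table.

```python
from collections import Counter
from typing import Iterable, List


class RunningWordCount:
    """Incremental query: keeps only the running result, not the input table."""

    def __init__(self) -> None:
        self.result: Counter = Counter()

    def process_new_rows(self, new_rows: Iterable[str]) -> Counter:
        # Only the newly appended rows are read; the old result is updated.
        self.result.update(new_rows)
        return self.result


def batch_word_count(table: List[str]) -> Counter:
    """Batch query: recomputes the count over the full (static) table."""
    return Counter(table)


unbounded_table: List[str] = []
query = RunningWordCount()

for new_rows in (["cat", "dog"], ["dog"], ["owl", "cat"]):
    unbounded_table.extend(new_rows)          # new data = new rows in the table
    incremental = query.process_new_rows(new_rows)
    # The incremental result matches a full batch query at every step.
    assert incremental == batch_word_count(unbounded_table)

print(dict(query.result))
```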

 

Question 11: What is event-time data, and what is it used for?

Answer: When you receive data, it may or may not carry a timestamp embedded in the message contents. If it does, that timestamp is the event time, that is, the time at which the event was generated at the source rather than the time it was received. Suppose you want to count how many events were generated in the last 5 minutes. Events may arrive out of order or as duplicates, and sometimes, for reasons such as a network failure, an event arrives quite late. The system can use the embedded event time to determine when each event actually occurred. This is very useful in the IoT world.
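
The 5-minute counting example can be sketched in plain Python as follows. This is an illustration only, not Spark code: the `(event id, event time)` tuple shape, the function name `count_by_event_time`, and the use of tumbling 5-minute windows are assumptions. Because each event is assigned to a window by its embedded event time, and duplicates are skipped by id, the arrival order does not affect the result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

WINDOW_SECONDS = 5 * 60

Event = Tuple[str, int]  # (event id, event time in epoch seconds); hypothetical shape


def count_by_event_time(events: Iterable[Event]) -> Dict[int, int]:
    """Count distinct events per 5-minute window, keyed by window start time."""
    seen: Set[str] = set()
    counts: Dict[int, int] = defaultdict(int)
    for event_id, event_time in events:
        if event_id in seen:
            continue  # duplicate delivery of the same event
        seen.add(event_id)
        # The window is chosen from the embedded event time, not arrival order.
        window_start = event_time - event_time % WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)


# e2 and e4 arrive late and out of order; e3 is delivered twice.
arrivals = [("e1", 10), ("e3", 320), ("e2", 290), ("e3", 320), ("e4", 15)]
print(count_by_event_time(arrivals))
```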

Question 12: What is the use of watermarking in Structured Streaming?

Answer: As discussed in the previous question, events can be received in any order, and the time may be embedded in the event itself. Spark 2.1 introduced watermarking, which lets you specify a threshold for how late data is expected to be. If an event is older than this threshold, measured against the latest event time the engine has seen, it is considered too late and can be discarded.
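
The watermark rule can be sketched in plain Python as shown below. This is a simplified model, not Spark's implementation: here the watermark moves after every event, whereas Spark advances it between micro-batches, and the function name `apply_watermark` and the event shape are assumptions. In PySpark itself the threshold is declared on the DataFrame with `withWatermark`, for example `df.withWatermark("eventTime", "10 minutes")`, where the column name is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Event = Tuple[str, int]  # (event id, event time in seconds); hypothetical shape


def apply_watermark(
    events: Iterable[Event], delay_seconds: int
) -> Tuple[List[Event], List[Event]]:
    """Split events into (accepted, dropped) using a simplified watermark.

    watermark = latest event time seen so far - delay_seconds
    """
    max_event_time: Optional[int] = None
    accepted: List[Event] = []
    dropped: List[Event] = []
    for event in events:
        _, event_time = event
        if max_event_time is not None and event_time < max_event_time - delay_seconds:
            dropped.append(event)  # older than the watermark: too late
            continue
        accepted.append(event)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time  # the watermark only moves forward
    return accepted, dropped


arrivals = [("a", 100), ("b", 160), ("c", 120), ("d", 170), ("e", 100)]
accepted, dropped = apply_watermark(arrivals, delay_seconds=60)
print("accepted:", [event_id for event_id, _ in accepted])
print("dropped:", [event_id for event_id, _ in dropped])
```

In this run, event "c" is late but still within the 60-second threshold, so it is kept; event "e" arrives after the latest event time has reached 170, which puts it behind the watermark of 110, so it is dropped.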