Question 13: How can we use the DataFrame/Dataset API with Structured Streaming?

Answer: From Spark 2.0 onwards, the DataFrame and Dataset APIs have been unified so that the same operations work on both static (bounded) data and streaming (unbounded) data. Using the common entry point, SparkSession, you can read a stream as a DataFrame/Dataset and apply the same operations/APIs you would use on static data.
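A minimal sketch of this idea is shown below, assuming a local SparkSession and the built-in "rate" test source (the `timestamp` and `value` columns come from that source); the transformations are the same ones you would apply to a static DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingDataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredStreamingDataFrame")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // readStream returns an unbounded DataFrame; the API mirrors spark.read for static data.
    val streamingDF = spark.readStream
      .format("rate")                 // built-in test source that emits rows at a fixed rate
      .option("rowsPerSecond", "5")
      .load()

    // The same DataFrame operations used on static data apply to the streaming DataFrame.
    val evenValues = streamingDF
      .filter($"value" % 2 === 0)
      .select($"timestamp", $"value")

    // Start the query and write the results to the console sink.
    val query = evenValues.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```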

Question 14: Can you give some examples of streaming data sources?

Answer: Spark provides several built-in streaming data sources for components that are widely used, for example (a short sketch follows the list):

  • File: Any new file that arrives in a monitored directory is treated as part of the stream of data.
  • Kafka: Reads data from the Kafka messaging engine.
  • Socket: Reads text data from a socket connection (UTF-8 only). Avoid using it in production, because it does not provide end-to-end fault tolerance.
  • Rate: Generates a fixed number of rows every second; useful for testing and benchmarking.
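The sketch below shows how each of these sources is created through `spark.readStream`; the directory path, broker address, topic name, port, and schema are placeholder assumptions, not values from the original text, and the Kafka source additionally requires the spark-sql-kafka connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSourcesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingSources")
      .master("local[*]")
      .getOrCreate()

    // File source: every new file landing in the directory becomes part of the stream.
    val fileStream = spark.readStream
      .format("csv")
      .option("header", "true")
      .schema("id INT, name STRING")       // streaming file sources require an explicit schema
      .load("/tmp/streaming-input")        // hypothetical input directory

    // Kafka source: reads from one or more Kafka topics.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
      .option("subscribe", "events")                        // hypothetical topic
      .load()

    // Socket source: UTF-8 text only; for testing, not production (no end-to-end fault tolerance).
    val socketStream = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Rate source: generates a fixed number of rows per second.
    val rateStream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Example: print the rate stream to the console.
    rateStream.writeStream.format("console").start().awaitTermination()
  }
}
```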