Question 126: What are some advantages of Structured Streaming compared to DStream?

Answer: As we saw in the previous question, DStream has various issues; that is why Structured Streaming was introduced in Apache Spark. It offers the following advantages:

  • Fast
  • Fault-tolerant
  • Exactly-once stream processing
  • Input data can be thought of as an append-only table (it grows continuously)
  • Trigger: You can specify a trigger, which checks for new input data at a defined time interval.
  • API: The high-level API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame and Dataset APIs.
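The points above can be seen together in a minimal Structured Streaming sketch in Scala. This is an illustration only: the socket source on localhost:9999 and the 10-second trigger interval are assumptions chosen for demonstration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // The input stream is treated as an unbounded, append-only table.
    // (Assumes a text server, e.g. netcat, on localhost:9999.)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Trigger: check for new input data every 10 seconds.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

Because the query runs on the Spark SQL engine, the same DataFrame operations used on static data work unchanged on the stream.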

Question 127: You have received a complaint from the security department that AWS bucket access credentials are visible in a log file. Where is the mistake?

Answer: It is because the credentials are stored in a non-recommended way. You might have used either of the methods below for providing credentials to access the S3 bucket:

  • Specified credentials at runtime, using configuration properties, for example:

sc.hadoopConfiguration.set("fs.s3a.access.key", "...")

 

  • Or you have configured these credentials in the core-site.xml file.

Neither of the above configurations is recommended if you want your data fully secured. Use the Hadoop Credential Provider instead.
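As a sketch, the credentials can be moved into a Java keystore with the `hadoop credential` command and then referenced through the `hadoop.security.credential.provider.path` property, so they never appear in plain-text configuration or logs. The keystore path below is an example placeholder, not a fixed value:

```shell
# Store the S3A credentials in a protected keystore
# (jceks://hdfs@namenode/... is an example path; adjust to your cluster)
hadoop credential create fs.s3a.access.key \
  -provider jceks://hdfs@namenode/user/spark/s3.jceks

hadoop credential create fs.s3a.secret.key \
  -provider jceks://hdfs@namenode/user/spark/s3.jceks

# Then point Hadoop/Spark at the keystore, e.g. in core-site.xml:
#   <property>
#     <name>hadoop.security.credential.provider.path</name>
#     <value>jceks://hdfs@namenode/user/spark/s3.jceks</value>
#   </property>
```

With this in place, the S3A connector resolves the keys from the keystore at runtime instead of reading them from plain configuration values.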

 

Question 128: Can you give some examples of special-purpose engines that can work as data sources for Spark, and what kind of functionality do they provide?

Answer: Spark supports the special-purpose engines below:

  • ElasticSearch: Good for search
  • Kafka: Good for messaging
  • Redis: Good for caching
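As an example of one of these engines acting as a Spark data source, the sketch below reads a Kafka topic as a streaming DataFrame. The topic name "events" and the broker address are assumptions for illustration, and the spark-sql-kafka connector must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-source-sketch")
  .master("local[2]")
  .getOrCreate()

// Topic "events" and broker localhost:9092 are assumed examples;
// requires the spark-sql-kafka-0-10 connector package.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Kafka delivers binary key/value columns; cast them to strings for use
val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```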

Question 129: You need to select a small dataset from stored data; however, it is not recommended that you use Spark for that. Why?

Answer: You should not use Spark for such use cases because Spark has to scan through all the stored files to find your result. Consider using an RDBMS or some other storage engine that indexes particular columns of the data; retrieval of a small result set will be much faster that way.

Question 130: You have created a DataFrame from multiple input files and want to save it into a single output file. What do you use?

Answer: Select all the data (select *) and use the DataFrame's coalesce(1) method, which reduces the DataFrame to a single partition so that the write produces a single output file.
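A minimal sketch of this in Scala, assuming example input and output paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("single-file-output")
  .master("local[*]")
  .getOrCreate()

// The paths below are example placeholders
val df = spark.read.json("/data/input/*.json")

// coalesce(1) merges all partitions into one,
// so the write produces a single part file in the output directory
df.coalesce(1)
  .write
  .mode("overwrite")
  .json("/data/output")
```

Note that coalescing to one partition forces all data through a single task, so this is only advisable when the result is small enough to fit comfortably on one executor.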