Question 66: Can you use SQL to read data that is outside of a Hive or Impala table?

Answer: Yes. If the data files are outside of a Hive or Impala table, you can use Spark SQL directly to read JSON or Parquet files into a DataFrame by querying the file path.

df = sqlContext.sql("SELECT * FROM json.`/path/to/json_data_dir`")
df = sqlContext.sql("SELECT * FROM parquet.`/path/to/parquet_data_dir`")
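
For context, a minimal self-contained PySpark sketch of the same idea (the application name and paths are placeholders, and it assumes Spark 1.6 or later, where data files can be queried directly without registering a table first):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Placeholder application name and paths.
sc = SparkContext(appName="DirectFileSQL")
sqlContext = SQLContext(sc)

# Query Parquet files directly; no Hive or Impala table is involved.
df = sqlContext.sql("SELECT * FROM parquet.`/data/sample_parquet_dir`")
df.printSchema()
df.show(10)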

 

Question 67: Which storage systems are supported by Spark? Give a few examples that you have used so far.

Answer: Spark can access all the storage sources supported by Hadoop, including the local filesystem, HDFS, HBase, Amazon S3, and Microsoft ADLS.
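
As a hedged illustration, the storage system is usually selected just by the URI scheme in the path. The cluster, bucket, and account names below are placeholders, and the s3a and adl schemes assume the matching Hadoop connectors and credentials are configured:

# The URI scheme selects the storage backend; names below are placeholders.
local_rdd = sc.textFile("file:///tmp/local_input.txt")           # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/input.txt")    # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/data/input.txt")           # Amazon S3 (needs hadoop-aws + credentials)
adls_rdd = sc.textFile("adl://myaccount.azuredatalakestore.net/data/input.txt")  # ADLS (needs the ADLS connector)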

Question 68: Give examples of file types that are supported by Spark.

Answer: Spark supports many file types, including text files, RCFile, SequenceFile, any Hadoop InputFormat, Avro, Parquet, and compressed versions of the supported file types.
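
A short PySpark sketch of reading a few of these formats (paths are placeholders, and a SparkContext sc and SQLContext sqlContext are assumed as above; Avro support typically requires an external spark-avro package, so only built-in readers are shown):

# Paths are placeholders; each call returns an RDD or DataFrame.
text_rdd = sc.textFile("/data/logs.txt")                      # plain or compressed text
seq_rdd = sc.sequenceFile("/data/pairs.seq")                  # Hadoop SequenceFile of key/value pairs
parquet_df = sqlContext.read.parquet("/data/events_parquet")  # columnar Parquet files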

Question 69: Give sample API code to read and write files using Spark.

Answer: You can read compressed files using one of the following methods:

  • textFile(path)
  • hadoopFile(path, inputFormatClass)

You can save compressed files using one of the following methods:

  • saveAsTextFile(path, compressionCodecClass="codec_class")
  • saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")

where codec_class is the fully qualified name of one of the classes listed under Compression Types (for example, org.apache.hadoop.io.compress.GzipCodec).
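
A hedged PySpark sketch of reading a compressed text file and writing the result back out gzip-compressed (the paths are placeholders; gzip-compressed text is decompressed transparently by textFile):

# Placeholder input/output paths; textFile handles .gz input transparently.
lines = sc.textFile("/data/input/logs.gz")

errors = lines.filter(lambda line: "ERROR" in line)

# Write the result back out compressed with the Gzip codec.
errors.saveAsTextFile(
    "/data/output/errors_gz",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")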

Question 70: How do you access data stored in Amazon S3?

Answer: To access data stored in Amazon S3 from Spark applications, you can use the Hadoop file APIs for reading and writing RDDs, passing the S3 URL of the bucket and path as the file path. The following methods accept S3 URLs (see the sketch after this list):

  • hadoopFile
  • saveAsHadoopFile
  • newAPIHadoopRDD
  • saveAsNewAPIHadoopFile
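
As a hedged sketch, the bucket name and key prefixes below are placeholders, and it assumes the hadoop-aws (s3a) connector is on the classpath and AWS credentials are configured. The same s3a:// URLs can be passed to the Hadoop-file methods above; the simpler textFile and saveAsTextFile calls are shown here for brevity:

# Placeholders: replace my-bucket and the key prefixes with real values.
# Credentials can be supplied via fs.s3a.access.key / fs.s3a.secret.key in
# the Hadoop configuration, or through IAM roles on the cluster.
s3_rdd = sc.textFile("s3a://my-bucket/input/logs/")

warnings = s3_rdd.filter(lambda line: "WARN" in line)

warnings.saveAsTextFile("s3a://my-bucket/output/warnings/")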