Question 66: Can you use SQL to read data that is outside of a Hive or Impala table?

Answer: Yes. If the data files are outside of a Hive or Impala table, you can use Spark SQL directly to read JSON or Parquet files into a DataFrame by querying the file path.

df = sqlContext.sql("SELECT * FROM json.`/path/to/json_data_dir`")
df = sqlContext.sql("SELECT * FROM parquet.`/path/to/parquet_data_dir`")
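
For context, a minimal self-contained PySpark sketch of the same idea (the application name and paths are placeholders, and it assumes Spark 1.6 or later, where data files can be queried directly without registering a table first):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Placeholder application name and paths.
sc = SparkContext(appName="DirectFileSQL")
sqlContext = SQLContext(sc)

# Query Parquet files directly; no Hive or Impala table is involved.
df = sqlContext.sql("SELECT * FROM parquet.`/data/sample_parquet_dir`")
df.printSchema()
df.show(10)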

 

Question 67: Which storage systems are supported by Spark? Give a few examples that you have used so far.

Answer: Spark can access all the storage sources supported by Hadoop, including the local filesystem, HDFS, HBase, Amazon S3, and Microsoft ADLS.
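
As a hedged illustration, the storage system is usually selected just by the URI scheme in the path. The cluster, bucket, and account names below are placeholders, and the s3a and adl schemes assume the matching Hadoop connectors and credentials are configured:

# The URI scheme selects the storage backend; names below are placeholders.
local_rdd = sc.textFile("file:///tmp/local_input.txt")           # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/input.txt")    # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/data/input.txt")           # Amazon S3 (needs hadoop-aws + credentials)
adls_rdd = sc.textFile("adl://myaccount.azuredatalakestore.net/data/input.txt")  # ADLS (needs the ADLS connector)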

Question 68: Give examples of file types that are supported by Spark.

Answer: Spark supports many file types, including text files, RCFile, SequenceFile, any Hadoop InputFormat, Avro, Parquet, and compressed versions of the supported file types.
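
A short PySpark sketch of reading a few of these formats (paths are placeholders, and a SparkContext sc and SQLContext sqlContext are assumed as above; Avro support typically requires an external spark-avro package, so only built-in readers are shown):

# Paths are placeholders; each call returns an RDD or DataFrame.
text_rdd = sc.textFile("/data/logs.txt")                      # plain or compressed text
seq_rdd = sc.sequenceFile("/data/pairs.seq")                  # Hadoop SequenceFile of key/value pairs
parquet_df = sqlContext.read.parquet("/data/events_parquet")  # columnar Parquet files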

Question 69: Give sample API code to read and write files using Spark.

Answer: You can read compressed files using one of the following methods:

  • textFile(path)
  • hadoopFile(path, inputFormatClass)

You can save compressed files using one of the following methods:

  • saveAsTextFile(path, compressionCodecClass="codec_class")
  • saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")

where codec_class is the fully qualified name of one of the classes listed under Compression Types (for example, org.apache.hadoop.io.compress.GzipCodec).
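
A hedged PySpark sketch of reading a compressed text file and writing the result back out gzip-compressed (the paths are placeholders; gzip-compressed text is decompressed transparently by textFile):

# Placeholder input/output paths; textFile handles .gz input transparently.
lines = sc.textFile("/data/input/logs.gz")

errors = lines.filter(lambda line: "ERROR" in line)

# Write the result back out compressed with the Gzip codec.
errors.saveAsTextFile(
    "/data/output/errors_gz",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")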

Question 70: How do you access data stored in Amazon S3?

Answer: To access data stored in Amazon S3 from Spark applications, you can use the Hadoop file APIs for reading and writing RDDs, passing the S3 URL of the bucket and path as the file path. The following methods accept S3 URLs (see the sketch after this list):

  • hadoopFile
  • saveAsHadoopFile
  • newAPIHadoopRDD
  • saveAsNewAPIHadoopFile
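
As a hedged sketch, the bucket name and key prefixes below are placeholders, and it assumes the hadoop-aws (s3a) connector is on the classpath and AWS credentials are configured. The same s3a:// URLs can be passed to the Hadoop-file methods above; the simpler textFile and saveAsTextFile calls are shown here for brevity:

# Placeholders: replace my-bucket and the key prefixes with real values.
# Credentials can be supplied via fs.s3a.access.key / fs.s3a.secret.key in
# the Hadoop configuration, or through IAM roles on the cluster.
s3_rdd = sc.textFile("s3a://my-bucket/input/logs/")

warnings = s3_rdd.filter(lambda line: "WARN" in line)

warnings.saveAsTextFile("s3a://my-bucket/output/warnings/")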