Question 66: Can you use SQL to read data that is outside a Hive or Impala table?
Answer: If the data files are outside of a Hive or Impala table, you can still use SQL directly to read JSON or Parquet data into a DataFrame, for example:
df = sqlContext.sql("SELECT * FROM json.`/path/to/json_data_dir`")
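A slightly fuller, self-contained sketch of the same idea (the directory paths are placeholders, and it assumes a Spark 1.x-style SQLContext; on Spark 2.x you would call spark.sql the same way):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="QueryFilesDirectly")
sqlContext = SQLContext(sc)

# Query JSON files in a directory that is not backed by any Hive/Impala table
json_df = sqlContext.sql("SELECT * FROM json.`/data/events_json`")

# The same pattern works for a directory of Parquet files
parquet_df = sqlContext.sql("SELECT * FROM parquet.`/data/events_parquet`")

json_df.show()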
Question 67: Which storage systems does Spark support? Give a few examples you have used so far.
Answer: Spark can access all the storage sources supported by Hadoop, including the local filesystem, HDFS, HBase, Amazon S3, and Microsoft ADLS.
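As an illustration only (all paths, host names, and bucket names below are placeholders, and the sketch assumes the cluster already has the required S3/ADLS connectors and credentials configured):
from pyspark import SparkContext

sc = SparkContext(appName="StorageSourcesExample")

# Local filesystem (the file must be available on every worker node)
local_rdd = sc.textFile("file:///tmp/sample.txt")

# HDFS
hdfs_rdd = sc.textFile("hdfs://namenode:8020/user/data/sample.txt")

# Amazon S3, via the S3A connector
s3_rdd = sc.textFile("s3a://my-bucket/data/sample.txt")

# Microsoft ADLS (Gen1), via the adl:// scheme
adls_rdd = sc.textFile("adl://myaccount.azuredatalakestore.net/data/sample.txt")

print(local_rdd.count())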
Question 68: Which file types are supported by Spark? Give a few examples.
Answer: Spark supports many file types, including text files, RCFiles, SequenceFiles, Hadoop InputFormat, Avro, Parquet, and compressed versions of the supported file types.
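A short sketch of reading a few of these formats (all paths are placeholders; the Avro read assumes the external spark-avro data source package is on the classpath):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="FileFormatsExample")
sqlContext = SQLContext(sc)

# Plain text file -> RDD of lines
text_rdd = sc.textFile("/data/logs.txt")

# Hadoop SequenceFile -> RDD of (key, value) pairs
seq_rdd = sc.sequenceFile("/data/seq_dir")

# Parquet directory -> DataFrame
parquet_df = sqlContext.read.parquet("/data/events_parquet")

# Avro directory -> DataFrame (assumes the com.databricks.spark.avro package is available)
avro_df = sqlContext.read.format("com.databricks.spark.avro").load("/data/events_avro")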
Question 69: Give sample API code to read and write files using Spark.
Answer: You can read compressed files using one of the following methods:
- textFile(path)
- hadoopFile(path, inputFormatClass)
You can save compressed files using one of the following methods:
- saveAsTextFile(path, compressionCodecClass="codec_class")
- saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")
Here, codec_class is one of the classes listed under the supported compression types (for example, org.apache.hadoop.io.compress.GzipCodec); a combined read/write sketch follows below.
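A minimal sketch putting these calls together, assuming GzipCodec and placeholder paths:
from pyspark import SparkContext

sc = SparkContext(appName="CompressedIOExample")

# textFile() decompresses supported compressed files (e.g. .gz) transparently
lines = sc.textFile("/data/raw/*.gz")

# Save the RDD back out as gzip-compressed text files
lines.saveAsTextFile(
    "/data/out_text_gz",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# Save a pair RDD as a compressed SequenceFile via saveAsHadoopFile()
pairs = lines.map(lambda line: (len(line), line))
pairs.saveAsHadoopFile(
    "/data/out_seq_gz",
    "org.apache.hadoop.mapred.SequenceFileOutputFormat",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")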
Question 70: How do you access data stored in Amazon S3?
Answer: To access data stored in Amazon S3 from Spark applications, use the Hadoop file APIs to read and write RDDs, passing the S3 bucket URL as the path. The following methods accept S3 paths (see the sketch after this list):
- hadoopFile
- saveAsHadoopFile
- newAPIHadoopRDD
- saveAsNewAPIHadoopFile
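A minimal sketch, assuming the S3A connector is available on the cluster; the bucket name, paths, and keys are placeholders, and in practice credentials usually come from core-site.xml or instance roles rather than application code. The simpler textFile()/saveAsTextFile() calls accept the same s3a:// URLs and are used here for brevity:
from pyspark import SparkContext

sc = SparkContext(appName="S3AccessExample")

# For illustration only; prefer configuring credentials outside application code
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Read text data directly from an S3 bucket URL
rdd = sc.textFile("s3a://my-bucket/input/data.txt")

# Write the filtered results back to S3
rdd.filter(lambda line: "ERROR" in line).saveAsTextFile("s3a://my-bucket/output/errors")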