Question 36: What do you need to load data from an AWS S3 bucket in Spark?
Answer: You can load data directly from an AWS (Amazon Web Services) S3 bucket. You need the following three things:
- URL for the file stored in the bucket
- AWS Access Key ID
- AWS Secret Access Key
Once you have this info, you can load data from the S3 bucket as below.
sc.textFile("s3n://hadoopexam_bucket/data_file.txt") |
You can pass the keys explicitly in the URL as well:
sc.textFile("s3n://AWSAccessKey:AWSSecretKey@svr/filepath") |
Question 37: What are the possible ways of working with Spark SQL?
Answer: You can interact with Spark SQL using the SQL, DataFrame, and Dataset APIs. Whichever mechanism you use, the underlying execution engine remains the same. However, since Spark 2.0, the Dataset API is the preferred way.
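As an illustration, the same query can be written with each mechanism. A minimal sketch, assuming a SparkSession named spark (as in spark-shell) and a hypothetical people.json file with name and age fields:

import spark.implicits._
case class Person(name: String, age: Long)

val df = spark.read.json("people.json")       // DataFrame API
df.filter($"age" > 21).show()

df.createOrReplaceTempView("people")          // SQL
spark.sql("SELECT name FROM people WHERE age > 21").show()

val ds = df.as[Person]                        // typed Dataset API
ds.filter(_.age > 21).show()

All three variants run through the same Catalyst optimizer and produce the same result.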
Question 38: Which SQL interactions are supported by Spark SQL?
Answer: You can use both basic ANSI SQL and HiveQL with Spark SQL, and you can use either to read data stored in Hive. You can also interact with the Spark SQL interface through the command line or over JDBC/ODBC.
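To run HiveQL against Hive tables, build the session with Hive support enabled. A minimal sketch, assuming a working Hive configuration and a hypothetical employees table:

import org.apache.spark.sql.SparkSession

// Enable Hive support so HiveQL and the Hive metastore are available.
val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()
  .getOrCreate()

// Run a HiveQL query against the hypothetical Hive table.
spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()

For JDBC/ODBC access, Spark ships a Thrift server that can be started with sbin/start-thriftserver.sh.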
Question 39: What is a DataFrame?
Answer: It is a collection of data distributed over the nodes in a Spark cluster. Similar to a table, a DataFrame's columns have names. You can think of it as an RDBMS table (though not exactly the same).
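For example, the named columns show up in the schema, much like a table definition. A minimal sketch, assuming a SparkSession named spark (as in spark-shell):

import spark.implicits._

// Build a small DataFrame in memory; the columns get explicit names.
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

df.printSchema()           // shows named, typed columns
df.select("name").show()   // columns are addressed by name, as in SQL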
Question 40: What are the possible sources for creating a DataFrame?
Answer: You can create a DataFrame from a variety of sources: structured data files, Hive tables, external databases (SQL and NoSQL), and existing RDDs, which can be converted to DataFrames.
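A minimal sketch of each source type, assuming spark-shell's spark and sc; the file path, table name, and JDBC settings are placeholders:

import spark.implicits._

// From a structured data file (JSON here; Parquet, ORC, and CSV work similarly).
val fromFile = spark.read.json("people.json")

// From a Hive table (requires a session built with enableHiveSupport()).
val fromHive = spark.sql("SELECT * FROM employees")

// From an external database over JDBC.
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/testdb")
  .option("dbtable", "orders")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()

// From an existing RDD, converted with toDF.
val fromRdd = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")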