Question 14: You have a huge volume of CSV files in an S3 bucket and you want to query these files using SQL. You have already provisioned a 10-node EMR cluster with the Hive application for running Hive queries. However, the data stored in the S3 bucket needs to have more than one schema, and these schemas must be persisted so that they remain available even if the EMR cluster is down or terminated. Which of the following is the best solution for persisting the schema even after terminating the EMR cluster?

A. You have to use MySQL as a metastore for the Hive application.

B. You have to use Oracle RDS as a metastore for the EMR MapReduce application.

C. You will be updating the JDBC configuration in the hive-site.xml file.

D. You will be updating the configuration in the Oozie application.

E. You will be enabling Sqoop (SQL to Hadoop) application.

1. A,B

2. B,C

3. C,D

4. D,E

5. A,C

Correct Answer : 5 Explanation : The question sets out the following requirements:

- Data stored in CSV format in the S3 bucket must be queryable using SQL. Hive on EMR is one of the correct solutions for this.

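- You should be able to represent the same data with more than one schema. This is possible with the Hive metastore, which stores the schemas (table definitions) of the data in an RDBMS. By default, EMR keeps the Hive metastore in a MySQL database on the master node, so the schemas are lost when the cluster is terminated. Using an external MySQL database as the metastore keeps them available. Hence, of the given options, option A is correct, because it uses a MySQL RDBMS as the metastore for the Hive application. The metastore belongs to Hive, not to the EMR MapReduce application, so we can discard option B.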

- Having confirmed that we will use MySQL as the metastore, we need to provide the JDBC connection details for it. The hive-site.xml file is where this is configured, so option C is also correct.

- Oozie is a solution for creating workflows, not for storing schemas. Hence, we cannot consider option D correct.

- Sqoop is a tool for transferring data between an RDBMS and HDFS or Hive, in either direction. It is not a solution for the metastore, so we can discard option E.
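
To illustrate options A and C together, here is a minimal sketch of the JDBC settings that would go into hive-site.xml, expressed as an EMR `hive-site` configuration classification. The `javax.jdo.option.*` keys are the standard Hive metastore connection properties. The helper name `hive_site_classification`, the host `metastore.example.com`, the database name, and the credentials are all hypothetical placeholders, and the driver class shown is an assumption that may differ depending on the EMR release.

```python
import json


def hive_site_classification(host, port, database, user, password):
    """Build a 'hive-site' classification pointing Hive at an external MySQL metastore.

    Hypothetical helper for illustration only; all argument values are placeholders.
    """
    # JDBC URL of the external MySQL database that will hold the Hive schemas.
    jdbc_url = f"jdbc:mysql://{host}:{port}/{database}?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                # Assumed driver class; check the one shipped with your EMR release.
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Placeholder values, not real endpoints or credentials.
config = hive_site_classification(
    "metastore.example.com", 3306, "hive", "hive_user", "change-me"
)
print(json.dumps(config, indent=2))
```

The printed JSON could be supplied as the cluster configuration when the EMR cluster is launched. Because the schemas then live in the external MySQL database rather than on the master node, a new cluster started with the same settings would see the same table definitions.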
