application log data, and for that you have provisioned a 30-node EMR cluster. You need to run SQL queries on this data. Which of the following tools are suitable for this requirement?
A. Apache Hive
B. Presto
C. HCatalog
D. Oozie
E. HBase
1. A,B
2. B,C
3. C,D
4. D,E
5. A,E
Correct Answer : 1
Exp : A large volume of log data is already available and needs to be analyzed, and the EMR cluster is already provisioned. The main purpose of an EMR cluster is to run jobs, for example on the Tez or MapReduce engine. However, you do not want to write complex MapReduce jobs yourself; you need a tool that can convert SQL queries into MapReduce jobs and run them on the EMR cluster.
The most popular tool for this is Hive. Hive provides a query interface for running queries on structured or semi-structured data that has a schema defined for it. Its query language (HiveQL) is not fully ANSI SQL compliant, but it is very close to the standard and provides many complex functions. Hence, option A is correct.
Similarly, Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. Presto is included in Amazon EMR release version 5.0.0 and later. Hence, option B is also correct.
The remaining options are not SQL query engines: HCatalog is a table and storage management layer, Oozie is a workflow scheduler, and HBase is a NoSQL data store.
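To illustrate why a SQL tool is preferred over hand-written MapReduce jobs, the sketch below counts log records per level twice: once with explicit map, shuffle, and reduce steps, and once with a single SQL statement. It is only an illustration under stated assumptions: the sample records, the app_logs table, and the function names are hypothetical, and Python's built-in sqlite3 module is used purely as a local stand-in for a SQL engine. It is not Hive or Presto, and on EMR a similar GROUP BY query would be submitted to one of those tools instead.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample log records: (timestamp, level, message).
LOGS = [
    ("2020-01-01T00:00:01", "ERROR", "disk full"),
    ("2020-01-01T00:00:02", "INFO", "job started"),
    ("2020-01-01T00:00:03", "ERROR", "timeout"),
    ("2020-01-01T00:00:04", "WARN", "slow response"),
]


def count_by_level_mapreduce(records):
    """Hand-written map, shuffle, and reduce steps for a per-level count."""
    # Map: emit a (level, 1) pair for each record.
    pairs = [(level, 1) for _, level, _ in records]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: sum the values for each key.
    return {key: sum(values) for key, values in grouped.items()}


def count_by_level_sql(records):
    """The same aggregation expressed as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")  # local stand-in, not Hive or Presto
    conn.execute("CREATE TABLE app_logs (ts TEXT, level TEXT, message TEXT)")
    conn.executemany("INSERT INTO app_logs VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT level, COUNT(*) FROM app_logs GROUP BY level"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same per-level counts for the sample data. The SQL version states only what result is wanted, which is the convenience the explanation attributes to tools such as Hive and Presto.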