
Question 15: You have a huge volume of application log data that has accumulated over the last six months, and about 2 GB of new log data is generated every day. You want to run analytics on this application log data, and for that you have provisioned a 30-node EMR cluster. You need to run SQL queries on this data. Which of the following tools are good for the given requirement?

A. Apache Hive

B. Presto

C. HCatalog

D. Oozie

E. HBase

1. A,B

2. B,C

3. C,D

4. D,E

5. A,E

Correct Answer: 1

Explanation: A huge volume of log data is already available and needs to be analyzed, and the EMR cluster is already provisioned. The main purpose of an EMR cluster is to run jobs on an execution engine such as Tez or MapReduce. However, you do not want to write complex MapReduce jobs yourself; you need a tool that can convert SQL queries into jobs for that engine and run them on the EMR cluster.

The most popular tool for this is Hive. Hive provides a query interface for structured or semi-structured data that has a schema defined for it. Its query language does not fully follow the ANSI SQL standard, but it is very close to it and provides many complex functions suited to this requirement. Hence, option A is correct.
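As a rough illustration of what "a schema defined for the data" and a SQL query over logs might look like, here is a minimal Python sketch that only builds the HiveQL text. The bucket path, table name, column names, tab-delimited layout, and date partitioning are all assumptions made for the example; none of them come from the question.

```python
# Hypothetical names: the S3 path, table name, and columns are placeholders
# chosen for illustration; they are not taken from the question.
LOG_LOCATION = "s3://example-bucket/app-logs/"


def create_table_ddl(table: str, location: str) -> str:
    """Return HiveQL that defines an external, date-partitioned table over raw log files."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  event_time STRING,\n"
        "  level STRING,\n"
        "  message STRING\n"
        ")\n"
        "PARTITIONED BY (log_date STRING)\n"
        # Assumes tab-separated log lines; '\\t' emits the two characters \t in the HiveQL text.
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{location}'"
    )


def errors_per_day_query(table: str, start_date: str) -> str:
    """Return a SQL aggregation counting ERROR lines per day from start_date onward."""
    return (
        "SELECT log_date, COUNT(*) AS error_count\n"
        f"FROM {table}\n"
        f"WHERE level = 'ERROR' AND log_date >= '{start_date}'\n"
        "GROUP BY log_date\n"
        "ORDER BY log_date"
    )


if __name__ == "__main__":
    print(create_table_ddl("app_logs", LOG_LOCATION))
    print()
    print(errors_per_day_query("app_logs", "2024-01-01"))
```

The printed statements could then be run from a Hive client on the cluster. With a partitioned external table like this one, the partitions would also need to be registered (for example with ALTER TABLE ... ADD PARTITION) before the query returns rows.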

Similarly, Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. Presto is included in Amazon EMR release version 5.0.0 and later. Hence, option B is also correct.

Details: Category: AWS Certified Big Data - Specialty
