Question-6: What is the use of Apache Kudu?

Answer: Apache Kudu is a Hadoop-native columnar storage engine designed for fast analytics on fast (rapidly changing) data. It complements the capabilities of HDFS and HBase.

 

Question-7: What is Cloudera CDH?

Answer: It is a distribution from Cloudera for Hadoop and its related projects. CDH is an open source product which includes many projects; a few examples are below.

  • Hive
  • Impala
  • Kudu
  • Sentry
  • Spark

CDH is considered a unified solution for batch processing, interactive SQL, interactive search, machine learning, statistical computation, and role-based access control.

 

Question-8: Please tell me something about Apache Hive.

Answer: Hive is a data warehouse solution for reading, writing, and managing large datasets in distributed storage such as HDFS, using Hive Query Language (HiveQL, almost the same as SQL). These queries are converted into a series of jobs that execute on a Hadoop cluster using either MapReduce or Spark.
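As a minimal sketch of what this looks like (the table and column names here are hypothetical), a HiveQL statement reads almost exactly like SQL, but the aggregation below is translated by Hive into distributed MapReduce or Spark jobs:

```sql
-- Hypothetical table stored in distributed storage (e.g. HDFS).
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  view_ts TIMESTAMP
)
STORED AS PARQUET;

-- Hive converts this query into a series of cluster jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```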

 

Question-9: There are many tools available for querying data, so why use Hive?

Answer: Hive is a petabyte-scale data warehouse system built on the Hadoop platform, and one of the best available choices when you expect high growth in data volume. Hive on either MapReduce or Spark is best suited for batch data preparation and ETL.
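For example, the engine that runs a session's translated jobs can be switched with the `hive.execution.engine` property (the table names and filter below are illustrative; a typical batch ETL step might look like this):

```sql
-- Choose the execution engine for this session (mr = classic MapReduce).
SET hive.execution.engine=spark;

-- Typical batch ETL: deduplicate one day of raw data into a clean table.
INSERT OVERWRITE TABLE clean_orders
SELECT DISTINCT *
FROM raw_orders
WHERE order_date = '2024-01-15';
```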

 

Question-10: Can you please give me some use cases where Hive should be used?

Answer: Let’s look at a few use cases:

  • Suppose you have large ETL sort and join jobs to prepare data for BI users in Impala; schedule such ETL jobs in Hive. 
  • Suppose you have a job where data transfer or conversion takes many hours and there is a possibility of failure partway through; run such activity in Hive, which can help you recover and continue where it left off.
  • Suppose you receive data in various formats; Hive SerDes together with a variety of UDFs can help convert the data into a single format.
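The last use case above can be sketched in HiveQL (the table names and HDFS path are hypothetical; the JSON SerDe class shown ships with Hive's hcatalog module, but check the class name against your Hive version):

```sql
-- External table over raw JSON files, parsed by a SerDe at read time.
CREATE EXTERNAL TABLE raw_events_json (
  event_id STRING,
  payload  STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/events';

-- Rewrite everything into a single columnar format for downstream tools.
CREATE TABLE events_parquet STORED AS PARQUET AS
SELECT event_id, payload
FROM raw_events_json;
```

The same pattern works for CSV, Avro, or ORC inputs: point an external table with the appropriate SerDe at the raw files, then `CREATE TABLE ... AS SELECT` into one standard format.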