Question-76: During aggregation, Impala sometimes uses disk space as well. Why?

Answer: Impala performs aggregation with a hash table that is held in memory. From Impala 2.0 onward, if the memory requirements of a join or aggregation operation exceed the memory limit available on a particular host, Impala spills to a temporary work area on disk so that the query can still complete successfully.
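
The spill behaviour can be pictured with a toy sketch. The Python below only illustrates the general idea (aggregate in an in-memory hash table, write partial results to a temporary file once a limit is exceeded, merge at the end). It is not Impala's actual algorithm, and the names `spilling_count` and `max_groups_in_memory` are invented for this example.

```python
import os
import pickle
import tempfile
from collections import Counter


def spilling_count(keys, max_groups_in_memory):
    """Count occurrences per key, spilling partial counts to disk.

    Toy model only: max_groups_in_memory stands in for a per-host memory
    limit. Returns (final_counts, number_of_spills).
    """
    in_memory = Counter()
    spill_files = []

    for key in keys:
        in_memory[key] += 1
        # Once the hash table holds too many groups, write it out and start over.
        if len(in_memory) > max_groups_in_memory:
            with tempfile.NamedTemporaryFile(delete=False, suffix=".spill") as f:
                pickle.dump(dict(in_memory), f)
                spill_files.append(f.name)
            in_memory.clear()

    # Merge the spilled partial aggregates with what is still in memory.
    # (This toy merge holds the final result in memory; a real engine would
    # partition the data so that each merge step fits within the limit.)
    result = Counter(in_memory)
    for path in spill_files:
        with open(path, "rb") as f:
            result.update(pickle.load(f))
        os.remove(path)
    return dict(result), len(spill_files)
```

With a generous limit nothing is written to disk; with a small limit the same answer is produced, only more slowly because of the extra I/O.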

 

Question-77: Which types of metadata does Impala currently use?

Answer: Impala currently uses two types of metadata:

  • Catalog information from the Hive Metastore
  • File metadata from the NameNode

Both kinds of metadata are populated lazily (that is, loaded when first needed) and then cached. The REFRESH statement updates the metadata for a particular table, for example after new data has been loaded through Hive. The INVALIDATE METADATA statement marks all metadata as stale so that it is reloaded, which lets Impala recognize new tables and other DDL and DML changes made through Hive. In Impala 1.2 and later, a daemon named catalogd broadcasts metadata changes made through Impala itself to all nodes, so these statements are needed mainly for changes made outside Impala, such as through Hive.
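
As a rough mental model, the lazy caching and the two statements can be sketched as below. This is a simplified illustration, not Impala code: the `MetadataCache` class and its `loader` argument are hypothetical stand-ins for fetching a table's metadata from the Hive Metastore and the NameNode.

```python
class MetadataCache:
    """Toy lazily populated metadata cache.

    `loader` is a hypothetical function that stands in for fetching a
    table's metadata from the Metastore and the NameNode.
    """

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}

    def get(self, table):
        # Lazy population: load only on first use, then serve from the cache.
        if table not in self._cache:
            self._cache[table] = self._loader(table)
        return self._cache[table]

    def refresh(self, table):
        # Roughly analogous to REFRESH: reload one table's metadata now.
        self._cache[table] = self._loader(table)

    def invalidate_metadata(self):
        # Roughly analogous to INVALIDATE METADATA: drop every entry;
        # each one is reloaded the next time it is requested.
        self._cache.clear()
```

The sketch also shows why INVALIDATE METADATA is the more expensive choice: after it runs, the first query against each table has to pay the loading cost again.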

 

Question-78: What does Impala use the NameNode for?

Answer: Impala contacts the NameNode during the planning phase to obtain file metadata, so that the query work can be sent to the hosts that hold the data. During execution, every impalad reads files as part of normal query processing.
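
The idea of sending work to the hosts that hold the data can be sketched as follows. This is a simplified illustration under assumed inputs, not Impala's scheduler: `assign_blocks` and the shape of `block_locations` are invented for the example.

```python
def assign_blocks(block_locations):
    """Assign each block to one of the hosts that stores a replica of it.

    block_locations: dict mapping block id -> list of hosts with a replica
    (the kind of information that file metadata from the NameNode provides).
    Returns a dict mapping host -> list of block ids to read locally.
    """
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        # Pick the replica host with the least work so far; ties go to the
        # host listed first.
        chosen = min(hosts, key=lambda h: len(assignment.get(h, [])))
        assignment.setdefault(chosen, []).append(block)
    return assignment
```

For instance, with block `b1` on hosts `h1` and `h2`, `b2` on `h1` and `h3`, and `b3` on `h2` and `h3`, every host ends up reading exactly one block from its local disks.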

 

Question-79: From a design perspective, why is Impala considered a faster query engine?

Answer: There are several reasons why Impala is faster than other Hadoop components:

  • Impala does not use MapReduce, which has several processing inefficiencies.
  • Impala normally does not materialize intermediate results to disk.
  • Impala runs as a long-lived service, so queries do not pay the start-up time of a MapReduce job.
  • Instead of building a pipeline of Map and Reduce jobs to run a query, Impala distributes the query plan across the nodes, which avoids the overhead of the sort and shuffle phases when they are not needed.

 

Question-80: Which hardware features does Impala use?

Answer: Impala has a more efficient execution engine because it takes advantage of modern hardware and technologies:

  • Impala generates code at runtime, using LLVM to produce machine code specialized for the query being run.
  • Whenever possible, Impala uses the available hardware instructions, such as the SSE4.2 instruction set.
  • Impala uses better I/O scheduling because it knows the location of the data blocks from the metadata it reads from the NameNode.
  • Impala was also designed around performance-oriented fundamentals such as:
    • Tight inner loops
    • Inlined function calls
    • Minimal branching
    • Better use of cache
    • Minimal memory usage
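
The runtime code generation point can be illustrated with a small analogy. Impala emits native code through LLVM; the sketch below instead generates Python source text for one specific predicate and compiles it with `exec()`, so it shows only the principle of specializing code for the query at hand. The function `compile_filter` and the `ALLOWED_OPS` set are invented for this example.

```python
ALLOWED_OPS = {"<", "<=", ">", ">=", "==", "!="}


def compile_filter(column, op, value):
    """Build a filter function specialized for one predicate.

    Illustrative only: Impala emits native code through LLVM, whereas this
    sketch generates Python source text and compiles it with exec().
    """
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    # The column name, operator, and constant are baked into the generated
    # source, so the per-row loop does no operator dispatch at run time.
    source = (
        "def specialized(rows):\n"
        f"    return [row for row in rows if row[{column!r}] {op} {value!r}]\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["specialized"]
```

A call such as `compile_filter("x", ">", 4)` returns a function whose inner loop contains only the comparison `row['x'] > 4`, which mirrors the goals of tight inner loops and minimal branching listed above.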