Question-71: Is Impala a single point of failure?

Answer: No. All Impala daemons are able to handle incoming queries. If a machine fails, the queries with fragments running on that machine will fail, but because queries are expected to return quickly, you can simply re-run a failed query.
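Because any failed query can simply be re-run, client-side handling can be as simple as a retry loop. A minimal sketch in Python, where `run_query` is a hypothetical stand-in for a real client call (not an Impala API):

```python
import time

def run_with_retry(run_query, sql, attempts=3, delay=1.0):
    """Retry a short-running query a few times after a failure.
    `run_query` is a hypothetical callable standing in for a real
    Impala client call; RuntimeError stands in for a client error."""
    last_error = None
    for _ in range(attempts):
        try:
            return run_query(sql)
        except RuntimeError as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error

# Usage: a fake client that fails once (as if a fragment host died),
# then succeeds on the re-run.
calls = {"n": 0}
def fake_run(sql):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("fragment host went down")
    return [("ok",)]

result = run_with_retry(fake_run, "SELECT 1", delay=0.0)
print(result)  # → [('ok',)]
```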

 

Question-72: How are Impala, Hive, HDFS HA, and the NameNode related?

Answer: Impala uses the same Hive metastore and aggressively caches the metadata, so the metastore host should experience minimal load. Impala relies on the HDFS NameNode, and you can configure HA for HDFS. Impala also has centralized services, known as the statestore and catalog services, that run on one host only. Even if the statestore host is down, Impala continues to execute queries; it just does not receive state updates.

 

Question-73: What happens when the Impala statestore is down?

Answer: Suppose a new host is added to the cluster while the statestore is down; the existing impalad instances running on the other hosts will not find out about the new host. Once the statestore process is restarted, all the information it serves is automatically reconstructed from the running Impala daemons.

 

Question-74: Why is it recommended that an Impala daemon run on each DataNode?

Answer: It is highly recommended that the impalad daemon run on every DataNode in the cluster, to avoid remote data reads that would hurt query performance. Whenever possible, Impala schedules query fragments on the hosts holding the data relevant to the query.

 

Question-75: How are joins performed between a small and a large table?

Answer: There are various join strategies, chosen based on the sizes of the tables. When a large table is joined with a small table, the data from the small table is transmitted to each node for intermediate processing. This is known as a broadcast join.
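The broadcast strategy can be sketched in plain Python: each "node" holds one slice of the large table and receives a full copy of the small table, then joins locally. The table contents and node layout here are made up for illustration:

```python
# Large table rows are spread across "nodes"; the small table is
# broadcast (copied in full) to every node, which then joins locally.
large_slices = [
    [(1, "a"), (2, "b")],      # rows held by node 0
    [(3, "c"), (4, "d")],      # rows held by node 1
]
small = {1: "x", 3: "y"}       # small dimension table, keyed by join key

def node_join(slice_, small_copy):
    # Hash join of one node's slice against its copy of the small table.
    return [(k, v, small_copy[k]) for k, v in slice_ if k in small_copy]

results = []
for slice_ in large_slices:                          # one iteration = one node
    results.extend(node_join(slice_, dict(small)))   # full copy per node

print(sorted(results))  # → [(1, 'a', 'x'), (3, 'c', 'y')]
```

Note that the small table is duplicated once per node, which is cheap only because the table is small; that is exactly why this strategy is reserved for large-to-small joins.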

 

When both tables are large, the data from each table is divided into pieces by hashing on the join key, and each node processes only the matching pieces. This is known as a partitioned join.
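The partitioned strategy can be sketched the same way: rows from both tables are hash-partitioned on the join key, so matching keys always land on the same node and no node needs a full copy of either table. The tables and node count below are made up for illustration:

```python
# Partitioned (shuffle) join: rows from BOTH tables are routed by a
# hash of the join key, so matching keys end up on the same node.
NODES = 2

def partition(rows, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)  # route by join key
    return parts

t1 = [(1, "a"), (2, "b"), (3, "c")]
t2 = [(1, "x"), (3, "y"), (4, "z")]

p1, p2 = partition(t1, NODES), partition(t2, NODES)

results = []
for node in range(NODES):          # each node joins only its own pieces
    lookup = dict(p2[node])
    results.extend((k, v, lookup[k]) for k, v in p1[node] if k in lookup)

print(sorted(results))  # → [(1, 'a', 'x'), (3, 'c', 'y')]
```

Compared with the broadcast sketch above, each node sees only about 1/NODES of each table instead of a full copy of one of them, which is what makes this strategy viable when neither table fits comfortably in memory on every node.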