Hadoop Interview Questions and Answers

About Hadoop

If you really want to work on a Big Data project with an actively maintained framework, then I think Hadoop is one of the best choices. Having chosen this book, you have already started your journey.

As of 2019-20, Hadoop has been in the industry for more than a decade. It is still a highly active product, and many investment banks, healthcare IT companies, giant retail chains, and travel, entertainment, and gaming companies use the Hadoop framework in production. I have attended many interviews across this industry in India, in cities such as Mumbai, Bangalore, Chennai, and Hyderabad.

Many companies are trying to compete with the Hadoop framework using their own custom products, but Hadoop still wins the race. Other open source frameworks such as Apache Spark, and cloud providers such as AWS, Azure, and Google Cloud with their cloud-based solutions, are also trying to compete. Yet Hadoop wins in many places and has shown that not every component of the Hadoop ecosystem can be replaced. Even these competitors rely on Hadoop ecosystem components; Hive, Pig, HDFS, and HBase are a few examples. If you go for an enterprise solution such as Cloudera or Hortonworks, those are without doubt superb.

The Hadoop framework has two main sub-frameworks at its core:

MapReduce
HDFS
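
To make the MapReduce idea concrete, here is a minimal single-process sketch in Python of the classic word count. It only illustrates the map, shuffle, and reduce steps; it is not Hadoop's API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this illustration. In a real cluster each phase would run in parallel on many machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the split is that map_phase and reduce_phase never look at more than one line or one key at a time, which is what allows a framework to distribute them across nodes.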

Yes, MapReduce took a hit from the Spark execution engine, because Spark provides a much faster execution engine by keeping as much data as possible in memory. This is not the only reason, however. Spark has developed a new execution engine, which is based on the following two components:

Catalyst Optimizer : Check Module-2 on HadoopExam.com. This is Spark's own extensible optimizer, to which you can add your own optimization rules as well.
Project Tungsten : Check Module-3. This is the project in which Spark has done a lot of work to make better use of CPU caches such as L1, L2, and L3. These modules explain all the details in depth.
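
As a rough picture of what an "extensible optimizer" means, the toy sketch below rewrites a small expression tree by applying a list of rules until nothing changes. This is an assumption-laden illustration in plain Python, not Catalyst and not Spark's API: the tuple-based expression format and the names fold_constants, drop_multiply_by_one, and optimize are all hypothetical. Adding your own optimization amounts to appending another rule function to the list.

```python
# Toy expression format (hypothetical, for illustration only):
#   ("lit", 3)            a literal value
#   ("col", "price")      a column reference
#   ("add", left, right)  addition of two sub-expressions
#   ("mul", left, right)  multiplication of two sub-expressions


def fold_constants(expr):
    """Rule: replace an operation on two literals with its computed literal."""
    if expr[0] in ("add", "mul") and expr[1][0] == "lit" and expr[2][0] == "lit":
        a, b = expr[1][1], expr[2][1]
        return ("lit", a + b if expr[0] == "add" else a * b)
    return expr


def drop_multiply_by_one(expr):
    """Rule: x * 1 and 1 * x are both just x."""
    if expr[0] == "mul":
        if expr[1] == ("lit", 1):
            return expr[2]
        if expr[2] == ("lit", 1):
            return expr[1]
    return expr


def optimize(expr, rules):
    """Optimize children first, then apply rules to this node until stable."""
    if expr[0] in ("add", "mul"):
        expr = (expr[0], optimize(expr[1], rules), optimize(expr[2], rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(expr)
            if rewritten != expr:
                expr, changed = rewritten, True
    return expr


if __name__ == "__main__":
    plan = ("mul", ("col", "price"), ("add", ("lit", 0), ("lit", 1)))
    print(optimize(plan, [fold_constants, drop_multiply_by_one]))
```

With both rules the example plan collapses to the bare column reference; with only fold_constants the multiplication by the folded literal remains, which shows how each added rule extends what the optimizer can simplify.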

However, two popular frameworks were developed on top of the MapReduce compute engine:

BigData Data Warehouse solution : Apache Hive
BigData Data Pipeline solution : Apache Pig

Many organizations use these two popular frameworks in their production environments. Many applications have already been developed with them, and new applications are added regularly. The two frameworks help in creating ETL pipelines as well as the more advanced analytical functions needed for Big Data warehouse loads.

Another popular Hadoop component is HDFS (Hadoop Distributed File System), and at least as of now no framework can replace its security and capability. It remains the most popular solution in the industry for creating a data lake from structured and unstructured data. Even offerings such as IBM BigInsights, Databricks DBFS (Databricks File System), Microsoft HDInsight, and the MapR File System either use HDFS or expose an HDFS-compatible interface. Two pioneering companies, Hortonworks and Cloudera (now merged), use HDFS heavily to provide data lake solutions to companies across the globe.

In the finance, healthcare, and medical industries, there are regulatory requirements that data must not leave the company's own data center, or must remain in a particular country. Such companies are therefore bound to use an HDFS-based storage solution to keep the data in-house.

Many organizations have created their own frameworks using Apache Hadoop, as I mentioned previously. The three companies below are the core companies involved both in building solutions with the Hadoop framework and in contributing code to the open source Hadoop framework.

Cloudera Inc
Hortonworks Inc
MapR Inc

Other technology giants such as Microsoft and IBM have built frameworks on the same Apache Hadoop and provide Hadoop-based solutions in their respective clouds, such as Azure for Microsoft.

Even unlimited cloud storage solutions such as AWS (Amazon Web Services) S3, Google Cloud, and Alibaba Cloud have not been able to replace the HDFS storage engine. Cloud storage solutions always carry security concerns and regulatory issues, and in the long run they are more costly for storing and accessing Big Data, specifically for big organizations.

You may be interested in the following learning material for the Hadoop framework and its certifications.