Question 11: You have a 30-node EMR cluster that currently holds 1 TB of data, and 10 GB of new data is added every day. After a few days you run into a storage issue: the cluster is about to reach 90% storage usage, and you need to increase its storage capacity. The cluster uses a replication factor of 3. Which of the options below are suitable ways to increase the overall storage available to the cluster?

A. We can reduce the replication factor to 2.

B. We can add a few more nodes to the cluster.

C. We can create an additional S3 bucket and attach it to the cluster.

D. We can decompress all the data on the cluster.

1. A,B

2. B,C

3. C,D

4. A,D

5. B,D

Correct Answer: 1

Exp: The EMR cluster has a replication factor of 3, which means three copies of every block of data are stored. If you reduce the replication factor to 2, the raw disk space consumed by the data drops by about 33%, which frees capacity on the existing disks. However, reducing the replication factor also puts your data at greater risk: if you lose both remaining copies of the data, you have no way to recover it.
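The replication arithmetic can be checked with a short sketch. It assumes the 1 TB in the question is the logical (pre-replication) data size; the function name `raw_usage_tb` is hypothetical and used only for illustration.

```python
def raw_usage_tb(logical_tb: float, replication: int) -> float:
    """Raw disk space consumed when every block is stored `replication` times."""
    return logical_tb * replication


logical_tb = 1.0  # assumption: the 1 TB from the question is logical data
raw_rf3 = raw_usage_tb(logical_tb, 3)  # raw footprint with 3 copies
raw_rf2 = raw_usage_tb(logical_tb, 2)  # raw footprint with 2 copies

# Fraction of raw disk space freed by going from 3 copies to 2.
raw_saving = (raw_rf3 - raw_rf2) / raw_rf3

# Extra logical data the same raw capacity can hold at factor 2 versus 3.
capacity_gain = (3 / 2) - 1

print(f"raw usage: {raw_rf3:.1f} TB -> {raw_rf2:.1f} TB")
print(f"raw space freed: {raw_saving:.0%}")
print(f"usable capacity gain: {capacity_gain:.0%}")
```

Under these assumptions the raw footprint falls by one third, while the same disks can hold one half more logical data. These are two views of the same change.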

Alternatively, rather than reducing the replication factor, you can add more nodes to the cluster. Each node brings additional disk space, so the overall cluster storage increases. There are also other ways to increase the HDFS storage size, as listed below.

1. Create the cluster with additional EBS volumes, or add instance groups with attached EBS volumes to an existing cluster.

2. Add more core nodes to the cluster.

3. Choose a larger EC2 instance type that has more storage capacity.

4. Use data compression.

5. Reduce the replication factor by changing its value (the dfs.replication property) in the HDFS configuration file.
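To compare these options, the following is a minimal sketch that estimates how many days remain before raw HDFS usage reaches the 90% mark. The per-node capacity (`NODE_GB`), the node count after scaling, and the compression ratio are all hypothetical values chosen for illustration; the question does not give them. Only the 30 nodes, 1 TB of data, 10 GB of daily growth, and replication factor of 3 come from the question.

```python
import math


def days_until_threshold(raw_capacity_gb, logical_gb, daily_growth_gb,
                         replication=3, compression_ratio=1.0, threshold=0.9):
    """Whole days until raw usage reaches `threshold` of raw capacity.

    compression_ratio is stored size / original size (1.0 means uncompressed).
    Returns 0 if usage is already at or above the threshold.
    """
    if daily_growth_gb <= 0:
        raise ValueError("daily_growth_gb must be positive")
    # Raw GB written to disk for each logical GB of data.
    raw_per_logical_gb = replication * compression_ratio
    used_gb = logical_gb * raw_per_logical_gb
    limit_gb = raw_capacity_gb * threshold
    if used_gb >= limit_gb:
        return 0
    daily_raw_gb = daily_growth_gb * raw_per_logical_gb
    return math.floor((limit_gb - used_gb) / daily_raw_gb)


NODE_GB = 120  # hypothetical HDFS capacity per node, not from the question

baseline = days_until_threshold(30 * NODE_GB, 1000, 10)
more_nodes = days_until_threshold(40 * NODE_GB, 1000, 10)  # 10 extra nodes (assumed)
lower_rf = days_until_threshold(30 * NODE_GB, 1000, 10, replication=2)
compressed = days_until_threshold(30 * NODE_GB, 1000, 10,
                                  compression_ratio=0.5)  # assumed ratio

for label, days in [("baseline", baseline), ("more nodes", more_nodes),
                    ("replication 2", lower_rf), ("compressed", compressed)]:
    print(f"{label}: about {days} days until 90% usage")
```

The model treats every node as contributing the same HDFS capacity and ignores non-HDFS disk usage, so the numbers are only a rough guide to how each option shifts the timeline.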

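Options C and D are therefore incorrect: an S3 bucket is a separate storage layer and does not add to the cluster's HDFS capacity, and decompressing the data increases storage usage rather than reducing it.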
