Question 8: You want to create a 20-node EMR cluster capable of storing 300 GB of data. However, your cluster does not run all the time; you need to run it only once every two weeks. Also, you sometimes delete the EMR cluster entirely and create a new one with different node types. Even after you delete the EMR cluster, the data must remain available. Which of the following satisfies this requirement?

1. You will create HDFS on the instance store.

2. You will use AWS EMRFS.

3. You will create the HDFS storage layer on EBS volumes attached to the EC2 instances.

4. You will use DynamoDB for storage.

5. You will use AWS Aurora.

Correct Answer: 2. Exp: The question requires that the data persist even when the cluster is down or terminated, so you cannot use storage that lives on the EMR nodes themselves. Hence, options 1 and 3 are out.

Option 4 suggests DynamoDB, which is a NoSQL database, not the file-based storage that AWS EMR requires. Hence, it cannot be the correct option.

Option 5: Aurora is an RDBMS solution; again, it cannot be the correct option.

The only remaining option is 2: use EMRFS (the EMR File System), which is built on S3 and provides capabilities similar to HDFS, the Hadoop Distributed File System.

Amazon EMR and Hadoop provide a variety of file systems that you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using EMRFS. The following table lists the available file systems, with recommendations about when it's best to use each one.
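To illustrate the URI-prefix rule above, here is a minimal sketch of how such a URI decomposes: the scheme (s3, hdfs, file, ...) is what selects the file-system implementation, while the rest names the bucket and object key. The bucket name is the one from the example above; nothing here calls AWS.

```python
from urllib.parse import urlparse

# EMR picks the file system by the URI scheme:
#   s3://...   -> EMRFS (backed by Amazon S3)
#   hdfs://... -> HDFS on the cluster's local storage
uri = "s3://myawsbucket/path"

parsed = urlparse(uri)
scheme = parsed.scheme            # "s3" -> EMRFS handles this path
bucket = parsed.netloc            # "myawsbucket"
key = parsed.path.lstrip("/")     # "path"

print(scheme, bucket, key)        # → s3 myawsbucket path
```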

EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
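As a sketch of the scenario in the question, the commands below show the general shape of keeping data in S3 (via EMRFS) so it survives cluster deletion. The bucket name, cluster name, and instance sizing are hypothetical placeholders, not values from the question; this is a CLI outline, not a runnable recipe.

```shell
# Hypothetical bucket: data lives in S3, independent of any cluster.
aws s3 mb s3://my-emr-data-bucket

# Create the 20-node cluster; instance types can change on each recreation
# because the data stays in S3, not on the nodes.
aws emr create-cluster \
  --name "biweekly-cluster" \
  --release-label emr-6.15.0 \
  --instance-type m5.xlarge \
  --instance-count 20 \
  --use-default-roles

# From within the cluster, read and write through EMRFS via s3:// paths:
#   hadoop fs -ls s3://my-emr-data-bucket/input/
```

After the job finishes, the cluster can be terminated outright; the next run reads the same s3:// paths from a fresh cluster.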
