Question-6: You are working with a company that runs daily batch jobs to analyze its website user data; the website gets millions of users every day. The batch job performs data cleanup and transformation, and finally generates well-formatted data. SQL queries are then issued on this data as part of the batch job to generate reports. The batch job does not necessarily run every day and is sometimes skipped, but whenever it runs the data volume is around 4 TB. Which of the following is a good solution for implementing this requirement?
- You would be using AWS Kinesis Firehose, Kinesis Data Analytics and AWS Lambda
- You would be using S3 as data storage and then creating an EC2 cluster to process the data from the S3 bucket.
- You would be storing this data in AWS DynamoDB and running AWS Lambda on the data whenever required.
- You would be creating an EMR cluster; whenever jobs need to be initiated you would add task nodes, and as soon as processing finishes you would remove the task nodes.
Exp: An EMR cluster is a good solution for processing huge volumes of data. It is basically a Hadoop cluster where you can add or remove nodes as needed. If you don't need the EMR cluster, terminate it, and spin it up again whenever it is needed. As more tasks need to be completed, add more task nodes, and when the tasks or batch finish, remove the task nodes. With EMR, you don't need to guess your future requirements or provision for peak demand, because you can easily add or remove capacity at any time.
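The add-then-remove pattern for task nodes described above can be sketched roughly as follows. This is only an illustration, not a production setup: it assumes `emr` is a boto3 EMR client (for example `boto3.client("emr")`), and the group name `batch-task-nodes`, the default instance type `m5.xlarge`, and the helper names are hypothetical choices made for this example. It also assumes the task group may be resized down to zero instances once the batch is done.

```python
def add_task_nodes(emr, cluster_id, count, instance_type="m5.xlarge"):
    """Add a task instance group before a batch run and return its ID.

    `emr` is assumed to be a boto3 EMR client; the group name and the
    default instance type are illustrative placeholders.
    """
    response = emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[
            {
                "Name": "batch-task-nodes",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": count,
            }
        ],
    )
    # The response lists the IDs of the newly created instance groups.
    return response["InstanceGroupIds"][0]


def remove_task_nodes(emr, cluster_id, group_id):
    """Shrink the task instance group to zero nodes after the batch ends."""
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": group_id, "InstanceCount": 0}],
    )
```

A batch wrapper would call `add_task_nodes` before submitting the processing steps and `remove_task_nodes` once they complete, so the extra capacity is only paid for while the 4 TB job is actually running.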