Question 50: You are working at an e-commerce company and need to create a Data Lake solution for the organization. You receive data on a daily basis and store it in the Data Lake. However, before storing this data in the Data Lake, you need to perform various processing steps, such as joining the sales data with the marketing data, and then save the result in the Data Lake. Which of the following would be helpful for the given requirement?

A. You will be using AWS S3 as a Data Lake.

B. You will be using HDFS as a Data Lake

C. You will be using a farm of EC2 servers as a Data Lake

D. You will be using AWS Lambda and Step Functions

E. You will be using the AWS CloudTrail API to launch ETL jobs

F. You will be using an SNS topic to trigger an ETL job

1. A,B

2. C,D

3. D,E

4. A,D

5. C,F

Correct Answer: 4. Explanation: In this question there are two objectives:

1. Creating the Data Lake.

2. Creating ETL jobs and triggering them whenever data arrives. The ETL jobs process the data, for example by joining the sales and marketing data, and finally save the result in the Data Lake.

To create a Data Lake, you can use AWS S3 storage, which can store any volume of data in any format. Hence, option A is correct.
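For illustration only, a minimal boto3 sketch of landing a daily raw file in an S3-based Data Lake (the bucket name, prefix layout, and file name below are hypothetical):

import boto3
from datetime import date

s3 = boto3.client("s3")

# Hypothetical data-lake bucket with a date-partitioned "raw" zone
bucket = "ecommerce-data-lake"
key = f"raw/sales/dt={date.today().isoformat()}/sales.csv"

# Upload the day's raw sales file as-is; the join with marketing data happens later in the ETL job
s3.upload_file("sales.csv", bucket, key)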

If you were building the Data Lake in-house, you could consider an HDFS-based solution. But since we are using AWS S3, there is no need to run an EMR cluster unnecessarily just to get HDFS storage for the Data Lake. Hence, we cannot consider option B.

Similarly, there is no reason to provision a farm of EC2 servers just to create storage; instead we will use the storage already provided by AWS S3. Hence, option C is also out.

Now we need to create ETL jobs and be able to trigger those jobs based on data-arrival events.

An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where the data is ready for consumption. The sources and targets of an ETL job can be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake on AWS.

AWS offers AWS Glue, a service that helps author and deploy ETL jobs. AWS Glue is a fully managed extract, transform, and load service that makes it easy for customers to prepare and load their data for analytics. Other AWS services can also be used to implement and manage ETL jobs.
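As a sketch only, and assuming the sales and marketing tables are already cataloged in a hypothetical "ecommerce" Glue Data Catalog database and share a hypothetical campaign_id join key, a Glue PySpark job for the join step could look roughly like this:

import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the cataloged source tables (database and table names are hypothetical)
sales = glueContext.create_dynamic_frame.from_catalog(database="ecommerce", table_name="sales")
marketing = glueContext.create_dynamic_frame.from_catalog(database="ecommerce", table_name="marketing")

# Join sales data with marketing data on a hypothetical campaign_id key
joined = Join.apply(frame1=sales, frame2=marketing, keys1=["campaign_id"], keys2=["campaign_id"])

# Write the joined result back to the S3 Data Lake (path is hypothetical)
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://ecommerce-data-lake/curated/sales_marketing/"},
    format="parquet",
)

job.commit()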

How can we orchestrate an ETL workflow that involves a diverse set of ETL technologies? AWS Glue, AWS DMS, Amazon EMR, and other services support Amazon CloudWatch Events, which we could use to chain ETL jobs together. Amazon S3, the central data lake store, also supports CloudWatch Events. But relying on CloudWatch Events alone means that there is no single visual representation of the ETL workflow. Also, tracing the overall ETL workflow's execution status and handling error scenarios can become a challenge.
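To make the chaining idea concrete, a rule that reacts when a Glue job finishes could be created with boto3 roughly as below; the rule name, job name, and target Lambda ARN are hypothetical, and the event pattern assumes the documented aws.glue "Glue Job State Change" event format:

import json
import boto3

events = boto3.client("events")

# Fire when the (hypothetical) join job finishes successfully
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"jobName": ["join-sales-marketing"], "state": ["SUCCEEDED"]},
}

events.put_rule(Name="on-join-job-succeeded", EventPattern=json.dumps(pattern), State="ENABLED")

# Route the event to a hypothetical Lambda that kicks off the next ETL step
events.put_targets(
    Rule="on-join-job-succeeded",
    Targets=[{"Id": "next-etl-step", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:next-etl-step"}],
)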

AWS Step Functions and AWS Lambda can be used to orchestrate multiple ETL jobs involving a diverse set of technologies in an arbitrarily complex ETL workflow. AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components, each of which performs a discrete function, or task, allowing you to scale and change applications quickly.
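A minimal sketch of the data-arrival trigger, assuming an S3 event notification invokes this Lambda function and that the ETL workflow's state machine ARN is supplied through a hypothetical environment variable:

import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # Invoked by an S3 object-created event when a new daily file lands in the Data Lake
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Start the ETL workflow state machine, passing the new object's location as input
    sfn.start_execution(
        stateMachineArn=os.environ["ETL_STATE_MACHINE_ARN"],
        input=json.dumps({"bucket": bucket, "key": key}),
    )

Inside the state machine, individual Lambda tasks can then start the Glue join job (for example with glue.start_job_run) and check its status, while Step Functions provides the single visual representation, retries, and error handling that CloudWatch Events alone lacks.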
