Question 49: Your organization runs Amazon Connect, which continuously submits events through Kinesis Data Streams and Kinesis Data Firehose. You need to build a data lake solution on an S3 bucket. The data should be saved in a format that reduces the time to insight and can be queried easily. In which of the following formats should the data be stored, given that incoming events may have missing fields?

A. JSON

B. XML

C. Parquet

D. ORC

E. CSV

1. A,B

2. B,C

3. C,D

4. D,E

5. A,E

Correct Answer : 3 Exp : Use ORC or Parquet to store data in the S3 data lake solution. Both are columnar storage formats, which is why they are recommended for performance and cost savings when querying data in S3. Data saved in these formats can be queried efficiently by Athena and Redshift, and processed efficiently by Glue and EMR. Both formats store a schema with the data and can record a missing field as a null, so events with missing fields can still be written. Both were developed within the Hadoop ecosystem to take advantage of its efficiency.
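The columnar idea in the explanation above can be illustrated with a small sketch. This is a toy, standard-library-only model of a column-oriented layout, not the Parquet or ORC file format itself. The event field names (`event_id`, `agent`, `queue`) and the helper `to_columns` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical events; the field names are assumptions, not a real Amazon Connect schema.
raw_events = [
    '{"event_id": "e1", "agent": "alice", "queue": "sales"}',
    '{"event_id": "e2", "agent": "bob"}',        # "queue" is missing
    '{"event_id": "e3", "queue": "support"}',    # "agent" is missing
]

# Fixed schema, as a columnar format would declare up front.
SCHEMA = ["event_id", "agent", "queue"]


def to_columns(json_lines, schema):
    """Pivot row-oriented JSON records into one list per column.

    A field that is absent from a record is stored as None (a null),
    so every column ends up with one entry per record.
    """
    columns = {name: [] for name in schema}
    for line in json_lines:
        record = json.loads(line)
        for name in schema:
            columns[name].append(record.get(name))
    return columns


columns = to_columns(raw_events, SCHEMA)

# A query that needs only one field touches only that column,
# instead of parsing every full record.
queues = [q for q in columns["queue"] if q is not None]
print(queues)  # ['sales', 'support']
```

Real columnar formats add compression, encoding and per-column statistics on top of this layout, which is where most of the query cost savings come from.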

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.
