S3 bucket. The data should also be saved in a format that reduces the time to insight and can be queried easily. In which of the following formats should the data be stored, given that incoming events
may have missing fields?
A. JSON
B. XML
C. Parquet
D. ORC
E. CSV
1. A,B
2. B,C
3. C,D
4. D,E
5. A,E
Correct Answer: 3
Explanation: ORC and Parquet are the recommended formats for storing data in an S3 data lake. Both are columnar storage formats, so a query reads only the columns it needs, which improves performance and lowers
the cost of querying data in S3. Both also store a schema with the data, so a field that is missing from an incoming event can be recorded as a null value. Athena, Redshift Spectrum, Glue, and EMR can all query or
process data in these formats efficiently. Both formats originated in the Hadoop ecosystem and were designed for efficient analytical processing.
With Amazon Redshift Spectrum, you can query data directly in S3 from your existing Amazon Redshift data warehouse cluster. When the data is already stored in Parquet format, Redshift Spectrum gets the same
benefits that Athena does.
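The idea behind a columnar layout can be illustrated with a small sketch in plain Python. This is not Parquet or ORC itself, only a toy model of the layout: the function name `to_columns` and the sample events are hypothetical. It pivots row-oriented events into one list per field and stores None where an event lacks a field, which is roughly how a columnar format can represent missing fields as nulls.

```python
from typing import Any


def to_columns(events: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented events into one list per field.

    A field that is absent from an event is stored as None.
    """
    # Collect every field name, in first-seen order.
    fields: list[str] = []
    for event in events:
        for name in event:
            if name not in fields:
                fields.append(name)
    # Build one column per field; dict.get returns None for missing fields.
    return {name: [event.get(name) for event in events] for name in fields}


# Hypothetical incoming events; the second and third each lack one field.
events = [
    {"user_id": 1, "action": "click", "latency_ms": 120},
    {"user_id": 2, "action": "view"},
    {"user_id": 3, "latency_ms": 80},
]

columns = to_columns(events)

# A query over one field reads only that column and skips the nulls.
latencies = [value for value in columns["latency_ms"] if value is not None]
average_latency = sum(latencies) / len(latencies)
```

In this toy model, computing the average latency never touches the user_id or action columns. Real columnar formats add compression, encoding, and file-level metadata on top of this layout.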