Question 4: You are working with an e-commerce company that parses its log data and stores it in a Redshift cluster table. The parsed data has columns such as productId, IPaddress, username, orderValue, timestamp, productname, and orderId. Whenever analytics is applied, only the most recent orders are preferred for the analysis. From the list of given columns, which would you choose as the sort key for this table?

1. orderId

2. IPaddress

3. username

4. timestamp

5. productId

Correct Answer: 4

Explanation: As the question clearly states, recent data is regularly fetched from the table for analysis, and to get the recent data you have to use the timestamp column. No other column is more helpful for this. You might consider sorting the data on orderId, but you do not really know whether its values are numeric or alphanumeric, so you cannot assume it increases steadily as orders are placed.

What AWS says:

Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.

Queries are more efficient because they can skip entire blocks that fall outside the time range.
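To make this concrete, here is a minimal sketch of what the table from the question could look like with timestamp as the leading sort key, together with a typical recent-data query. The table name, data types, and the 7-day window are assumptions for illustration only; the column names come from the question, and timestamp is quoted because it is a Redshift reserved word.

-- Sketch only: hypothetical DDL with timestamp as the leading (and only) sort key.
CREATE TABLE order_log (
    orderId     VARCHAR(32),
    productId   VARCHAR(32),
    productname VARCHAR(255),
    username    VARCHAR(64),
    IPaddress   VARCHAR(45),
    orderValue  DECIMAL(12,2),
    "timestamp" TIMESTAMP
)
SORTKEY ("timestamp");

-- Recent-data analysis: blocks whose min/max "timestamp" fall entirely
-- outside the last 7 days can be skipped.
SELECT productId, SUM(orderValue) AS revenue
FROM order_log
WHERE "timestamp" >= DATEADD(day, -7, GETDATE())
GROUP BY productId;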

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.

Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
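As a sketch of the range/equality filtering case (the table and column names here are hypothetical, not from the question): declaring the frequently filtered column as the sort key lets that per-block min/max metadata exclude blocks that cannot match.

-- Hypothetical table: productId is filtered on constantly, so it is the sort key.
CREATE TABLE clickstream_events (
    event_id   BIGINT,
    productId  VARCHAR(32),
    event_type VARCHAR(16),
    event_time TIMESTAMP
)
SORTKEY (productId);

-- Only blocks whose stored min/max productId range can contain 'P-1001' are read.
SELECT COUNT(*)
FROM clickstream_events
WHERE productId = 'P-1001';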

If you frequently join a table, specify the join column as both the sort key and the distribution key.

Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
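A sketch of the join case, again with hypothetical table definitions: the join column is declared as both DISTKEY and SORTKEY on both tables, so matching rows are co-located on the same slice and already sorted, which is what allows the optimizer to choose a sort merge join and skip its sort phase.

-- Hypothetical pair of tables joined on orderId.
CREATE TABLE orders (
    orderId    VARCHAR(32),
    username   VARCHAR(64),
    orderValue DECIMAL(12,2)
)
DISTKEY (orderId)
SORTKEY (orderId);

CREATE TABLE order_items (
    orderId   VARCHAR(32),
    productId VARCHAR(32),
    quantity  INT
)
DISTKEY (orderId)
SORTKEY (orderId);

-- Join on the shared distribution/sort key; rows with the same orderId sit on
-- the same slice and are already in sort order.
SELECT o.orderId, SUM(i.quantity) AS items
FROM orders o
JOIN order_items i ON o.orderId = i.orderId
GROUP BY o.orderId;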
