Question 5: You are using a Redshift cluster for analyzing and storing petabytes of data. You have c

Question 5: You are using a Redshift cluster for analyzing and storing petabytes of data. You have created various tables in the cluster which includes the fact tables and dimension tables. You want to select a table

distribution style such that it can minimize the impact of the re-distribution step by locating the data where it needs to be before the query is executed. Hence, you are distributing the fact table and one dimension

table on their common columns. Which of the following is the correct distribution style to use the fact table?

1. DISTKEY

2. Key

3. Data Key

4. Primary Key

5. None of the above

Correct Answer : 1 Exp : As per the AWS Documentation

When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of

the redistribution step by locating the data where it needs to be before the query is executed.

Distribute the fact table and one dimension table on their common columns.

Your fact table can have only one distribution key. Any tables that join on another key arent collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined and the size of the

joining rows. Designate both the dimension tables primary key and the fact tables corresponding foreign key as the DISTKEY.

Choose the largest dimension based on the size of the filtered dataset.

Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.

Choose a column with high cardinality in the filtered result set.

If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted predicate

to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.

Change some dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables, you can improve query performance significantly by distributing the entire table to all of the nodes. Using ALL

distribution multiplies storage space requirements and increases load times and maintenance operations, so you should weigh all factors before choosing ALL distribution.

To let Amazon Redshift choose the appropriate distribution style, dont specify DISTSTYLE.

Details: Category: AWS Certified Big Data - Specialty; Last Updated: 30 November -0001

Related Articles