Question-20: You want to store all the subscriptions detail for the users subscribed the courses on HadoopExam.com. However, we want to group the users based on the course type. Sample data as below.
As you know that Analytics group has the highest number of users like more than a million and BigData group has few thousands of users. Which of the following table design is suitable so that data can be evenly spread across 5 node cluster as well as you would always query data for groups having CourseGroup as part of condition and data should be ordered based on username?
Here, hash_prefix holds a prefix of a hash of the username. Which is first byte of the hash modulo four.
And also, there are 1000’s of CourseGroup in data.
- CREATE TABLE HE_GROUP (
coursegroup text,
username text,
email text,
first text,
last text,
location text,
hash_prefix int,
PRIMARY KEY ((coursegroup), username)
)
- CREATE TABLE HE_GROUP(
coursegroup text,
username text,
email text,
first text,
last text,
location text,
hash_prefix int,
PRIMARY KEY ((coursegroup, hash_prefix), username)
)
C.
CREATE TABLE HE_GROUP(
coursegroup text,
username text,
email text,
first text,
last text,
location text,
hash_prefix int,
PRIMARY KEY ((coursegroup, email ), username)
)
CREATE TABLE HE_GROUP(
coursegroup text,
username text,
email text,
first text,
last text,
location text,
hash_prefix int,
PRIMARY KEY ((username, email), username)
)
Ans: B
Exp: As question is asking that data should be queried based on the coursegroup, hence we should use the coursegroup in the primary key. As well data should be ordered based on username in a partition. Hence, we will be using username as clustering key. In every option we do have this.
Option-4 is out because coursegroup is not used as partition key.
Now the point is we have data is not balanced across the group Analytics group has millions of users and BigData group has few thousands of users. And we want data should be balanced across all the nodes in the cluster. Hence, we have to use hash_prefix of the username in partition key, which creates compound primary key.
Here, hash_prefix holds a prefix of a hash of the username. And you can use first byte of the hash modulo four. Both coursegroup and hash_prefix makes the compound partition key. In this case individual group will reside across multiple partition (4 in this case). And data would be evenly spread across the nodes in the cluster.
However, we also have another issue here. You should not have data across the partitions while using single query. But it satisfy the requirement asked in the question. Hence, we always have to find the good balance for particular use cases. If you do a lot of reads and groups don’t get too large, maybe changing the modulo value from four to two would be a good choice. On the other hand, if you do very few reads, but any given group can grow very large, changing from four to ten would be a better choice.