Question 48: You have 30 different applications hosted on AWS. Each one generates application logs and writes them to its own S3 bucket, so there are 30 different buckets. You need to load this data into a Redshift table regularly.

Which of the following accomplishes this efficiently?

1. Use the PARALLEL UPLOAD command

2. Use the COPY command with a manifest file

3. Use the PARALLEL UPLOAD command with a manifest file

4. Create a map-only job and run it on an EMR cluster

5. Use the Sqoop (SQL-to-Hadoop) utility

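Correct Answer: 2

Explanation: According to the AWS documentation, the COPY command can load a single file or multiple files in parallel. Because a manifest can list files from different buckets, a single COPY command with a manifest can load the logs from all 30 buckets.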

Use the COPY command to load a table in parallel from data files on Amazon S3. You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.

The syntax to specify the files to be loaded by using a prefix is as follows:

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;

The manifest file is a JSON-formatted file that lists the data files to be loaded. The syntax to specify the files to be loaded by using a manifest file is as follows:

copy <table_name> from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;

The table to be loaded must already exist in the database. The values for authorization provide the AWS authorization your cluster needs to access the Amazon S3 objects. The preferred method for authentication is to specify the IAM_ROLE parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary permissions. Alternatively, you can specify the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters and provide the access key ID and secret access key for an authorized IAM user as plain text.

The following example shows authentication using an IAM role.

copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Instead of supplying an object path for the COPY command, you supply the name of a JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix. You can use a manifest to load files from different buckets or files that do not share the same prefix. The following example shows the JSON to load files from different buckets and with file names that begin with date stamps.

{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}

The optional mandatory flag specifies whether COPY should return an error if the file is not found. The default of mandatory is false. Regardless of any mandatory settings, COPY will terminate if no files are found.
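For the 30-bucket scenario in the question, writing the manifest by hand would be tedious, so it could be generated instead. The following is a minimal Python sketch of that idea, not a definitive implementation. The function name build_manifest, the bucket names app-logs-01 through app-logs-30, and the object key 2013-10-04-applog are all hypothetical and chosen only for illustration. Uploading the resulting file to S3 is not shown.

```python
import json


def build_manifest(objects, mandatory=True):
    """Build a COPY manifest (as a dict) from (bucket, key) pairs.

    Each entry carries the full S3 URL of one file, as the manifest
    format requires, plus the optional "mandatory" flag.
    """
    entries = []
    for bucket, key in objects:
        entries.append({"url": f"s3://{bucket}/{key}", "mandatory": mandatory})
    return {"entries": entries}


# Hypothetical layout: one log file per application bucket (assumption).
objects = [(f"app-logs-{i:02d}", "2013-10-04-applog") for i in range(1, 31)]

manifest = build_manifest(objects)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Setting mandatory to true for every entry means COPY should return an error if any one of the 30 log files is missing, which is usually what a scheduled load wants.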

The following example runs the COPY command with the manifest in the previous example, which is named cust.manifest.

copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
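Because the load has to run regularly, the COPY statement might be assembled by a script rather than typed each time. The sketch below shows one way to do that in Python under stated assumptions: build_copy_statement is a hypothetical helper, it only accepts a simple unqualified table name, and it only builds the SQL text. Executing the statement against the cluster is left out. Note that the S3 URL and the role ARN are emitted as single-quoted string literals.

```python
import re


def build_copy_statement(table, manifest_url, iam_role_arn):
    """Return the text of a COPY ... MANIFEST statement.

    Hypothetical helper for illustration only. It rejects anything
    other than a plain identifier as the table name and refuses
    values containing single quotes instead of trying to escape them.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsupported table name: {table!r}")
    for value in (manifest_url, iam_role_arn):
        if "'" in value:
            raise ValueError(f"value must not contain a single quote: {value!r}")
    return (
        f"copy {table}\n"
        f"from '{manifest_url}'\n"
        f"iam_role '{iam_role_arn}'\n"
        "manifest;"
    )


statement = build_copy_statement(
    "customer",
    "s3://mybucket/cust.manifest",
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
)
print(statement)
```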

Details: Category: AWS Certified Big Data - Specialty
