Which of the following is useful to accomplish this efficiently?
1. You will be using the PARALLEL UPLOAD command
2. You will be using the COPY command with a manifest file
3. You will be using the PARALLEL UPLOAD command with a manifest file
4. You will be creating a map-only job and running that job on an EMR cluster
5. You will be using the Sqoop (SQL-to-Hadoop) utility
Correct Answer: 2
Explanation: According to the AWS documentation, the COPY command can load a single file or multiple files in parallel.
Use the COPY command to load a table in parallel from data files on Amazon S3. You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.
The syntax to specify the files to be loaded by using a prefix is as follows:
copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;
The manifest file is a JSON-formatted file that lists the data files to be loaded. The syntax to specify the files to be loaded by using a manifest file is as follows:
copy <table_name> from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;
The table to be loaded must already exist in the database. The values for authorization provide the AWS authorization your cluster needs to access the Amazon S3 objects. The preferred method for authentication is to
specify the IAM_ROLE parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary permissions. Alternatively, you can specify the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters and provide
the access key ID and secret access key for an authorized IAM user as plain text.
The following example shows authentication using an IAM role.
copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Instead of supplying an object path for the COPY command, you supply the name of a
JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix. You can use a manifest to load files
from different buckets or files that do not share the same prefix. The following example shows the JSON to load files from different buckets and with file names that begin with date stamps.
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
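A manifest like this can also be generated rather than written by hand. The following is a minimal Python sketch, assuming the bucket and object names from the example; the helper name build_manifest is hypothetical and is not part of any AWS tool.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return a manifest dict that lists each S3 object explicitly.

    build_manifest is a hypothetical helper for illustration only.
    """
    for url in urls:
        # A manifest entry needs the bucket name and full object path.
        if not url.startswith("s3://"):
            raise ValueError(f"not an S3 URL: {url}")
    return {"entries": [{"url": url, "mandatory": mandatory} for url in urls]}


# Object names taken from the example manifest.
urls = [
    "s3://mybucket-alpha/2013-10-04-custdata",
    "s3://mybucket-alpha/2013-10-05-custdata",
    "s3://mybucket-beta/2013-10-04-custdata",
    "s3://mybucket-beta/2013-10-05-custdata",
]

manifest_text = json.dumps(build_manifest(urls), indent=2)
print(manifest_text)
```

The resulting text would then be saved as the manifest file and uploaded to Amazon S3 before COPY is run; that upload step is left out of the sketch.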
The optional mandatory flag specifies whether COPY should return an error if the file is not found. The default value of mandatory is false. Regardless of any mandatory settings, COPY will terminate if no files are found.
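The rules for the mandatory flag can be illustrated with a small local model. This Python sketch only mimics the behavior described above; it does not call Redshift or Amazon S3, and the function name resolve_manifest, the set of existing objects, and the object names are all assumptions made for the illustration.

```python
def resolve_manifest(entries, existing):
    """Local model of the mandatory rules; not an AWS API.

    entries:  list of manifest entries (dicts with "url" and optional "mandatory")
    existing: set of URLs assumed to exist in S3
    """
    to_load = []
    for entry in entries:
        url = entry["url"]
        if url in existing:
            to_load.append(url)
        elif entry.get("mandatory", False):  # mandatory defaults to false
            raise FileNotFoundError(f"mandatory file not found: {url}")
        # A missing file that is not mandatory is skipped.
    if not to_load:
        # Terminates regardless of the mandatory settings.
        raise FileNotFoundError("no files found")
    return to_load


entries = [
    {"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory": True},
    {"url": "s3://mybucket-alpha/2013-10-05-custdata"},  # optional, missing below
]
existing = {"s3://mybucket-alpha/2013-10-04-custdata"}

loaded = resolve_manifest(entries, existing)
print(loaded)
```

With these inputs the optional missing file is skipped, so only the first object is returned. Passing an empty set for existing raises an error on the first mandatory entry.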
The following example runs the COPY command with the manifest in the previous example, which is named cust.manifest.
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
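When the load is scripted, the same statement can be assembled from its parts. The sketch below is one possible way to do that in Python, assuming the table, manifest path, and role ARN from the example; copy_from_manifest is a hypothetical helper, and the S3 path and role ARN are emitted as quoted string literals.

```python
def copy_from_manifest(table, manifest_url, iam_role_arn):
    """Build a COPY statement that loads the files listed in a manifest.

    Hypothetical helper for illustration. The values are interpolated
    directly into the SQL text, so pass only trusted values.
    """
    return (
        f"copy {table}\n"
        f"from '{manifest_url}'\n"
        f"iam_role '{iam_role_arn}'\n"
        "manifest;"
    )


statement = copy_from_manifest(
    "customer",
    "s3://mybucket/cust.manifest",
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
)
print(statement)
```

The string would then be executed against the cluster with whatever SQL client is in use; the connection step is omitted here.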