Question 71: You are accessing data stored in AWS S3 from Spark, and TLS is enabled on the bucket. What do you need?

Answer: If the S3 bucket is TLS-enabled and you are using a custom jssecacerts truststore, make sure that your truststore includes the root Certificate Authority (CA) certificates that signed the Amazon S3 certificate.
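For example, a root CA certificate can be imported into a custom truststore with keytool. The alias, certificate file name, truststore path, and password below are placeholders for illustration:

keytool -importcert -alias amazon-root-ca -file AmazonRootCA1.pem -keystore /path/to/jssecacerts -storepass changeit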

Question 72: Your Spark cluster is installed on EC2 instances and needs to access data stored in an S3 bucket, but you do not want to provide credentials at submission time. How can you do that?

Answer: Since both EC2 and S3 are Amazon services, we can leverage IAM roles. In this mode of operation, authorization is associated with individual EC2 instances instead of with each Spark app or the entire cluster.

Run EC2 instances with instance profiles associated with IAM roles that have the permissions you want. Requests from a machine with such a profile authenticate without credentials.
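As a minimal sketch (the bucket name and path are made up for illustration): on an EC2 instance whose instance profile grants S3 access, the s3a connector obtains temporary credentials automatically, so the job sets no keys anywhere:

// No access or secret key is configured; hadoop-aws picks up
// temporary credentials from the EC2 instance profile.
val lines = spark.read.textFile("s3a://my-example-bucket/data/input.txt")
lines.show(5)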

Question 73: Cloudera also provides a way to store AWS bucket credentials so as to provide system-wide AWS access to a single predefined bucket. What is it?

Answer: Cloudera recommends that you use the Hadoop Credential Provider to set up AWS access because it provides system-wide AWS access to a single predefined bucket, without exposing the secret key in a configuration file or having to specify it at runtime.
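For example (the JCEKS path is a placeholder, and the key values are elided), the access and secret keys can be stored in a Java keystore with the hadoop credential command, so they never appear in configuration files:

hadoop credential create fs.s3a.access.key -value <your-access-key> -provider jceks://hdfs/user/spark/aws.jceks
hadoop credential create fs.s3a.secret.key -value <your-secret-key> -provider jceks://hdfs/user/spark/aws.jceks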

Question 74: What are the ways by which AWS bucket access can be controlled?

Answer: AWS access for users can be set up in two ways. You can either provide a global credential provider file that allows all Spark users to submit S3 jobs, or have each user submit their own credentials every time they submit a job.
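As an illustrative sketch of the per-user approach (the user name, paths, class, and bucket are placeholders), each user can point a job at their own credential provider file at submit time:

spark-submit \
  --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/alice/aws.jceks \
  --class com.example.MyApp myapp.jar s3a://my-example-bucket/input/

For the global approach, the same hadoop.security.credential.provider.path property can instead be set once in core-site.xml so that it applies to all Spark users.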

Question 75: You have got a complaint from the security department that AWS bucket access credentials are visible in a log file. Where is the mistake?

Answer: It is because the credentials are stored in a non-recommended way. You might have used one of the following methods for providing credentials to access the S3 bucket:

  • Specified credentials at runtime using configuration properties, something like this:

// Both keys are set in plain text and can surface in logs and job configurations.
sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")

  • Or you have configured these credentials in the core-site.xml file.
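For reference, the core-site.xml variant mentioned in the bullet above looks like the snippet below; since core-site.xml is a plain-text file, the keys are equally exposed:

<property>
  <name>fs.s3a.access.key</name>
  <value>...</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>...</value>
</property>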

Neither of the above configurations is recommended if you want your data to be fully secured. Use the Hadoop Credential Provider instead (see Question 73).