Question: Which methods should be avoided so that less data is shuffled across partitions?

Answer: When choosing an arrangement of transformations, minimize both the number of shuffles and the amount of data shuffled. Shuffles are expensive operations: all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles. Not all of these transformations are equal, however; some move far more data than others.
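
To make "not all of these transformations are equal" concrete, here is a minimal sketch in plain Python (a model, not Spark code) that counts how many records would cross the network for a per-key sum. It assumes the usual contrast between groupByKey, which ships every record, and reduceByKey, which combines values inside each partition before the shuffle. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def shuffle_records_without_combine(partitions):
    """Model of groupByKey: every (key, value) record leaves its partition."""
    return sum(len(part) for part in partitions)


def shuffle_records_with_combine(partitions):
    """Model of reduceByKey: values are summed per key inside each partition
    first, so each partition emits at most one record per distinct key."""
    emitted = 0
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        emitted += len(local)
    return emitted


# Hypothetical input: two partitions of (word, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

no_combine = shuffle_records_without_combine(partitions)
with_combine = shuffle_records_with_combine(partitions)
print(no_combine, with_combine)
```

The gap between the two counts grows with the number of repeated keys per partition, which is why a combining transformation is usually preferred when the goal is an aggregate per key.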

Question: If you have a small dataset that needs to be joined with a much larger dataset, what approach would you use?

Answer: When one dataset is small and the other is very large, consider using a broadcast variable, which avoids shuffling the large dataset and can improve overall performance. When the smaller dataset is small enough to fit in memory on a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
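
The lookup pattern described above can be sketched in plain Python as a map-side hash join. This is a model of the idea, not Spark code: in PySpark the hash table would roughly correspond to a value wrapped with SparkContext.broadcast, and the loop over partitions to a map or mapPartitions call. The function name and sample data are hypothetical, and inner-join semantics are assumed.

```python
def broadcast_hash_join(large_partitions, small_dataset):
    """Inner-join each partition of the large dataset against a hash table
    built once from the small dataset. No large record changes partition."""
    # "Driver" side: load the small dataset into a hash table.
    lookup = dict(small_dataset)

    joined_partitions = []
    for part in large_partitions:
        # "Executor" side: each partition does local lookups only.
        joined = [
            (key, (value, lookup[key]))
            for key, value in part
            if key in lookup
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: (country_id, code) is small, (country_id, user) is large.
small = [(1, "IN"), (2, "US")]
large = [
    [(1, "alice"), (3, "bob")],
    [(2, "carol"), (1, "dave")],
]

result = broadcast_hash_join(large, small)
print(result)
```

The output keeps the partition layout of the large dataset, which is the point: only the small hash table is copied around.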

Question: When is it advantageous to have a shuffle?

Answer: A shuffle can be advantageous when you are working with a huge volume of data, the application is compute-intensive, and more processing power is available than the current partitioning can use. Shuffling then spreads the data so that it can be processed in parallel on all the available CPUs. Another use case is aggregation: if you apply an aggregate function to a huge volume of data, the single driver thread that merges the results becomes the bottleneck. Instead, shuffle the data across the nodes and apply the aggregate function locally on each node, so that the data is first aggregated in parallel and only the final aggregation is done in the driver program.
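
As an illustration of the aggregation case, the following plain-Python sketch (again a model, not Spark code) computes a mean in two stages: a partial aggregate per partition, then a small final merge standing in for the driver. The function names and sample data are hypothetical.

```python
def partial_aggregate(partition):
    """Runs once per partition: reduce many values to one (sum, count) pair."""
    total = 0
    count = 0
    for value in partition:
        total += value
        count += 1
    return (total, count)


def final_aggregate(partials):
    """Runs on the "driver": merge one small pair per partition into a mean."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical data spread over three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

# Each call is independent, so in a cluster these would run in parallel.
partials = [partial_aggregate(p) for p in partitions]
mean = final_aggregate(partials)
print(partials, mean)
```

The final step only sees one pair per partition, so its cost depends on the number of partitions rather than on the number of records.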

Question: Which two resources are used by a Spark application but are managed by neither YARN nor Spark?

Answer: The two main resources that Spark and YARN manage are CPU and memory. Disk and network I/O affect Spark performance as well, but neither Spark nor YARN actively manages them.

Question: When you deploy Spark on the YARN cluster manager, how does ApplicationMaster memory come into the picture?

Answer: The ApplicationMaster, a non-executor container that can request containers from YARN, requires memory and CPU that must be accounted for. In client deployment mode, these default to 1024 MB and one core. In cluster deployment mode, the ApplicationMaster runs the Spark application driver, so consider bolstering its resources with the --driver-memory and --driver-cores flags.
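
One plausible way to account for the ApplicationMaster container is sketched below in plain Python. All default values in it are assumptions, not statements from this answer: a 512 MB ApplicationMaster heap in client mode, a memory overhead of 10% with a 384 MB floor, and YARN rounding each request up to a multiple of a 1024 MB minimum allocation. Check them against the documentation for your Spark and YARN versions. The function name is hypothetical.

```python
import math


def am_container_mb(
    am_memory_mb=512,             # assumed client-mode AM heap default
    overhead_factor=0.10,         # assumed overhead fraction
    min_overhead_mb=384,          # assumed overhead floor
    yarn_min_allocation_mb=1024,  # assumed YARN minimum allocation
):
    """Estimate the container size YARN grants for the ApplicationMaster."""
    overhead = max(int(am_memory_mb * overhead_factor), min_overhead_mb)
    requested = am_memory_mb + overhead
    # Assumption: YARN rounds up to a multiple of the minimum allocation.
    multiples = math.ceil(requested / yarn_min_allocation_mb)
    return multiples * yarn_min_allocation_mb


client_mode = am_container_mb()
# Cluster mode with a hypothetical --driver-memory of 4 GB.
cluster_mode = am_container_mb(am_memory_mb=4096)
print(client_mode, cluster_mode)
```

Under these assumptions the client-mode container comes out at the 1024 MB mentioned above, and raising the driver memory in cluster mode costs noticeably more than the heap size alone because of the overhead and the rounding.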
