

Question: Which shared variables does the Spark framework provide?

Answer: In Spark, shared variables are variables that share data across the nodes of a cluster. Spark provides two kinds, each with a different purpose.
- Broadcast variables: Read-only variables cached on each node of the cluster. They cannot be updated on an individual node. Use them when the same data must be available on every node during data processing.
- Accumulators: Variables that tasks on each node can update, but only by adding to them. The updates sent by the individual nodes are aggregated into the final value, which is read on the driver.
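
As a minimal PySpark sketch of a broadcast variable: the lookup table, the sample codes and the helper name resolve_country are made up for illustration and do not come from the answer above. The PySpark import sits inside main() so that the helper can be tried without a Spark installation.

```python
def resolve_country(code, lookup):
    """Plain-Python helper: map a country code to a name, or 'unknown'."""
    return lookup.get(code, "unknown")


def main():
    # Imported here so the helper above works without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # The driver creates the broadcast variable once; executors read
    # their cached, read-only copy through .value.
    lookup_bc = sc.broadcast({"IN": "India", "US": "United States"})

    codes = sc.parallelize(["IN", "US", "FR"])
    names = codes.map(lambda c: resolve_country(c, lookup_bc.value)).collect()
    print(names)

    sc.stop()
```

Calling main() where PySpark is installed runs the job. The lambda reads lookup_bc.value rather than capturing the dictionary directly, so the data is shipped once per executor instead of once per task.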

Question: In which scenarios would you use the broadcast and accumulator shared variables?

Answer: Broadcast variables: Use them as cached data on each node. When a small dataset is needed frequently throughout the data processing, ask Spark to cache it on every node with a broadcast variable, and refer to the cached copy during the calculation.
You set the broadcast variable in the driver program, and the worker nodes of the cluster retrieve it. Remember that a worker retrieves and caches a broadcast variable only when the first read request is sent.
Accumulators: Think of them as global counters. They are not read-only: on each worker node, the executor updates the counter independently. The driver program then collects the updates from all worker nodes and produces the aggregated result.
Use them when you need to count something, such as how many messages were not processed correctly. Each node counts its own unprocessed messages, and at the end the driver adds up all the counts, so you learn how many messages were not processed.
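
The message-counting scenario could look roughly like this in PySpark. This is a sketch under assumptions: the sample messages, the rule that a message is valid when it parses as a number, and the names parse_amount and bad_records are all hypothetical.

```python
def parse_amount(line):
    """Return the line as a float, or None if it is malformed."""
    try:
        return float(line)
    except ValueError:
        return None


def main():
    # Imported here so parse_amount can be used without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Accumulator created on the driver, starting at 0.
    bad_records = sc.accumulator(0)

    def check(line):
        # Runs on the executors; tasks can only add to the accumulator.
        if parse_amount(line) is None:
            bad_records.add(1)

    messages = sc.parallelize(["10.5", "oops", "3", "n/a"])
    messages.foreach(check)  # foreach is an action, so the updates run now

    # Only the driver reads the aggregated value.
    print(bad_records.value)

    sc.stop()
```

The update is done inside an action (foreach) rather than a transformation, because Spark only guarantees that each task's update is applied exactly once for accumulators updated in actions.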

Question: How do you define the ETL process?

Answer: ETL stands for extract, transform and load. It is where we create data pipelines for data movement and transformation. In short, there are three stages (nowadays the steps can be reordered, and sometimes the process is ELT):
- Extract: You extract data from source systems such as RDBMS, flat files, social networking feeds, web log files, etc. The data can be in various formats such as XML, CSV, JSON, Parquet and Avro, and the retrieval frequency can be defined as daily, hourly, etc.
- Transform: In this step you transform the data into what your downstream system expects. For example, you can create a JSON file from a text file. Besides changing file formats, you can separate valid data from invalid data. This step usually consists of many sub-steps that clean the data as the next step expects.
- Load: This step sends the data to the sink you have defined. In the Hadoop world it could be HDFS, Hive tables, etc. For an RDBMS it could be MySQL or Oracle, and for NoSQL it could be Cassandra or MongoDB.
However, please note that Spark is not an ETL tool, although an ETL job can be implemented entirely with the Spark framework.
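
To make the three stages concrete, here is a small PySpark sketch that turns a text file into JSON lines. The "user,clicks" record layout, both paths and the helper name parse_record are assumptions chosen for illustration only.

```python
import json

# Hypothetical locations; replace with real source and sink paths.
INPUT_PATH = "/tmp/clicks.txt"
OUTPUT_PATH = "/tmp/clicks_json"


def parse_record(line):
    """Turn a 'user,clicks' line into a dict, or None if it is invalid."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not parts[1].isdigit():
        return None
    return {"user": parts[0], "clicks": int(parts[1])}


def main():
    # Imported here so parse_record can be used without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Extract: read raw text lines from the source.
    raw = sc.textFile(INPUT_PATH)

    # Transform: parse each line and drop the invalid ones.
    cleaned = raw.map(parse_record).filter(lambda r: r is not None)

    # Load: write one JSON document per line to the sink directory.
    cleaned.map(json.dumps).saveAsTextFile(OUTPUT_PATH)

    sc.stop()
```

The parsing rule lives in a plain function so it can be checked on its own, for example parse_record("alice, 3") gives {"user": "alice", "clicks": 3} and parse_record("broken") gives None.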

Question: How do you save data from an RDD to a text file?

Answer: Use the RDD method saveAsTextFile(destination_path). Similar methods are available for other file formats.
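
A short PySpark sketch of this call follows. The sample data, the output path /tmp/word_counts and the helper name to_line are hypothetical.

```python
def to_line(pair):
    """Format a (word, count) pair as one tab-separated text line."""
    word, count = pair
    return f"{word}\t{count}"


def main():
    # Imported here so to_line can be used without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    counts = sc.parallelize([("spark", 3), ("hadoop", 1)])

    # Each element is written as one line of text. The destination is a
    # directory containing one part file per partition.
    counts.map(to_line).saveAsTextFile("/tmp/word_counts")

    sc.stop()
```

Formatting each element before saving keeps the output readable; without the map step, Spark writes the string representation of each element. The call fails if the destination directory already exists.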

Question: What is a Spark DataFrame and what are its basic properties?

Answer: You can picture a Spark DataFrame as a table in a relational database. It has the following features:
- It is distributed over the nodes of the Spark cluster.
- Its data is organized in columns.
- It is immutable (to modify it, you have to create a new DataFrame).
- It is in-memory.
- You can apply a schema to its data.
- It gives you a domain-specific language (DSL) for working with the data.
- It is evaluated lazily.
In one line: a DataFrame is an immutable distributed collection of data organized into named columns. DataFrames take away much of the complexity of working with RDDs.
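
Several of these properties (schema, DSL, immutability, lazy evaluation) show up in a few lines of PySpark. This is a sketch with made-up rows and column names; adult_names is a hypothetical plain-Python reference for what the DataFrame query computes.

```python
# Hypothetical sample data: (name, age) rows.
ROWS = [("Asha", 34), ("Ravi", 15)]


def adult_names(rows):
    """Plain-Python reference for the DataFrame query in main()."""
    return [name for name, age in rows if age >= 18]


def main():
    # Imported here so adult_names can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-sketch").getOrCreate()

    # Named, typed columns: the schema is given as a DDL-style string.
    df = spark.createDataFrame(ROWS, schema="name string, age int")
    df.printSchema()

    # DSL-style query. filter and select return a new DataFrame (df itself
    # is unchanged) and nothing is computed yet.
    adults = df.filter(df.age >= 18).select("name")

    # show() is an action, so the query is evaluated here.
    adults.show()

    spark.stop()
```

The same result could be produced on the underlying RDD with map and filter calls, but the DataFrame version states the intent in terms of column names.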

