Question 6: Which kinds of data processing does Spark support?

Answer: Spark supports three kinds of data processing: batch, interactive (via the Spark Shell), and stream processing, all through a unified API and common data structures.

Question 7: How do you define SparkContext?

Answer: SparkContext is the entry point for a Spark job. Every Spark application starts by instantiating a SparkContext; in that sense, a Spark context constitutes a Spark application.

SparkContext represents the connection to a Spark execution environment (deployment mode).

A Spark context can be used to create RDDs, accumulators and broadcast variables, access Spark services and run jobs.

A Spark context is essentially a client of Spark’s execution environment and acts as the master of your Spark application.
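A minimal sketch of these uses, assuming a local master URL and an illustrative application name and data set:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup: the master URL, application name and data are examples only.
val conf = new SparkConf().setMaster("local[*]").setAppName("SparkContextDemo")
val sc = new SparkContext(conf)

// Create an RDD from a local collection
val numbers = sc.parallelize(1 to 100)

// Broadcast variable: a read-only value shipped to every executor
val factor = sc.broadcast(2)

// Accumulator: tasks add to it, the driver reads the aggregated value
val evenCount = sc.longAccumulator("evenCount")

numbers.foreach(n => if (n % 2 == 0) evenCount.add(1))

println(s"Doubled sum: ${numbers.map(_ * factor.value).sum()}")
println(s"Even numbers: ${evenCount.value}")

sc.stop()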

Question 8: How can you define SparkConf?

Answer: Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf that you pass to your SparkContext. SparkConf lets you configure the common properties (e.g. the master URL and application name) as well as arbitrary key-value pairs through the set() method. For example, we could initialize an application with two threads as follows:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")
val sc = new SparkContext(conf)

Note that we run with local[2], meaning two threads, which represents “minimal” parallelism and can help detect bugs that only show up when we run in a distributed context.
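As a further sketch of the set() method (spark.executor.memory and spark.serializer are standard Spark properties; the values chosen here are examples only):

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")                                      // example value
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // example value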

 

Question 9: What are the ways to configure Spark properties, ordered from least to most important?

Answer: There are the following ways to set properties for Spark and user programs, ordered from least to most important (a more important source overrides a less important one); see the sketch after the list:

  • conf/spark-defaults.conf - the default
  • --conf - the command line option used by spark-shell and spark-submit
  • SparkConf - properties set programmatically in the application code
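A hedged sketch of all three levels (spark.executor.memory is a standard Spark property; the values, application class and jar name are placeholders):

# conf/spark-defaults.conf (lowest precedence)
spark.executor.memory  2g

# --conf on the command line, used by spark-shell and spark-submit (overrides spark-defaults.conf)
spark-submit --conf spark.executor.memory=4g --class example.MyApp my-app.jar

// SparkConf in the application code (highest precedence)
val conf = new SparkConf().set("spark.executor.memory", "8g")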

 

Question 10: What is the default level of parallelism in Spark?

Answer: The default level of parallelism is the number of partitions Spark uses when a user does not specify one explicitly (for example, when no partition count is passed to parallelize or to operations such as reduceByKey). It can be tuned through the spark.default.parallelism property.
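A minimal sketch, assuming a SparkContext named sc is already available (as in spark-shell):

println(sc.defaultParallelism)            // number of partitions used when none are requested

val rdd = sc.parallelize(1 to 1000)       // uses the default level of parallelism
println(rdd.getNumPartitions)

val rdd8 = sc.parallelize(1 to 1000, 8)   // an explicit partition count overrides the default
println(rdd8.getNumPartitions)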