Question 16: How do you construct a new RDD?

Answer: There are two main ways to create an RDD:

  • If you already have a collection of data in the driver program (e.g. a Java or Scala collection), you can parallelize it to create an RDD. This is useful for testing and prototyping.
  • You can create an RDD that refers to data in an external storage system, such as HDFS, the local filesystem, or the output of an RDBMS SQL query.
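
A minimal sketch of both approaches, assuming a local Spark installation. The session settings, the temporary file, and its contents are illustrative and not part of the original answer. In spark-shell, `spark` and `sc` already exist, so the first two definitions can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import java.nio.file.Files

// Local session for experimentation; master and app name are illustrative choices.
val spark = SparkSession.builder().master("local[2]").appName("rdd-creation-sketch").getOrCreate()
val sc = spark.sparkContext

// 1. Parallelize an in-memory collection (handy for tests and prototypes).
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
val collectionCount = fromCollection.count()

// 2. Refer to data in an external file system.
//    A temporary local file stands in for HDFS or another store here.
val path = Files.createTempFile("courses", ".txt")
Files.write(path, "Hadoop,Spark\nScala,Java\n".getBytes("UTF-8"))
val fromFile = sc.textFile(path.toUri.toString)
val fileLineCount = fromFile.count()

spark.stop()
```

For HDFS, the same `textFile` call would take an `hdfs://` URI instead of a local one.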

Question 17: What are the standard ways to pass functions to Spark when using Scala?

Answer: There are two main ways to pass functions:

  • Anonymous functions (lambda functions): A good option when the functionality you need to pass is small and simple. In the example below, each line of data is split on commas and the resulting values are returned.

val he_training = hadoopexamDataFile.map(he_course => he_course.split(","))

  • Static singleton methods: Use these when you need to perform more complex operations on the data. In Scala, such methods are defined in a singleton object; if you know Java, they are comparable to static methods, which are associated with the class rather than with an instance.

object sampleFunc { def dataSplit(s: String): Array[String] = s.split(",") }

sampleRDD.flatMap(sampleFunc.dataSplit(_))

Question 18: You have a huge dataset and want to extract a sample from it. Spark provides the sample function shown below.

sample(withReplacement, fraction, seed)

How does the withReplacement argument affect the output?

Answer: Whenever you want sample data from a huge dataset, you can use the sample method. The first argument, withReplacement, controls whether an element can be selected more than once.

If withReplacement is true, each element is put back after it is drawn, so it may appear multiple times in the sample, and one draw does not affect the next. The draws are independent, which means the covariance between any two of them is zero. If withReplacement is false, each element can appear at most once in the sample.
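
The difference can be observed with a small sketch like the following. The dataset, fraction, and seed are arbitrary illustrative values, and the sample sizes are only approximately fraction times the dataset size, so no exact counts are claimed.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are illustrative choices.
val spark = SparkSession.builder().master("local[2]").appName("sample-sketch").getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 1000)

// Without replacement: each element is kept at most once.
val withoutRepl = data.sample(withReplacement = false, fraction = 0.1, seed = 42L).collect()

// With replacement: the same element may be drawn more than once.
val withRepl = data.sample(withReplacement = true, fraction = 0.1, seed = 42L).collect()

// The input has no duplicates, so a sample without replacement has none either.
val noDuplicatesWithoutRepl = withoutRepl.distinct.length == withoutRepl.length

println(s"without replacement: ${withoutRepl.length} elements, all distinct = $noDuplicatesWithoutRepl")
println(s"with replacement: ${withRepl.length} elements, ${withRepl.distinct.length} distinct")

spark.stop()
```

Passing the same seed makes a run repeatable, which helps when comparing the two modes.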

Question 19: What is PairRDD?

Answer: A PairRDD represents key-value data. You can think of each element as a tuple of two values, such as (x, y), where x is the key and y is the value.
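
As a rough sketch, a PairRDD can be built by mapping each record to a two-element tuple. The input lines and the "course,score" layout below are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are illustrative choices.
val spark = SparkSession.builder().master("local[2]").appName("pair-rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input: one "course,score" record per line.
val lines = sc.parallelize(Seq("spark,3", "hadoop,5", "spark,4"))

// Mapping each line to a (key, value) tuple yields an RDD[(String, Int)].
val pairs = lines.map { line =>
  val parts = line.split(",")
  (parts(0), parts(1).toInt)
}

// Key-value specific operations such as keys and values become available.
val keys = pairs.keys.collect().toList
val values = pairs.values.collect().toList

spark.stop()
```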

Question 20: What is the main advantage of PairRDD?

Answer: The main advantage of key-value pairs is that you can operate on the data belonging to each key in parallel, for example in joins and aggregations.
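
A small sketch of a per-key aggregation and a join, assuming two illustrative datasets. The course names, scores, and trainer names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are illustrative choices.
val spark = SparkSession.builder().master("local[2]").appName("pair-ops-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical key-value data: (course, score) and (course, trainer).
val scores = sc.parallelize(Seq(("spark", 3), ("hadoop", 5), ("spark", 4)))
val trainers = sc.parallelize(Seq(("spark", "Ann"), ("hadoop", "Bob")))

// Aggregation: values that share a key are combined, key by key.
val totals = scores.reduceByKey(_ + _).collect().toMap

// Join: records from both RDDs are matched on their key.
val joined = scores.join(trainers).collect().toSet

spark.stop()
```

Here `totals` maps each course to the sum of its scores, and `joined` pairs every score with the trainer registered under the same course.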