Question 71: What are Avro files?

Answer: Avro is a serialization system that uses a compact binary encoding. One of its best features is that it is language independent; by convention, Avro data files use the .avro file extension. By default, Avro data files are not compressed, but enabling compression is recommended to reduce disk usage and improve read and write performance. Avro data files support Deflate and Snappy compression; Snappy is faster, while Deflate is slightly more compact.

You do not need to specify any configuration to read a compressed Avro data file. However, to write an Avro data file, you must specify the type of compression, and how you specify it depends on the component you use.
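
For example, with Spark's built-in Avro support (Spark 2.4 and later; older releases use the external spark-avro package for the same format), the write codec is chosen through a configuration property. A minimal sketch in Scala, with placeholder paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroCompressionExample").getOrCreate()

// Reading a compressed Avro data file needs no extra configuration;
// the codec is recorded inside the file itself.
val df = spark.read.format("avro").load("/data/input.avro")

// Writing requires choosing a codec explicitly, e.g. "snappy" or "deflate".
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
df.write.format("avro").save("/data/output_snappy.avro")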

Question 72: What are the limitations when you process Avro files using Spark?

Answer: When Spark reads data, it converts it into its own internal data types. The following points are therefore important when you work with Avro files in Spark, keeping in mind that Avro data also carries its own schema (a short sketch follows the list):

  • Enumerated types are erased by Spark - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
  • All output fields become unions - Spark writes everything as a union of the given type with a null option.
  • The Avro schema is changed - Spark reads everything into an internal representation. Even if you just read and then write the Avro data, the schema of the output is different and follows Spark's conventions.
  • Spark reorders the schema - Spark reorders the elements in its schema when writing them to disk, so that the elements being partitioned on are the last elements.
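
A minimal sketch of this conversion, assuming a SparkSession named spark (as in the earlier sketch) and an illustrative file /data/events.avro whose Avro schema contains an enum field:

// The enum field shows up as a plain string type in the Spark schema.
val df = spark.read.format("avro").load("/data/events.avro")
df.printSchema()

// Writing the data back produces a schema generated by Spark: fields become
// nullable unions and any partition columns are moved to the end.
df.write.format("avro").save("/data/events_rewritten.avro")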

Question 73: How do you read and write parquet files in Apache Spark?

Answer: Reading and writing Parquet files is quite simple in Spark; you can use SQLContext and DataFrame to read and write Parquet files, as shown below.

Read -> SQLContext.read.parquet("parquet file input path")

Write -> DataFrame.write.parquet("parquet output file path")
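
A runnable sketch using the SparkSession entry point that replaces SQLContext in Spark 2.x (on older releases, sqlContext.read.parquet and df.write.parquet behave the same way); the paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetExample").getOrCreate()

// Read a Parquet file into a DataFrame.
val df = spark.read.parquet("/data/input.parquet")

// Write the DataFrame back out as Parquet.
df.write.parquet("/data/output.parquet")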

 

Question 74: How can you create an application when the code is written in Java or Scala?

Answer: You can use the Maven build tool to build applications written in Java or Scala. For Scala, you can also use SBT, as in the sketch below.
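
For illustration, a minimal build.sbt sketch for packaging a Spark application with SBT; the project name, Scala version, and Spark version below are assumptions, not values from this training:

name := "hadoopexam-example"
version := "1.0"
scalaVersion := "2.12.10"

// Spark is supplied by the cluster at runtime, so it is marked "provided".
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
)

Running "sbt package" then produces the application JAR that you pass to spark-submit (see Question 75).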

Question 75: How do you submit Spark applications?

Answer: To submit a Spark application, you use the spark-submit utility. A common invocation for submitting to YARN looks like this:

spark-submit \
  --class com.hadoopexam.analytics.WordCount \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.eventLog.dir=hdfs://hadoopexam:8020/user/spark/hadoopexamlog" \
  lib/hadoopexam-example.jar \
  10
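
For reference, a minimal sketch of what a driver class like com.hadoopexam.analytics.WordCount might look like; the paths and logic are illustrative assumptions, not the actual training code:

package com.hadoopexam.analytics

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Command-line arguments (such as the trailing 10 above) arrive in args.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///user/spark/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///user/spark/wordcount_output")
    sc.stop()
  }
}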