Question 91: What general problem do you see when you need to run PySpark?

Answer: Managing dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, you must understand that Spark application code runs in executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy or nltk, the Spark executors need access to those libraries when the code runs on remote worker nodes.
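A minimal sketch of the problem, assuming NumPy is installed on the driver but possibly missing from the workers; the lambda passed to map() executes inside executor processes, so the import must succeed on every worker node, not just on the driver:

from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("dependency-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])
# np.mean runs on the executors; if NumPy is not installed there,
# the task fails with an ImportError even though the driver has NumPy.
means = rdd.map(lambda row: float(np.mean(row))).collect()
print(means)  # [1.5, 3.5]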

Question 92: In which scenarios is Python preferred over Scala and Java for Spark applications?

Answer: Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists prefer Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus.

Question 93: If you need a single file while running your program on various nodes in a cluster, how can you provide it to your application?

Answer: If you need only a single file that must be transferred to each node, you can use the --py-files option when submitting the application with spark-submit and specify the local path to the file. Alternatively, you can do it programmatically with the sc.addPyFile() function.
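A minimal sketch of both approaches, assuming a local helper module named helper.py (a hypothetical file name):

# Option 1: ship the file at submit time
spark-submit --py-files /path/to/helper.py my_app.py

# Option 2: add the file programmatically inside the application
from pyspark import SparkContext

sc = SparkContext(appName="single-file-demo")
sc.addPyFile("/path/to/helper.py")
import helper  # now importable on the driver and on every executor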

Question 94: When you need functionality from multiple files, how do you provide them?

Answer: If you use functionality from multiple Python files, you can package them as an egg or zip archive, because the --py-files flag also accepts a path to an egg or zip file.
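A minimal sketch of the packaging workflow, assuming your code lives in a package directory named mylib/ with an __init__.py (a hypothetical layout):

# On the client: bundle the package into a zip archive
zip -r mylib.zip mylib/

# Ship the archive alongside the application
spark-submit --py-files mylib.zip my_app.py

# Inside my_app.py the package then imports normally on every node
from mylib import utils  # hypothetical module inside the package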

Question 95: What can be a problem with distributing egg files?

Answer: Sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client's CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries that the worker hosts should use.
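A minimal sketch of that recommendation, assuming the required packages (NumPy, SciPy, pandas, and so on) are pre-installed into a Python interpreter at /opt/python/bin/python3 on every cluster host (a hypothetical path):

spark-submit \
  --conf spark.pyspark.python=/opt/python/bin/python3 \
  --conf spark.pyspark.driver.python=/opt/python/bin/python3 \
  my_app.py

# Equivalently, set the interpreter via environment variables before submitting:
# export PYSPARK_PYTHON=/opt/python/bin/python3
# export PYSPARK_DRIVER_PYTHON=/opt/python/bin/python3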