Question 86: What general problems do you see when you need to run PySpark?
Answer: One of the major problems with PySpark is managing third-party dependencies and making them available to Python jobs on a cluster. To understand which dependencies are required on the cluster, remember that Spark applications run in executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy, the Spark executors need access to those libraries when the tasks run on remote hosts.
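A minimal sketch of why this matters (the app name and data are illustrative): the lambda below runs inside executor processes, so NumPy must be importable on every worker host, not just on the driver.

    from pyspark.sql import SparkSession
    import numpy as np

    spark = SparkSession.builder.appName("dependency-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize([[1.0, 2.0], [3.0, 4.0]])

    # np.mean executes inside executor tasks; if NumPy is missing on a
    # worker host, the task fails with an ImportError even though the
    # driver can import it fine.
    means = rdd.map(lambda row: float(np.mean(row))).collect()
    print(means)  # [1.5, 3.5]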
Question 87: In which scenarios is Python preferred over Scala and Java for Spark applications?
Answer: Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus. So for machine-learning work, PySpark is often the preferred choice.
Question 88: If you need a single shared file while running your program on various nodes in a cluster, how can you provide it to your application?
Answer: If you need only a single shared file that must be transferred to each node, you can use the --py-files option when submitting the application with spark-submit and specify the local path to the file. Alternatively, you can do it programmatically with the sc.addPyFile() function.
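As a sketch (the file and application names here are hypothetical), the two approaches look like this:

    # At submit time: ship helper.py to every node along with the job.
    spark-submit --py-files /local/path/helper.py my_app.py

    # Programmatically, before the file is first used by a task:
    from pyspark import SparkContext

    sc = SparkContext(appName="shared-file-demo")
    sc.addPyFile("/local/path/helper.py")  # distributes the file to executors
    import helper                          # now importable inside tasks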
Question 89: When you need functionality from multiple files, how do you provide them?
Answer: If you use functionality from multiple Python files, you can package them into an egg or zip file, because the --py-files flag also accepts a path to an egg or zip archive.
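For example (the package name mylib and the file names are hypothetical), you can zip the package directory and pass the archive to spark-submit:

    # Create a zip containing the package directory mylib/ ...
    zip -r mylib.zip mylib/

    # ... and ship it with the application; executors can then
    # run 'import mylib' inside tasks.
    spark-submit --py-files mylib.zip my_app.py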
Question 90: What can be a problem with distributing egg files?
Answer: Sending egg files (collections of Python files) is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client's CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
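A sketch of that approach (the interpreter path is an example; adjust it to wherever the packages are installed on your cluster):

    # Tell worker hosts which Python to run; this interpreter must exist
    # at the same path on every node and already have NumPy, SciPy,
    # pandas, etc. installed.
    export PYSPARK_PYTHON=/opt/anaconda/bin/python

    # Equivalently, set it per application via configuration (Spark 2.1+):
    spark-submit --conf spark.pyspark.python=/opt/anaconda/bin/python my_app.py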