A Spark cluster sometimes needs third-party packages, such as graphframes or kafka (graphframes is used as the example below). According to the official documentation, the --packages command-line option usually solves the problem:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.6.0-spark2.2-s_2.11
This command downloads graphframes and its dependency jars from the network and saves them to $HOME/.ivy2/jars. But what if the Spark cluster is offline or network conditions are poor?
- Find a host with Internet access.
- On that host, use the --packages command-line option to download the packages, then take the jars from $HOME/.ivy2/jars.
- If you use pyspark, you also need to extract the relevant Python package. For graphframes, this means decompressing graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar, zipping the graphframes folder found inside it, and adding the zip to the PYTHONPATH environment variable:
unzip graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar
zip -r graphframes.zip graphframes
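Why zipping the folder and putting the zip on PYTHONPATH works: Python's import machinery (zipimport) can load packages directly from a zip archive that appears on sys.path. A minimal sketch, with a hypothetical demo_pkg standing in for the real graphframes package:

```python
import os
import sys
import tempfile
import zipfile

# Build a zip whose top level is a package directory, mirroring
# `zip -r graphframes.zip graphframes` above. "demo_pkg" is a
# hypothetical stand-in for graphframes.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "demo_pkg.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "VERSION = '0.6.0'\n")

# Putting the zip on sys.path is what PYTHONPATH does at interpreter
# startup; zipimport then loads the package straight from the archive.
sys.path.insert(0, zip_path)
import demo_pkg

print(demo_pkg.VERSION)  # importable without unzipping to a directory
```

This is why the zip must contain the graphframes folder itself at its top level, not just the folder's contents.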
- Replace the --packages option with the --jars option, listing the jars you just downloaded. The command line becomes:
export PYTHONPATH=$PYTHONPATH:/path/to/graphframes.zip
$SPARK_HOME/bin/spark-shell --jars /path/to/graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar,/path/to/xxx.jar
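Note that PYTHONPATH as exported above is only seen by the driver process. If the Python package is also needed on the executors, Spark's --py-files option can distribute the zip for you. A possible pyspark invocation (paths are placeholders):

```shell
$SPARK_HOME/bin/pyspark \
  --jars /path/to/graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar \
  --py-files /path/to/graphframes.zip
```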