When working with an Apache Spark environment, you may need to install third-party libraries or custom packages. In this post I share the steps for installing Java or Scala libraries on Azure Synapse serverless Apache Spark pools. For Java or Scala code, the libraries are packaged as JAR files that you add to the pool. For Python code, the libraries are packaged as wheel (.whl) files. You can also install Python packages that are publicly available on the Python Package Index (PyPI). The specifics of installing Python packages will be covered in a future post.
If you prefer a video of how to find and install JAR files, check out my video showing how to get the Kafka dependencies installed in your Synapse Spark pool.
Installing JAR files (for Java/Scala libraries)
Adding libraries to the Java Virtual Machine (Java/Scala code) is a common way to get more functionality in Apache Spark. The Spark source code has a whole directory of external Spark modules (https://github.com/apache/spark/tree/master/external) that can be added but are not installed on every Spark environment by default. External libraries for the JVM are packaged as JAR files. There are a variety of ways to add extra JAR files to the set of known libraries on your Spark cluster, but with Synapse Spark pools the options are a little different than with a standard Apache Spark installation.
I recommend using the Workspace packages feature to add JAR files and extend what your Synapse Spark pools can do. These JAR files can be either third-party code or custom-built libraries. In the screenshots for this post I use some dependencies for running Apache Kafka on a Synapse Apache Spark 3.1 pool, but many other libraries are available to add.
Building or downloading JARs
When creating custom Scala libraries, be sure that the Scala version matches what your Spark pool has installed. Currently, a Spark 3.1 pool uses Scala 2.12 and a Spark 2.4 pool uses Scala 2.11. To see more details of what is installed on your pool, you can check out the runtime documentation pages for Spark 3.1 and Spark 2.4. Packaging JAR files is beyond the scope of this article, but if you are new to it I recommend searching for how to build a “fat jar” so that it includes all the dependencies (be sure to mark Spark libraries as provided if they are part of your dependencies).
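As a rough illustration of the “provided” advice above, here is a minimal build.sbt sketch for a library targeting a Spark 3.1 pool. The project name, the exact version numbers, and the choice of the Kafka connector as a bundled dependency are my assumptions for the example, so check them against the runtime documentation for your pool. It also assumes the sbt-assembly plugin is used to build the fat jar.

```scala
// build.sbt: a minimal sketch, not a complete build definition.
// The project name and all version numbers below are illustrative assumptions.
name := "my-synapse-lib"          // hypothetical project name
version := "0.1.0"
scalaVersion := "2.12.10"         // a Scala 2.12 release, to match a Spark 3.1 pool

libraryDependencies ++= Seq(
  // Spark itself is already on the pool, so mark it "provided"
  // to keep it out of the fat jar.
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",
  // A dependency the pool does not ship; this one is bundled into the fat jar.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.1.2"
)

// project/plugins.sbt would need a line like the following (version is an assumption):
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```

With the plugin in place, running sbt assembly should produce a single JAR under the target directory that you can upload. Depending on the dependencies, you may also need to configure a merge strategy for conflicting files.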
For open source libraries you can download the correct JAR from a public repository or build the JAR yourself from the source code. My preference is to search for the required package on https://mvnrepository.com and then download the prebuilt JAR file. However, the library may have additional dependencies that are not included. To resolve missing dependencies you have to download those JARs and add them to your workspace as well. The video I posted talks a bit more about how to find the right version. The most important consideration is to find a recent version where the Scala version is the same as your pool's and, for an external Spark library, the Spark version matches too.
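Scala libraries published to Maven repositories usually carry the Scala binary version as a suffix in the file name, for example spark-sql-kafka-0-10_2.12-3.1.2.jar. The small helper below is a sketch of how you could check a downloaded file name against your pool before uploading it. The object and method names are hypothetical, and it assumes the file follows the usual artifact_scalaVersion-version.jar naming convention.

```scala
// A hypothetical helper, not part of Spark or Synapse.
object JarNameCheck {
  // Matches names like "spark-sql-kafka-0-10_2.12-3.1.2.jar"
  private val Pattern = """(.+)_(\d+\.\d+)-(.+)\.jar""".r

  final case class JarInfo(artifact: String, scalaBinary: String, version: String)

  // Returns None when the name has no Scala suffix (typical for pure Java libraries).
  def parse(fileName: String): Option[JarInfo] = fileName match {
    case Pattern(artifact, scalaBinary, version) => Some(JarInfo(artifact, scalaBinary, version))
    case _                                       => None
  }

  // Only meaningful for external Spark libraries, whose version tracks the Spark version.
  def matchesPool(fileName: String, poolScala: String, poolSpark: String): Boolean =
    parse(fileName).exists { jar =>
      jar.scalaBinary == poolScala &&
        (jar.version == poolSpark || jar.version.startsWith(poolSpark + "."))
    }
}
```

For a Spark 3.1 pool you could call JarNameCheck.matchesPool("spark-sql-kafka-0-10_2.12-3.1.2.jar", "2.12", "3.1"). A pure Java JAR such as kafka-clients has no Scala suffix, so parse returns None for it, which simply means there is no Scala version to check.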
Adding packages to a pool
To add packages, navigate to the Manage Hub in Azure Synapse Studio, then select Workspace packages. In the Workspace packages section, select Upload to add files from your computer. You can add JAR files or Python wheel files.
Next, select Apache Spark pools, which pulls up a list of pools to manage. Find the pool, then select Packages from the action menu.
In the packages pane, add JAR or WHL files by choosing + Select from workspace packages. If the pool is in use, check the box under Force new settings to restart and make new libraries available for your next Spark session. Save changes by selecting Apply.
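Once the new settings have been applied and a new Spark session has started, one quick way to confirm that a JAR was picked up is to try loading a class from it in a Scala notebook cell. This is only a sketch: the helper name is hypothetical, and the class name in the usage note assumes you installed the Spark Kafka connector.

```scala
import scala.util.Try

// Hypothetical helper: returns true if the named class can be found
// by the current thread's class loader, without initializing the class.
def isOnClasspath(className: String): Boolean =
  Try(Class.forName(className, false, Thread.currentThread().getContextClassLoader)).isSuccess
```

For the Kafka example, isOnClasspath("org.apache.spark.sql.kafka010.KafkaSourceProvider") should return true once the connector JAR is available to the session. If it returns false, check that the package was applied to the pool and that the session was restarted.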
This approach to managing packages is what has worked best for me so far. Hopefully this has simplified the basics of managing Scala/Java libraries on Synapse, but for more information and options you can see the official documentation.