When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure Synapse serverless Apache Spark pools. For Python code the libraries are packaged as wheel (.whl) files. You can also install Python packages that are available publicly on the Python Package Index (PyPI).
Installing Python packages from PyPI, Conda, etc
Azure Synapse Spark pools come with all the Anaconda libraries installed, but you will likely find a need for additional libraries or different versions of libraries. This is fairly simple when the required libraries exist in a repository like Conda or PyPI (Python Package Index). In this case you can add a file that specifies the package names and versions to install. This can be a requirements.txt file or an environment.yml file. I recommend using whichever file type you have the most experience with. Many Python developers have used the requirements.txt format, but if you want to specify a particular channel to get libraries from, the environment.yml file is needed. According to the documentation you can use both requirements.txt and environment.yml together.
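To make the two file formats concrete, here is a minimal sketch that renders a pinned requirements.txt and a small environment.yml as text. The package names, versions, channel, and environment name are illustrative assumptions on my part, not recommendations, and the helper functions are hypothetical. Check the official documentation for the exact fields your pool accepts.

```python
# Hypothetical helpers that render the two file formats as plain text.
# All package names, versions and the channel below are placeholders.

def requirements_text(pins):
    """Render one 'name==version' line per package, sorted by name."""
    lines = [f"{name}=={version}" for name, version in sorted(pins.items())]
    return "\n".join(lines) + "\n"


def environment_text(env_name, channels, conda_pkgs, pip_pins):
    """Render a minimal Conda environment file with an optional pip section."""
    lines = [f"name: {env_name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {pkg}" for pkg in conda_pkgs]  # Conda pins use a single '='
    if pip_pins:
        lines.append("  - pip:")
        lines += [f"    - {n}=={v}" for n, v in sorted(pip_pins.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(requirements_text({"seaborn": "0.11.2"}))
    print(environment_text("synapse-example", ["conda-forge"],
                           ["numpy=1.19.4"], {"seaborn": "0.11.2"}))
```

Save the printed text as requirements.txt or environment.yml before uploading it. In practice most people write these files by hand. The point here is only to show the shape of each one.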
In Azure Synapse Studio, browse to a specific Apache Spark pool and select Packages.
Upload the file from the Requirements files section.
The packages will be installed the next time a session starts up.
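Once a new session has started, it can be worth confirming from a notebook cell that the versions you asked for are the ones that were installed. The sketch below uses the standard library's importlib.metadata module, which I am assuming is available (it ships with Python 3.8 and later, so it will not work as written on a Python 3.6 pool). The function names and the example pin are hypothetical.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected):
    """Return {name: (expected, installed)} for every pin that is not met exactly."""
    actual = installed_versions(expected)
    return {name: (version, actual[name])
            for name, version in expected.items()
            if actual[name] != version}


if __name__ == "__main__":
    # Hypothetical pin; use the same values as your requirements file.
    print(mismatches({"seaborn": "0.11.2"}))
```

An empty dictionary means every pinned package was found at the expected version.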
Less common is to add the libraries only to your current session. These will only be available for this session so it’s best suited for testing out new libraries.
To do that, open Settings from your PySpark notebook.
Then choose Packages and upload a file.
For session packages you need to use a YML file to define the libraries and related settings.
Installing WHL files (for Python packages)
Another option to install Python libraries is to add individual WHL files that will be installed when the Spark pool starts. This method is required if the code is not available in a repository that your Spark pool can reach. When working with a small number of packages this is an easy way to get the exact version set up, so you may prefer it even if the library is available in a repository.
A note about the “Storage Account” method
There is a way to add WHL files by putting them in a specific folder on the primary ADLS account for your Synapse workspace. The documentation clearly says this is not supported for Apache Spark 3.0, so I would avoid using it for new development. At some point you will likely want to upgrade to Spark 3.0 or beyond, so it’s better not to depend on a method that may break when you upgrade.
Building or downloading WHL files
When creating custom Python libraries be sure that the Python version matches what your Spark pool has installed. Currently for a Spark 3.1 pool you should use Python 3.8 and for a Spark 2.4 pool use Python 3.6. To see more details of what is installed on your pool you can check out the runtime documentation pages for Spark 3.1 and Spark 2.4. We won’t cover building WHL files in this post, but a common way can be found at the Python Packaging Authority.
For open source libraries you may download the correct WHL file from a repository or, if needed, build it from the source code.
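A wheel's filename already tells you which Python version it targets, so you can do a quick sanity check before uploading. The sketch below splits a filename into its parts, following the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag). It then does a deliberately rough comparison against a Python version. The function names and sample filenames are hypothetical, and the check ignores the ABI and platform tags, so treat a positive result as a hint only. The packaging library has fuller tag logic if you need it.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"name": parts[0], "version": parts[1],
            "python": python_tag, "abi": abi_tag, "platform": platform_tag}


def looks_compatible(filename, major, minor):
    """Rough check: does the Python tag mention this interpreter version?"""
    tags = wheel_tags(filename)["python"].split(".")  # e.g. 'py2.py3'
    wanted = {f"py{major}", f"py{major}{minor}", f"cp{major}{minor}"}
    return any(tag in wanted for tag in tags)


if __name__ == "__main__":
    # In a notebook, compare against the interpreter the pool is running.
    print(looks_compatible("mylib-1.0.0-py3-none-any.whl", *sys.version_info[:2]))
```

For a Spark 3.1 pool, for example, you would expect a pure-Python wheel tagged py3 or a compiled wheel tagged cp38 to pass this check.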
Adding packages to a pool
To add packages, navigate to the Manage Hub in Azure Synapse Studio. Then select Workspace packages. In the Workspace packages section, select Upload to add files from your computer. You can add JAR files or Python wheel files.
Next, select Apache Spark pools, which pulls up a list of pools to manage. Find the pool, then select Packages from the action menu.
In the packages pane, add WHL files by choosing + Select from workspace packages. If the pool is in use, check the box under Force new settings to restart and make new libraries available for your next Spark session. Save changes by selecting Apply.
Guidance and references
When adding Python libraries that are publicly available, I prefer to use pool packages via a requirements.txt file or environment.yml file. For custom packages, you could publish them to a private channel and make that available, but it will likely be easiest to just add them as workspace packages. Using session packages or adding libraries directly to the synapse folders in ADLS are other options, but they are more complicated to use long term.
The official documentation covers more caveats and options so please consult that to go beyond simple use cases. Please leave comments if you find undocumented issues or need help understanding an error message. Most errors I saw were about having the wrong Python version or wrong package version. It wasn’t always clear what the problem was, but testing different versions was helpful. You can also specify a version for each library in the requirements.txt or environment.yml file.