Azure Synapse CI/CD

Azure Synapse Analytics is a powerful set of capabilities for building data lakes and data warehouses within Azure. For production uses of Azure Synapse, there are benefits to implementing Continuous Integration (CI) and Continuous Deployment (CD). Implementing CI/CD includes deploying the Azure infrastructure in an automated way. Ideally, once configured, a deployment to a new environment does not require any manual steps.

When I was asked to work with a Site Reliability Engineer on the team to set up CI/CD for Azure Synapse, I found some good resources for how to set it up, but I still had some questions and challenges. I expect the standard way of setting up CI/CD in Azure Synapse will evolve and some of the confusing parts may become clearer in the future, but for now I expect the process will be a challenge for others. In this post, I share things I learned that may be helpful for you. I also include a few links to other content that helped me get an environment set up (though I may try to create a condensed tutorial in the future if I’m feeling bold).

Overview

Environments

This will be brief since the resources included have more explanation about why this all matters. The first thing to understand is that the normal pattern is to have a Development workspace, a Test/UAT workspace, and a Production workspace. Each of these environments has its own Azure storage, key vault, and databases (if needed). Most often I see these within different Azure subscriptions so that there is a low risk of any communication or data transfer between these environments. The development team would have fewer permissions in production than they have in development. The data in development is scrubbed, anonymized, or synthetic (fake) data. This isn’t always the case, since realistic data leads to better development of data systems, but the risk of keeping any sensitive information in lower environments is high.

Build / Release Pipelines

The deployment of each Synapse workspace is separate from the deployment of the scripts, notebooks, pipelines, and related dependencies. The first deploy pipeline I will refer to as the Infrastructure deploy and the second will be called the Workspace deploy.

The Infrastructure deploy will mostly be an ARM template that creates the workspace, spark pools, private endpoints, storage account, key vault, and other optional resources. The best way to understand this is that once the infrastructure deploy has run we have a usable Synapse workspace but no artifacts or data.
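As a rough sketch, the Infrastructure deploy can be an Azure DevOps pipeline step that runs the ARM template. The service connection name, file paths, and resource group below are placeholder assumptions, not values from any real project:

```yaml
# Hedged sketch of an infrastructure deploy step.
# Service connection, file paths, and names are placeholders.
- task: AzureResourceManagerTemplateDeployment@3
  inputs:
    deploymentScope: 'Resource Group'
    azureResourceManagerConnection: 'my-azure-service-connection'   # placeholder
    subscriptionId: '$(subscriptionId)'
    resourceGroupName: 'my-synapse-rg-dev'                          # placeholder
    location: 'East US'
    templateLocation: 'Linked artifact'
    csmFile: 'infrastructure/synapse-workspace.json'                # ARM template: workspace, pools, storage, key vault
    csmParametersFile: 'infrastructure/parameters.dev.json'         # per-environment parameter file
    deploymentMode: 'Incremental'
```

Running the same step with a different parameters file per environment keeps the environments consistent.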

The Workspace deploy will also use a template but it is generated by Synapse automatically when you publish. Using this template in multiple environments requires some thought to make sure names are similar and the values that must be different between each environment are parameterized. Each of the Recommended Resources below covers this to an extent, but see the section on What I Learned for my attempt to clarify.
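One way to control what gets parameterized is a `template-parameters-definition.json` file in the root of the collaboration branch, the same custom-parameter mechanism Azure Data Factory uses. A minimal hedged sketch that parameterizes linked service URLs might look like this (treat the exact property paths as assumptions to verify against your generated template):

```json
{
  "Microsoft.Synapse/workspaces/linkedServices": {
    "*": {
      "properties": {
        "typeProperties": {
          "url": "=",
          "baseUrl": "="
        }
      }
    }
  }
}
```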

Code repository and publishing

A key concept for CI/CD to work is connecting a development Synapse workspace to version control. This connection to source control should be configured only on the Development workspace. We can choose GitHub or Azure DevOps Git. I have done this with Azure DevOps, but the documentation shows how to deploy from either using Azure DevOps Pipelines or GitHub Actions.

Each resource below has information on setting up the code repository, so I won’t repeat those steps. When we set it up, it was challenging if the user was not in the same Azure Active Directory tenant, but more options are available now which may improve that flow.

The basic ideas to understand with the code repository are:

  1. For any work, create a new branch (feature branch) from the main branch (or whatever you call the primary branch that you will publish from). If you are committing directly to main, Synapse will create many commits and it will be difficult to tell what has changed.
  2. Use a Pull Request process to merge into main. You can add rules around the branch to make sure developers have to go through a PR to get code into main. I recommend a Squash Merge to reduce all your commits on your feature branch into a single commit to main.
  3. Select Publish to deploy into the development environment and generate the code ready to deploy to other environments. The branch workspace_publish is used by default. This branch contains a JSON template with all the artifacts combined for deployment, rather than each artifact as a separate JSON file as in the main branch.
  4. The steps to deploy to other environments will rely on the workspace_publish branch and typically add some additional parameter files and deployment pipeline YAML files.
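The deployment step typically uses the Synapse workspace deployment task from the Azure DevOps Marketplace. A hedged sketch follows; the input names can vary by extension version, and all names and paths here are placeholders (the folder inside workspace_publish is named after the development workspace):

```yaml
# Hedged sketch of a workspace deploy step; verify input names against
# the version of the Synapse workspace deployment extension you install.
- task: Synapse workspace deployment@2
  inputs:
    operation: 'deploy'
    TemplateFile: '$(System.DefaultWorkingDirectory)/workspace_publish/my-synapse-dev/TemplateForWorkspace.json'
    ParametersFile: '$(System.DefaultWorkingDirectory)/workspace_publish/my-synapse-dev/TemplateParametersForWorkspace.json'
    azureSubscription: 'my-azure-service-connection'   # placeholder service connection
    ResourceGroupName: 'my-synapse-rg-uat'             # placeholder
    TargetWorkspaceName: 'my-synapse-uat'              # placeholder
```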

Recommended Resources

Craig Porteous – Adventures in CICD with Azure Synapse

A helpful video to understand the concepts and the implementation is the session Adventures in CICD with Azure Synapse by Craig Porteous. He explains the fundamentals around CI/CD and then walks through a lot of what it takes to set it up with Azure Synapse.

Arshad Ali – Azure Synapse Analytics CI/CD

Another video that is more focused on walking through each step needed. Some of the questions I had about how the template works and which values need replacing were answered in this video. If you have already worked through setting up the Git repository and the infrastructure deploy, start at minute 27 in the video to get a walkthrough of the publish template.

Bradley Ball – CI CD in Azure Synapse Analytics

If you like to read tutorials over watching videos, check out this series of posts by Bradley Ball which covers the information well. I recommend taking your time to walk through each step in order to replicate it. However, even without carefully following each part it helped me get started.

https://techcommunity.microsoft.com/t5/data-architecture-blog/ci-cd-in-azure-synapse-analytics-part-1/ba-p/1964172

What I Learned

Some resources in the template are skipped

The virtual network and spark pool resources in the workspace template are skipped when a deploy is run. This is a good thing, but we didn’t realize it at first. Once I started seeing what is skipped and what isn’t, I was able to develop strategies for managing dependencies. You want the name to be the same for Spark Pools, but the extra information in the resource object is not used, so it can continue to reference the development subscription. Linked Services are created, and it is fairly easy to override everything but the name, so choose names that can be the same in each environment.

Do not reference default linked services by name

The basic guidance here is that the default linked service is going to have a specific name that is not automatically parameterized. Each Synapse environment will create this by default with its own workspace name included. If you create a pipeline that uses that default ADLS linked service as a source or sink, when you deploy to UAT or Prod it will continue trying to use the dev linked service name. Instead, if needed, create a separate linked service with a name you can use in all environments.
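For example, instead of relying on the generated <workspace-name>-WorkspaceDefaultStorage service, you might define a linked service with an environment-neutral name whose URL can be overridden at deploy time. The name and URL below are purely illustrative:

```json
{
  "name": "ls_adls_data",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://mydevstorageaccount.dfs.core.windows.net"
    }
  }
}
```

Because the name stays the same in every environment, only the url property needs an override parameter per environment.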

Dedicated SQL pool is turned on during deploy

The dedicated SQL pool gets deployed and turned on if it is online in the development environment. So you likely want to script a step to take it offline after deploy, at least in test environments. Craig Porteous shows a snippet that does this in his video.
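One hedged way to script this is an Azure CLI step after the workspace deploy; the service connection, pool, workspace, and resource group names below are placeholders:

```yaml
# Hedged sketch: pause the dedicated SQL pool after deploy so it
# doesn't accrue cost in non-production environments.
- task: AzureCLI@2
  inputs:
    azureSubscription: 'my-azure-service-connection'   # placeholder
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      az synapse sql pool pause \
        --name mydedicatedpool \
        --workspace-name my-synapse-uat \
        --resource-group my-synapse-rg-uat
```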

Triggers are disabled during deploy

Triggers will automatically be stopped when deployed to each environment. The Synapse Workspace deploy task has a feature to set which triggers should be enabled, which may vary depending on the environment.

- task: toggle-triggers-dev@2
  inputs:
    azureSubscription: 'Microsoft Azure Standard(a11affbe-2256-409d-a682-c20a3963099a)'
    ResourceGroupName: 'datakickstart-rg-prod'
    WorkspaceName: 'datakickstart-synapse-prod'
    ToggleOn: true
    Triggers: '*'

Secured credentials require override parameters

Override parameters are required in the workspace deploy task for any secured credentials. If the value should be kept secret and out of your Git repository, use a variable in Azure DevOps. I typically have connections retrieve passwords or other secret values from Azure Key Vault so most of my parameters can be stored in a private Git repo.

OverrideArmParameters: '-workspaceName datakickstart-synapse-prod -datakickstart-synapse-uat-WorkspaceDefaultStorage_properties_typeProperties_url https://adlsdatakickstartprod.dfs.core.windows.net'
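For the secret values themselves, one approach is an Azure DevOps secret variable or a variable group (optionally linked to Key Vault), referenced with `$(variableName)` syntax in the override string. The group and variable names below are placeholder assumptions:

```yaml
variables:
- group: synapse-prod-secrets   # placeholder variable group; could be linked to Key Vault

# Then in the workspace deploy task, reference the secret variable:
#   OverrideArmParameters: '-sqlAdminPassword $(sqlAdminPassword)'
```

Secret variables are masked in pipeline logs, so the value never lands in the Git repository or the deploy output.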

Wish List

Parameterize names and subscriptions

The default linked services have specific names that can’t be changed and are not parameterized. It would be nice to be able to set linked service names and the resource information as parameterized so that there is more flexibility. It is difficult at first to manage all the names and know whether it’s important to replace a name in the template or ignore it, but more parameterization could help.

Customization for the VNet (or bring your own VNet)

Managed VNet is too managed. It does not provide a simple way to choose your IP range and whitelist resources. The workaround to securely connect to non-Azure resources from the managed VNet is complicated and not reasonable for many teams to configure. Azure Databricks has a Bring Your Own VNet approach, which is complicated but much more feasible to understand and set up.

Easier PR review experience

Reviewing PRs or commits to the publish branch is difficult because you are comparing JSON files rather than typical code files. If you want to see what has changed in the workspace_publish branch, you have one huge JSON file with many lines you would rather ignore when considering changes. I am not sure how all the information in the template is used, but when reviewing the template you have to skim past lines of JSON that specify “a365ComputeOptions” and “spark.autotune.trackingId” while trying to notice any source changes in the code cells.
