Snowflake on Azure – Load with Synapse Pipeline

If you choose to use Snowflake along with Azure for your data platform, you will have to make choices about how to load the data. Landing processed data into your data lake on Azure Data Lake Storage Gen2 (ADLS) is the first step I recommend in most environments. I like this pattern because your data is then available to any query tool or Azure service directly from ADLS. It keeps you from relying on Snowflake for every read, which can save cost and lets data scientists choose what works best for them.

In this post I will show how to treat ADLS like any other source so that you see the full set of steps. In Synapse Pipelines or Azure Data Factory, the Snowflake sink uses Snowflake's COPY command under the covers, which often means the data must first be staged in Blob Storage using a SAS credential. I address that a bit more as you read through the steps, but the main takeaway is that if you already have your data in ADLS or Blob Storage, you should set up the source so that the COPY command is used directly rather than adding an extra step to stage the data before the load.
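To make that concrete, the statement the Snowflake sink ends up running is roughly a COPY INTO from an external Azure location. The table, storage account, path, and SAS token below are placeholders, and the exact options the connector generates may differ; this is only a sketch of the pattern.

```sql
-- Rough sketch of the kind of COPY the Snowflake sink runs against staged data.
-- Table, storage account, path, and SAS token are all hypothetical placeholders.
COPY INTO analytics.public.sales_orders
FROM 'azure://mystagingaccount.blob.core.windows.net/staging/sales_orders/'
CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```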

Steps to copy to Snowflake from any source

1. Browse to the Integrate hub on the left menu in Synapse Studio. Select the plus sign and choose Copy Data tool.

2. Under the Connection menu, select an existing connection or choose + New connection. I will use an Azure Data Lake Storage Gen2 linked service, which simulates moving data from a data lake into Snowflake (but read closely for steps that make it more efficient when the data is already in ADLS). If you prefer not to store the data anywhere besides Snowflake, you could copy from the actual source, for example connecting directly to SQL Server if that is where the data originates. It will still require staging the data in Azure Blob Storage along the way, but that is a seamless step once configured.

3. Inside the File or folder section, select Browse and choose a folder that represents a table in your data lake. I will choose a partitioned folder, which adds a few extra steps, but I think it is the most realistic example.

4. If your folder is partitioned, check the box for Enable partition discovery and populate the Partition root path cell. Select Next.

5. Define the File format settings. For Parquet, set File format = Parquet format. Compression type will typically stay as snappy unless you used a custom compression when saving the files. You could just as easily use a DelimitedText format for CSV and similar data, but you will need to know details about your files such as the delimiter and whether the first row is a header.

6. Next you will set the Destination data store to a Snowflake table. Select + New connection, then find and select Snowflake (search is the quickest way to find it). NOTE: I recommend having your table created before trying to set up the linked service, and make sure you know the ROLE that owns the table (a simple example follows this step).
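If you do not have a table yet, something simple like the following works for testing. The database, schema, table, and columns here are hypothetical; match them to your own data.

```sql
-- Hypothetical target table; create it while connected as the role that
-- should own it, so the linked service role matches the owner.
CREATE TABLE analytics.public.sales_orders (
    order_id     NUMBER,
    customer_id  NUMBER,
    order_date   DATE,
    total_amount NUMBER(12,2)
);
```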

7. Set the Account name with your information, which can be found in your Snowflake URL when viewing the classic console or logging in. It will look similar to this: bg77599.west-us-2.azure


8. Next, provide your User name, Password, Database, Warehouse, and Role. The role is very important because it must be the owner of the object. I will choose SYSADMIN, but a more specific role is preferred for production databases (see the ownership check after this step). As always, using Azure Key Vault is a good idea for production scenarios.
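If you are not sure which role owns the table, you can check (and, if needed, change) it in Snowflake before configuring the linked service. The object names below are the same hypothetical ones used above; adjust them to your environment.

```sql
-- Check the current owner of the target table.
SHOW GRANTS ON TABLE analytics.public.sales_orders;

-- If necessary, move ownership to the role the linked service will use.
GRANT OWNERSHIP ON TABLE analytics.public.sales_orders TO ROLE SYSADMIN
  COPY CURRENT GRANTS;
```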

9. Be sure to enable staging in the Settings of the Copy Task if you have not configured the source as Blob Storage using a SAS credential.

10. Commit and select Debug to test out your copy activity. As the run progresses, you can see what is happening by selecting the glasses icon in the output.

The initial view, while it is staging the data, will not show any movement to Snowflake. The data is fully staged before the COPY into Snowflake begins.

After some time it will complete staging and then load to Snowflake. If the step that writes to Snowflake is taking too long, you can scale up to a larger warehouse size in your Snowflake account. If the step that stages the data is taking too long, use the typical Synapse Pipeline options to scale up parallelism and processing power (see the Settings tab of the copy activity).
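Scaling the warehouse is a one-line change in Snowflake; the warehouse name below is a placeholder for whichever warehouse your linked service uses.

```sql
-- Bump the warehouse used by the copy activity up a size if the Snowflake
-- write step is the bottleneck; scale it back down when the load finishes.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE';
```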

If you are loading from SQL Server or something other than cloud storage, this extra staging is a fine way to go, but keep in mind that the staged data is not easily searchable or usable in Azure. If you already have the data in Azure Storage, see the steps below for a better option that does not require staging.

Example of the staging data automatically created by the Synapse Pipelines copy activity.

Steps to copy to Snowflake from ADLS or Blob Storage without staging

1. Browse to the Integrate hub on the left menu in Synapse Studio. Select the plus sign and choose Copy Data tool.

2. Under the Connection menu, select an existing connection or choose + New connection. Let's create an Azure Blob Storage linked service with a SAS credential so that the Snowflake sink can read data directly from the source storage account.

From the Azure documentation: "The source linked service is Azure Blob storage with shared access signature authentication. If you want to directly copy data from Azure Data Lake Storage Gen2 in the following supported format, you can create an Azure Blob linked service with SAS authentication against your ADLS Gen2 account, to avoid using staged copy to Snowflake."

Source: https://docs.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=synapse-analytics#direct-copy-to-snowflake
Screenshot of generating SAS credential with read access for storage account.
Screenshot of adding linked service in Synapse Pipelines.

3. Inside the File or folder section, select Browse and choose a folder that represents a table in your data lake. If you try to use partition discovery again, you will get an error.

Instead, set up the file path but skip partition discovery.

4. Set the format to Parquet with snappy compression (or choose the settings that match your own files).

5. Set the correct path wildcards so that only valid Parquet data within the subdirectories is read (for example, a wildcard file name of *.parquet).

6. Select the Snowflake linked service (or see steps 6-8 in the example above to create one). The table you load to must have a single column of type VARIANT when loading Parquet data this way, which means you may still have additional steps in Snowflake to flatten the data out into an easy-to-use table (see the sketch after this step).
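As a rough sketch of what that looks like (reusing the same hypothetical names as earlier), the landing table is a single VARIANT column and a follow-up query projects it into typed columns:

```sql
-- Single VARIANT column landing table for Parquet loaded via direct copy.
CREATE TABLE analytics.public.sales_orders_raw (src VARIANT);

-- After the load, project the VARIANT data into typed columns
-- (for example through a view or a CTAS into a clean table).
SELECT
    src:order_id::NUMBER           AS order_id,
    src:customer_id::NUMBER        AS customer_id,
    src:order_date::DATE           AS order_date,
    src:total_amount::NUMBER(12,2) AS total_amount
FROM analytics.public.sales_orders_raw;
```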

7. Finish and run Debug. Watch the copy activity as it runs and you will see the data going directly from Azure Storage to Snowflake.

Final thoughts

These instructions are a solid starting point for using Synapse Pipelines or Azure Data Factory with Snowflake. They aren’t exactly how I would do it for production scenarios though, so stay tuned for future posts on this topic.
