The COPY command is a fast way to read files into Snowflake. It reads files matching a pattern and appends the records to a table. It does this by reading from an external stage, which points to a cloud storage location; external stages currently support Azure Storage, Amazon S3, and Google Cloud Storage. Alternatively, you can set up an internal stage, which is a file location within Snowflake, and PUT files there via an API call.
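As a preview of the general shape, a COPY INTO statement pairs a target table with a stage, an optional file name pattern, and a file format; the names below are placeholders rather than objects used later in this tutorial:
copy into my_table
from @my_stage
pattern = '.*\.csv'
file_format = (type = csv);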
This tutorial may be helpful even if Azure is not your source, though the stage creation steps are specific to Azure. We cover some basic but realistic examples of loading from CSV and Parquet files. The source data is in partitioned folders following a pattern of puYear=#### and puMonth=##, but we do not use the partition columns until the last example.
Prerequisites
To follow along you will need a storage integration defined that points to a cloud storage environment. For my scenario I created an integration to Azure Storage; the steps to set that up are covered in my previous post Snowflake on Azure – Create External Stage.
If you do not have a Snowflake account yet, you can create a trial account if you have not registered your email in the past.
The sample data I will use is in my own Azure Storage account but was copied from the NYC Taxi & Limousine Commission – yellow taxi trip sample dataset available in Azure Synapse. To add this dataset to your workspace, go to Data Hub -> + (Add new resource) -> Browse Gallery. Then run the sample notebook, which copies the Parquet files and creates CSV versions in your default storage location (which you will then use in the steps to Create External Stage). Otherwise, please use your own dataset and modify the commands so that they work for you.
All the commands will be run from a worksheet in Snowflake. If you choose to use SnowSQL or some other method, the syntax may vary slightly. The first step is to create an EXTERNAL STAGE if you do not already have one. If you are using a stage that already has a file format specified, you will not need to include the file format in the COPY INTO commands.
Create External Stage
1. Use the SYSADMIN role, or another role with appropriate permissions, to create a stage. In this example I will not specify the file format on the stage. You can add a file format if you will only read files of a certain type; for example, if only reading Parquet files you could add file_format = (type = parquet), as shown in the variation after the stage definition below.
CREATE OR REPLACE STAGE nyctaxi_yellow_raw
storage_integration = azure_int
url = 'azure://<storage-account-name>.blob.core.windows.net/<container-name>/nyctaxi_raw';
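If you did want to lock the stage to a single format, the same definition could include the file format:
CREATE OR REPLACE STAGE nyctaxi_yellow_raw
storage_integration = azure_int
url = 'azure://<storage-account-name>.blob.core.windows.net/<container-name>/nyctaxi_raw'
file_format = (type = parquet);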
2. Run the LIST command to confirm the stage was created and to view the files.
list @nyctaxi_yellow_raw

COPY INTO from CSV
Often our source data is in CSV (Comma Separated Values) format since it is very easy to export data to delimited files. Note that with Snowflake, as in many systems, the CSV format is used even if the delimiter is something other than a comma.
Another reason you may choose to load from CSV is that the COPY INTO command is typically fastest when loading compressed CSV files rather than semi-structured formats like Parquet, JSON, and ORC.
1. Create the destination table in Snowflake.
CREATE OR REPLACE TABLE trips_yellow_csv
(
vendorID string,
tpepPickupDateTime timestamp,
tpepDropoffDateTime timestamp,
passengerCount int,
tripDistance double,
puLocationId int,
doLocationId int,
startLon double,
startLat double,
endLon double,
endLat double,
rateCodeId int,
storeAndFwdFlag string,
paymentType double,
fareAmount double,
extra double,
mtaTax double,
improvementSurcharge string,
tipAmount double,
tollsAmount double,
totalAmount double
);
2. Create a file format. We create a named format because there are many options related to CSV files and the defaults are unlikely to be sufficient. The statement below uses only a subset of the options available for the CSV file format.
create or replace file format nyctaxi_csv
type = csv field_delimiter=',' record_delimiter='\n'
field_optionally_enclosed_by='"' compression='auto'
skip_header=0 null_if=('')
date_format='auto' timestamp_format='auto';
3. Run the COPY INTO statement.
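A sketch of that statement, assuming the CSV files sit under a csv/yellow folder structure that mirrors the Parquet layout used later (adjust the pattern to match your own paths):
copy into trips_yellow_csv
from @nyctaxi_yellow_raw
file_format = (format_name = 'nyctaxi_csv')
pattern = '.*/csv/yellow/puYear=2019/.*/.*\.csv'
on_error = SKIP_FILE;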

In my test run using an X-SMALL data warehouse it took 58 seconds to load the 37 files. When I tested with a LARGE data warehouse, it took 28 seconds to load all 37 files. I expected CSV to be faster than Parquet when using COPY INTO, especially since there were fewer files, but if you keep reading you will see that CSV was faster with the X-SMALL warehouse and not with the LARGE one. However, this isn't exactly an apples-to-apples comparison, so you may see different results for different workloads.
If you run the same command again it will load 0 files, because Snowflake tracks which files it has already loaded. If you want to truncate and reload the data, you can recreate the table. If you want to append the data regardless of previous loads, you can set the option FORCE = TRUE or modify the files so their checksums change.
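For example, a forced reload reusing the sketch above could look like:
copy into trips_yellow_csv
from @nyctaxi_yellow_raw
file_format = (format_name = 'nyctaxi_csv')
pattern = '.*/csv/yellow/puYear=2019/.*/.*\.csv'
force = TRUE;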
4. Run a sample query to visually inspect the data loaded into the table.
select top 20 * from trips_yellow_csv;

COPY INTO from Parquet
With the Parquet format you have some options for how to initially load the data. You could create a table with a single column of data type VARIANT and load each record into that column. When querying the table you would then reference the fields you need from the VARIANT column. Likely you would have a second step that parses each key field from the VARIANT column, does any other required transformations, then writes to a final table used for reports and analysis. The other option, which I prefer, is to parse the fields as you load with the COPY INTO statement. Specifying the fields and data types during the load is what I show in the following steps.
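For comparison, a minimal sketch of the single VARIANT column option (with a hypothetical table name) might look like this:
CREATE OR REPLACE TABLE trips_yellow_variant (v variant);

copy into trips_yellow_variant
from @nyctaxi_yellow_raw
file_format = (TYPE = PARQUET)
pattern = '.*/parquet/yellow/puYear=2019/.*/.*\.parquet';

-- fields are then referenced directly from the VARIANT column
select v:vendorID::string, v:tipAmount::double
from trips_yellow_variant;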
I expect that the more of the processing you do within Snowflake, the more you will pay on your Snowflake bill, but it is worth testing with your own data if you will be running the load process frequently.
1. Create the destination table in Snowflake.
CREATE OR REPLACE TABLE trips_yellow_parquet
(
vendorID string,
tpepPickupDateTime timestamp,
tpepDropoffDateTime timestamp,
passengerCount int,
tripDistance double,
puLocationId int,
doLocationId int,
startLon double,
startLat double,
endLon double,
endLat double,
rateCodeId int,
storeAndFwdFlag string,
paymentType double,
fareAmount double,
extra double,
mtaTax double,
improvementSurcharge string,
tipAmount double,
tollsAmount double,
totalAmount double
);
2. Run the COPY INTO command with a SELECT clause that maps the columns in the Parquet files to the destination columns.
copy into trips_yellow_parquet
from (select
$1:vendorID::string,
$1:tpepPickupDateTime::timestamp,
$1:tpepDropoffDateTime::timestamp,
$1:passengerCount::int,
$1:tripDistance::double,
$1:puLocationId::int,
$1:doLocationId::int,
$1:startLon::double,
$1:startLat::double,
$1:endLon::double,
$1:endLat::double,
$1:rateCodeId::int,
$1:storeAndFwdFlag::varchar,
$1:paymentType::int,
$1:fareAmount::double,
$1:extra::double,
$1:mtaTax::double,
$1:improvementSurcharge::string,
$1:tipAmount::double,
$1:tollsAmount::double,
$1:totalAmount::float
from @nyctaxi_yellow_raw)
file_format = (TYPE= PARQUET)
pattern='.*/parquet/yellow/puYear=2019/.*/.*\.parquet'
on_error = SKIP_FILE;

In my test run using an X-SMALL data warehouse it took 1 minute 35 seconds. When I tested with a LARGE data warehouse, it took 17 seconds to load all 191 files.
As with the CSV load, running the same command again will load 0 files, because Snowflake tracks which files it has already loaded; recreate the table to truncate and reload, or use FORCE = TRUE to append regardless of previous loads.
3. Run a sample query to visually inspect the data loaded into the table.
select top 20 * from trips_yellow_parquet;

Add Parquet partition columns
When working with Parquet, the partition fields may exist only in the file path and need to be extracted during the load. Snowflake does not use user-defined partitioning, but it does allow you to add clustering keys when appropriate. In this case I am going to assume the data is appended one month at a time and never changed, so the overhead of adding a clustering key is not worth it. However, if my source path has partitions, I may still need to store those values for reference.
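For reference, if clustering were worthwhile for your workload, a key could be added once the table exists; a hypothetical example using the partition columns we add below:
alter table trips_yellow_parquet cluster by (puYear, puMonth);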
To make this practical, we want to add the puYear and puMonth values that are part of the source path as fields in the Snowflake table.
First we should recreate the table with the added columns.
CREATE OR REPLACE TABLE trips_yellow_parquet
(
puYear string,
puMonth string,
vendorID string,
tpepPickupDateTime timestamp,
tpepDropoffDateTime timestamp,
passengerCount int,
tripDistance double,
puLocationId int,
doLocationId int,
startLon double,
startLat double,
endLon double,
endLat double,
rateCodeId int,
storeAndFwdFlag string,
paymentType double,
fareAmount double,
extra double,
mtaTax double,
improvementSurcharge string,
tipAmount double,
tollsAmount double,
totalAmount double
);
Then we add two calculated columns which will use the filename and regular expressions to extract the values needed.
regexp_replace(METADATA$FILENAME,
'.*\=(.*)\\/.*=(.*)\\/.*', '\\1') as puYear,
regexp_replace(METADATA$FILENAME, '.*\=(.*)\\/.*=(.*)\\/.*',
'\\2') as puMonth,
The regexp_replace function pulls out the parts of a string that match a specified pattern. The first parameter is the string to parse, so we pass in METADATA$FILENAME, a built-in field containing the full file path. The second parameter is the regular expression: it captures the value between the first equals sign and the following / as group 1, and the value between the second equals sign and the following / as group 2. The third parameter is the replacement; because the pattern matches the entire filename, the result is just the group we reference. Year is group 1, specified by the string ‘\\1’ in the worksheet; month is group 2, so we use ‘\\2’.
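A quick way to sanity check the expression is to run it against a literal path shaped like the source files (the file name below is made up):
select regexp_replace('parquet/yellow/puYear=2019/puMonth=1/part-00000.parquet',
'.*\=(.*)\\/.*=(.*)\\/.*', '\\1') as puYear,
regexp_replace('parquet/yellow/puYear=2019/puMonth=1/part-00000.parquet',
'.*\=(.*)\\/.*=(.*)\\/.*', '\\2') as puMonth;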
Below is what the full copy statement looks like with these added columns.
copy into trips_yellow_parquet
from (select
regexp_replace(METADATA$FILENAME,
'.*\=(.*)\\/.*=(.*)\\/.*', '\\1') as puYear,
regexp_replace(METADATA$FILENAME, '.*\=(.*)\\/.*=(.*)\\/.*',
'\\2') as puMonth,
$1:vendorID::string,
$1:tpepPickupDateTime::timestamp,
$1:tpepDropoffDateTime::timestamp,
$1:passengerCount::int,
$1:tripDistance::double,
$1:puLocationId::int,
$1:doLocationId::int,
$1:startLon::double,
$1:startLat::double,
$1:endLon::double,
$1:endLat::double,
$1:rateCodeId::int,
$1:storeAndFwdFlag::varchar,
$1:paymentType::int,
$1:fareAmount::double,
$1:extra::double,
$1:mtaTax::double,
$1:improvementSurcharge::string,
$1:tipAmount::double,
$1:tollsAmount::double,
$1:totalAmount::float
from @nyctaxi_yellow_raw)
file_format = (TYPE= PARQUET)
pattern='.*/parquet/yellow/puYear=2019/.*/.*\.parquet'
on_error = SKIP_FILE;
And now our results contain these 2 new fields.
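To verify, you can query the new columns, for example:
select puYear, puMonth, count(*) as trip_count
from trips_yellow_parquet
group by puYear, puMonth
order by puYear, puMonth;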

Wrap it up
This tutorial covered some basic but realistic examples of loading from CSV or Parquet files. In all cases the source data was partitioned into folders following a pattern of puYear=#### and puMonth=##. The last example covered how to add the value for each partition as a column in the table.
An amazing reference that was helpful in understanding load speeds and working with partition columns is this Snowflake blog post on Speeds, Feeds, and Techniques.
One parting note is that you can use some schema inference functions to help in creating the load scripts for Parquet data, but that is a topic for another time.