Azure | DUSTIN VANNOY

Databricks Asset Bundles: Advanced Examples

By Dustin Vannoy Jun 25, 2024 / 2 Comments

This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more of an introduction please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first to update with new examples, however I… Continue Reading

Azure Databricks with Log Analytics – Updated for DBR 11.3+

By Dustin Vannoy Jan 7, 2024 / 1 Comment

This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the past, but these instructions are valid for Databricks Runtimes 11.3+. Log Analytics provides a way to collect and query logs in Azure. For teams that… Continue Reading

Incremental Data Loading with Azure Databricks

By Dustin Vannoy Nov 15, 2023 / Leave a comment

My talk for PASS Summit 2023 is about how to load data incrementally, such as from Change Data Feed or streaming a log of events. Below are some additional thoughts and links to resources for easy reference. Presentation description: There has been an increasing push to load data incrementally throughout the day or even within… Continue Reading

Azure Data Platform Overview slides

By Dustin Vannoy Aug 18, 2023 / Leave a comment

I had the privilege to present for Creating Coding Careers, a great organization in the San Diego area that helps people get established in tech careers via apprenticeships and other programs. Above are the slides used in that presentation. Recommended Resources to learn Azure Data Platform Databricks Training https://www.databricks.com/learn Microsoft Learn Training https://learn.microsoft.com/en-us/training/paths/data-engineer-azure-databricks/ https://learn.microsoft.com/en-us/training/paths/get-started-data-engineering/ https://learn.microsoft.com/en-us/training/paths/get-started-fabric/… Continue Reading

Snowflake on Azure – Load with Synapse Pipeline

By Dustin Vannoy Jun 30, 2022 / Leave a comment

If you choose to use Snowflake along with Azure for your data platform, you will have to make choices on how to load the data. Landing processed data into your data lake on Azure Data Lake Storage Gen2 (ADLS) is the first step that I recommend in most environments. I like this pattern because then… Continue Reading

Snowflake on Azure – Create External Stage

By Dustin Vannoy May 31, 2022 / 3 Comments

Snowflake, like similar analytic databases, has a fast way to load data from files. The COPY command can quickly read files and append the records to a table. It does this by reading from an external stage which points to a cloud storage location. This currently supports Azure Storage, Amazon S3, and Google Cloud Storage.… Continue Reading

Ingest tables in parallel with an Apache Spark notebook using multithreading

By Dustin Vannoy May 6, 2022 / 2 Comments

If we want to kick off a single Apache Spark notebook to process a list of tables, the code is easy to write. However, a simple loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load the tables concurrently (in parallel) using threads. There are a few different ways to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
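As a rough illustration of the idea, here is a minimal sketch using Python's standard `concurrent.futures` thread pool. It is not necessarily the exact code from the full post. The `load_table` function, the table names, and the `max_workers` value are hypothetical placeholders; in a real notebook the body of `load_table` would hold the Spark read and write for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name: str) -> str:
    """Hypothetical per-table load step (an assumption, not from the post).

    In a Spark notebook, the read and write for a single table would go here.
    This stand-in only returns a status string so the sketch runs anywhere.
    """
    return f"loaded {table_name}"


def load_tables_concurrently(table_names, max_workers=4):
    """Run load_table for each table on a pool of threads.

    Returns two dicts: table name -> result, and table name -> exception
    for any table whose load raised an error.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per table and remember which future maps to which table.
        futures = {pool.submit(load_table, name): name for name in table_names}
        # Collect results as each load finishes, in completion order.
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed table does not stop the others.
                errors[name] = exc
    return results, errors


# Hypothetical table list for illustration only.
tables = ["customers", "orders", "products"]
results, errors = load_tables_concurrently(tables, max_workers=3)
```

Threads fit this case because each one spends most of its time waiting on the Spark cluster rather than doing work on the driver. Collecting errors per table means a single failure can be reported or retried without discarding the loads that succeeded.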

Azure Synapse Analytics Kickstart

By Dustin Vannoy Apr 13, 2022 / Leave a comment

In this post I introduce some of the core capabilities of Azure Synapse Analytics and when they are used. I present from the perspective of a data engineer, but it should be easy to translate what is most useful for analysts and data scientists as well. Please continue reading for a quick walkthrough of the capabilities and… Continue Reading

Azure Synapse CI/CD

By Dustin Vannoy Apr 6, 2022 / Leave a comment

For production use of Azure Synapse there are benefits to implementing Continuous Integration (CI) and Continuous Deployment (CD). Implementing CI/CD includes the need to deploy the Azure infrastructure in an automated way. In this post, I share things I learned that may be helpful for you. I also have a few links to other content that helped me get an environment set up.

Azure Synapse Spark: External Python Packages

By Dustin Vannoy Jan 20, 2022 / 1 Comment

When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure Synapse serverless Apache Spark pools. For Python code the libraries are packaged as wheel (.whl) files. You can also install Python packages that are available… Continue Reading

Tag: Azure
