Data Engineer | DUSTIN VANNOY

Claude Code Essentials for Data Professionals

By Dustin Vannoy Jan 8, 2026 / Leave a comment

I believe AI coding is a big part of the future for data professionals—including data engineering, data science, and analytics engineering. This means that adopting AI for development will be critical for career success. Since the Cursor article and video, I’ve been digging into the AI coding space more and using Claude Code as well,… Continue Reading

Cursor with Databricks: AI Enhanced Development

By Dustin Vannoy Sep 29, 2025 / 1 Comment

The tech industry has evolved rapidly and AI coding tools are changing how we develop. For Databricks developers, tools like Cursor IDE offer significant productivity gains when used correctly. The difference between frustration and success comes down to providing the proper context. In this article and video, I share recommendations for using Cursor with Databricks.… Continue Reading

Essential Best Practices for Data Engineers on Databricks

By Dustin Vannoy Jan 5, 2025 / Leave a comment

Data engineers and scientists should apply software development best practices to enhance their processes, particularly on Databricks, which offers valuable integrations. Key focuses include version control, automated testing, and a structured development lifecycle. By adopting these practices, teams can improve quality and reliability in data projects while facilitating faster feature delivery.

Azure Databricks with Log Analytics – Updated for DBR 11.3+

By Dustin Vannoy Jan 7, 2024 / 1 Comment

This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the past, but these instructions are valid for Databricks Runtimes 11.3+. Log Analytics provides a way to collect and query logs in Azure. For teams that… Continue Reading

Databricks CI/CD: Intro to Asset Bundles (DABs)

By Dustin Vannoy Oct 3, 2023 / 2 Comments

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, REST API, Databricks command line interface (CLI), or… Continue Reading

Apache Spark DataKickstart: First Spark SQL Application

By Dustin Vannoy May 18, 2023 / 1 Comment

Get hands-on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache… Continue Reading

Apache Spark DataKickstart: First PySpark Application

By Dustin Vannoy May 1, 2023 / 1 Comment

Get hands-on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset, which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed… Continue Reading

Data Engineer Question and Answer

By Dustin Vannoy Jan 19, 2023 / 2 Comments

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible… Continue Reading

Getting Started with Spark Structured Streaming – Current 22

By Dustin Vannoy Oct 5, 2022 / Leave a comment

I am honored to speak at Current 22. The example notebook that I walk through towards the end is available at https://github.com/datakickstart/datakickstart-databricks-workspace/blob/main/stackoverflow/stackoverflow_streaming.py.

Ingest tables in parallel with an Apache Spark notebook using multithreading

By Dustin Vannoy May 6, 2022 / 2 Comments

Writing a single Apache Spark notebook that processes a list of tables is easy. The catch is that a simple loop over the list runs one table after another (sequentially). If none of the tables are very big, it is quicker to have Spark load them concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest one I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
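As a rough illustration of the idea (not necessarily the exact code from the post), the sketch below submits one task per table to a thread pool from Python's standard library. The `load_table` function, the table names, and the `max_workers` value are all hypothetical placeholders; in a real notebook the function body would hold whatever Spark read and write logic you already run for a single table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def load_table(table_name: str) -> str:
    """Hypothetical stand-in for the per-table work.

    In a notebook this is where the existing single-table Spark logic
    would go; here we only sleep briefly to simulate waiting on I/O.
    """
    time.sleep(0.1)
    return f"loaded {table_name}"


def load_tables_in_parallel(table_names, max_workers=4):
    """Run load_table for each table on a pool of threads.

    Returns a dict mapping each table name to its result, or to a
    failure message if that table raised an exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every table up front so the pool can overlap the work.
        futures = {executor.submit(load_table, name): name for name in table_names}
        # Collect results as each table finishes, in completion order.
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed table should not stop the others.
                results[name] = f"failed: {exc}"
    return results


# Assumed example table names, for illustration only.
tables = ["customers", "orders", "products"]
results = load_tables_in_parallel(tables)
```

Threads suit this case because each task spends most of its time waiting on the cluster or on storage, not computing in the driver's Python process. Capping `max_workers` keeps a long table list from flooding the cluster with concurrent jobs.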

Tag: Data Engineer
