I believe AI coding is a big part of the future for data professionals across data engineering, data science, and analytics engineering. That means adopting AI for development will be critical for career success. Since the Cursor article and video, I’ve been digging into the AI coding space more and using Claude Code as well… Continue Reading
Cursor with Databricks: AI Enhanced Development
The tech industry is evolving rapidly, and AI coding tools are changing how we develop. For Databricks developers, tools like the Cursor IDE offer significant productivity gains when used correctly. The difference between frustration and success comes down to providing the proper context. In this article and video, I share recommendations for using Cursor with Databricks… Continue Reading
Essential Best Practices for Data Engineers on Databricks
Data engineers and data scientists should apply software development best practices to their work, particularly on Databricks, which offers valuable integrations. Key focus areas include version control, automated testing, and a structured development lifecycle. Adopting these practices helps teams improve the quality and reliability of data projects while delivering features faster.
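As one concrete illustration of the automated-testing practice, here is a minimal pytest sketch for a PySpark transformation. The transform function, column names, and values are illustrative assumptions, not code from the post:

```python
# test_transforms.py - minimal pytest sketch of testing a PySpark transformation
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def add_fare_per_mile(df):
    """Hypothetical transformation under test: derive fare per mile."""
    return df.withColumn("fare_per_mile", col("fare_amount") / col("trip_distance"))


@pytest.fixture(scope="session")
def spark():
    # Local Spark session so the test runs anywhere, not just on Databricks
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_fare_per_mile(spark):
    df = spark.createDataFrame([(10.0, 2.0)], ["fare_amount", "trip_distance"])
    result = add_fare_per_mile(df).collect()[0]
    assert result["fare_per_mile"] == 5.0
```

Tests like this run locally or in a CI pipeline on every commit, which is where the version control and development lifecycle practices pay off.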
Databricks Asset Bundles: Advanced Examples
This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for a fuller introduction please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first to be updated with new examples, however I… Continue Reading
Databricks CI/CD: Intro to Asset Bundles (DABs)
Databricks Asset Bundles provide a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, and more. They are a great option for data teams to set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), or… Continue Reading
Apache Spark DataKickstart: Read and Write with PySpark
Every Spark pipeline involves reading data from a data source or table. As data engineers, we usually end our pipelines by writing out the transformed data. In this tutorial we walk through some of the most common formats and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake… Continue Reading
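As a rough sketch of the read-then-write pattern the tutorial covers (the storage paths, format choices, and NYC Taxi column name here are my own illustrative assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("read-write-example").getOrCreate()

# Read CSV from cloud storage; the bucket path is a placeholder
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3://example-bucket/raw/nyc_taxi/")
)

# Derive a date column to partition the output by
trips = df.withColumn("pickup_date", to_date("tpep_pickup_datetime"))

# Write as Parquet partitioned by date; Delta Lake details are saved for later
(
    trips.write.format("parquet")
    .mode("overwrite")
    .partitionBy("pickup_date")
    .save("s3://example-bucket/curated/nyc_taxi/")
)
```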
Apache Spark DataKickstart: First Spark SQL Application
Get hands-on with Spark SQL (no Python or Scala required) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. The dataset can be found on Databricks or Azure Synapse, or downloaded from the web to wherever you run Apache… Continue Reading
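For a sense of what an all-SQL pipeline looks like, here is a hedged sketch. The spark.sql() wrapper is only there to keep the snippet self-contained, and it assumes the Databricks sample table samples.nyctaxi.trips with tpep_pickup_datetime and fare_amount columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-sql-pipeline").getOrCreate()

# Daily trip counts and average fare, expressed entirely in SQL
daily = spark.sql("""
    SELECT
        CAST(tpep_pickup_datetime AS DATE) AS pickup_date,
        COUNT(*)                           AS trip_count,
        AVG(fare_amount)                   AS avg_fare
    FROM samples.nyctaxi.trips
    GROUP BY CAST(tpep_pickup_datetime AS DATE)
    ORDER BY pickup_date
""")

# Persist the result as a table
daily.write.mode("overwrite").saveAsTable("nyc_taxi_daily_summary")
```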
Apache Spark DataKickstart: First PySpark Application
Get hands-on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset, which can be found on Databricks or Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed… Continue Reading
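A minimal first-application sketch along these lines might look like the following; the table and column names assume the Databricks NYC Taxi sample, so adjust them to wherever you loaded the data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("first-pyspark-app").getOrCreate()

# Read: on Databricks the NYC Taxi data ships as a sample table;
# elsewhere, point spark.read at your downloaded files instead
trips = spark.read.table("samples.nyctaxi.trips")

# Transform: drop zero-distance records and add a derived column
result = (
    trips.filter(col("trip_distance") > 0)
    .withColumn("fare_per_mile", col("fare_amount") / col("trip_distance"))
)

# Write: save the transformed data as a managed table
result.write.mode("overwrite").saveAsTable("nyc_taxi_cleaned")
```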
Apache Spark DataKickstart – Introduction
In this video I provide an introduction to Apache Spark as part of my YouTube course Apache Spark DataKickstart. The video covers why Spark is popular, what it really is, and some of the ways to run Apache Spark. Please check out the other videos in this series by selecting the relevant playlist, or subscribe and turn… Continue Reading
Azure Synapse Analytics Kickstart
In this post I introduce some of the core capabilities of Azure Synapse Analytics and when each is used. I present from the perspective of a data engineer, but it should be easy to translate what is most useful for analysts and data scientists as well. Please continue reading for a quick walkthrough of the capabilities and… Continue Reading
