Claude Code Essentials for Data Professionals

I believe AI coding is a big part of the future for data professionals—including data engineering, data science, and analytics engineering. This means that adopting AI for development will be…

Cursor with Databricks: AI Enhanced Development

The tech industry has evolved rapidly and AI coding tools are changing how we develop. For Databricks developers, tools like Cursor IDE offer significant productivity gains when used correctly. The…

Essential Best Practices for Data Engineers on Databricks

Data engineers and scientists should apply software development best practices to enhance their processes, particularly on Databricks, which offers valuable integrations. Key focuses include version control, automated testing, and a structured development lifecycle. By adopting these practices, teams can improve quality and reliability in data projects while facilitating faster feature delivery.

Databricks Asset Bundles: Advanced Examples

This post and video is covering some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more introduction…

Databricks Monitoring with System Tables

Monitoring is important, so I’ve covered the topic a few times in the past. I’ve talked about collecting your Spark application logs and Spark metrics. These are a good way…

Azure Databricks with Log Analytics – Updated for DBR 11.3+

This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the…

Databricks CI/CD: Intro to Asset Bundles (DABs)

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams setup…

Apache Spark DataKickstart: Read and Write with PySpark

Every Spark pipeline involves reading data from a data source or table. For data engineers we usually end the pipelines by writing the transformed data. In this tutorial we walk…

Data Engineer Question and Answer

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what…

Ingest tables in parallel with an Apache Spark notebook using multithreading

If we want to kick off a single Apache Spark notebook to process a list of tables we can write the code easily. The simple code to loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are some different options of how to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.

Querying Log Analytics using KQL

Intro Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and…

Monitoring Azure Databricks with Log Analytics

Original video Updated Video Log Analytics provides a way to easily query Spark logs and setup alerts in Azure. This provides a huge help when monitoring Apache Spark. In this…

Spark Monitoring video series

In this series I share about monitoring Apache Spark with Azure Databricks. Most of the content is relevant even if using open source Apache Spark or any other managed Spark…

Best Language for Apache Spark

The question is raised often, “What programming language should we choose for our Apache Spark project?” The short answer I give is to choose between Scala or Python. I admit,…

Azure Synapse Spark with Python

In this video, I share with you about Apache Spark using the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an…

Azure Synapse Spark with Scala

In this video, I share with you about Apache Spark using the Scala language. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated platform for analytics within…

Azure Synapse Spark .NET (C#)

Spark .NET is the C# API for Apache Spark - a popular platform for big data processing. This demo is for you if you are curious to see a sample Spark .NET program in action or are interested in seeing Azure Synapse serverless Apache Spark notebooks. This demo includes guidance of how you can follow along to build a Spark .NET data load that reads linked sample data, transforms data, joins to a lookup table, and saves as a Delta Lake file to your Azure Data Lake Storage Gen2 account.

Why Apache Kafka?

As a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It's time to learn how to building streaming data pipelines. For many data engineers, Apache Kafka is the go to platform for enabling real-time data pipelines. Let's quickly cover why and how to get started.

Top Traits of a Data Engineer

Data engineer roles vary but some core traits stand out for any data engineer. If you missed it, check out my first posts in this series on What is a Data Engineer? and Data Engineer Skills for Success. Let's finish off this series with the traits I see as most critical for success as a data engineer.

Data Engineer Skills for Success

Data engineers job descriptions vary significantly as they are asked to work on many different projects. Yet, there are categories of skills that are consistently desired in a data engineer and serve as a foundation for learning new technologies. Here are the skills I see as most critical for success as a data engineer.