I believe AI coding is a big part of the future for data professionals—including data engineering, data science, and analytics engineering. This means that adopting AI for development will be…
The tech industry has evolved rapidly, and AI coding tools are changing how we develop. For Databricks developers, tools like Cursor IDE offer significant productivity gains when used correctly. The…
Data engineers and scientists should apply software development best practices to enhance their processes, particularly on Databricks, which offers valuable integrations. Key focuses include version control, automated testing, and a structured development lifecycle. By adopting these practices, teams can improve quality and reliability in data projects while facilitating faster feature delivery.
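To make the automated-testing idea concrete, here is a minimal sketch of a pytest unit test for a PySpark transformation. The function name `add_full_name` and the columns are made up for illustration; the test runs on a local SparkSession so no cluster is needed.

```python
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def add_full_name(df):
    # Hypothetical transformation under test: concatenate name columns
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so the test runs anywhere, without a cluster
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    result = add_full_name(df).collect()[0]
    assert result["full_name"] == "Ada Lovelace"
```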
This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more introduction…
Monitoring is important, so I’ve covered the topic a few times in the past. I’ve talked about collecting your Spark application logs and Spark metrics. These are a good way…
This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the…
Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams set up…
Every Spark pipeline involves reading data from a data source or table. As data engineers, we usually end the pipeline by writing the transformed data. In this tutorial we walk…
An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what…
If we want to kick off a single Apache Spark notebook to process a list of tables, we can write the code easily. The simple code to loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are a few different options for how to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
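As a minimal sketch of the threaded approach, assuming a notebook environment where `spark` is predefined and using placeholder database and table names:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder table names; substitute your own list
tables = ["customers", "orders", "products"]

def load_table(name: str) -> str:
    # Each call submits its own Spark job; running several threads at once
    # lets Spark schedule the small loads concurrently
    df = spark.read.table(f"source_db.{name}")
    df.write.mode("overwrite").saveAsTable(f"target_db.{name}")
    return name

# Up to four tables load in parallel instead of one after another
with ThreadPoolExecutor(max_workers=4) as pool:
    for finished in pool.map(load_table, tables):
        print(f"Loaded {finished}")
```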
Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and…
Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this…
In this series I share about monitoring Apache Spark with Azure Databricks. Most of the content is relevant even if using open source Apache Spark or any other managed Spark…
The question is raised often, “What programming language should we choose for our Apache Spark project?” The short answer I give is to choose between Scala and Python. I admit,…
In this video, I introduce Apache Spark using the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an…
In this video, I introduce Apache Spark using the Scala language. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated platform for analytics within…
Spark .NET is the C# API for Apache Spark, a popular platform for big data processing. This demo is for you if you are curious to see a sample Spark .NET program in action or are interested in seeing Azure Synapse serverless Apache Spark notebooks. It includes guidance on how to follow along and build a Spark .NET data load that reads linked sample data, transforms the data, joins to a lookup table, and saves it as a Delta Lake file in your Azure Data Lake Storage Gen2 account.
As a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It's time to learn how to build streaming data pipelines. For many data engineers, Apache Kafka is the go-to platform for enabling real-time data pipelines. Let's quickly cover why and how to get started.
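As one way to get started, here is a minimal sketch that reads a Kafka topic with PySpark Structured Streaming. The broker address and topic name are placeholders, and it assumes a SparkSession named `spark` plus the spark-sql-kafka connector package on the cluster.

```python
# Read a stream of records from a Kafka topic
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
)

# Kafka delivers keys and values as binary; cast to strings before parsing
messages = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write to the console for a quick local test of the pipeline
query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```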
Data engineer roles vary, but some core traits stand out for any data engineer. If you missed it, check out my first posts in this series on What is a Data Engineer? and Data Engineer Skills for Success. Let's finish off this series with the traits I see as most critical for success as a data engineer.
Data engineers' job descriptions vary significantly, as they are asked to work on many different projects. Yet, there are categories of skills that are consistently desired in a data engineer and serve as a foundation for learning new technologies. Here are the skills I see as most critical for success as a data engineer.
Dustin Vannoy is a data engineer and solutions architect. His specialties are modern data pipelines and data lakehouse architectures. He loves to share knowledge with the data science and engineering community.
This site is a resource for you to learn about modern data technologies and practices, from kickstart tutorials to blog posts about the latest tips, tricks, and trends. If you are new to data engineering or data science check out the Data Kickstart tutorials.
