Data + AI Summit 2023 has just completed with many announcements and deep dives. I attended virtually this year but was just as excited as the in-person attendees for some…
Continue ReadingEvery Spark pipeline involves reading data from a data source or table. For data engineers we usually end the pipelines by writing the transformed data. In this tutorial we walk…
Continue ReadingAn aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what…
Continue ReadingIf you choose to use Snowflake along with Azure for your data platform, you will have to make choices on how to load the data. Landing processed data into your…
Continue ReadingIn this tutorial we cover some basic but realistic examples of loading from CSV or Parquet files. The source data is in partitioned folders following a pattern of puYear=#### and puMonth=##, but we do not use the partition columns until the last example.
Continue ReadingLog Analytics provides a way to easily query Spark logs and setup alerts in Azure. This provides a huge help when monitoring Apache Spark. In this video I walk through…
Continue ReadingIf we want to kick off a single Apache Spark notebook to process a list of tables we can write the code easily. The simple code to loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are some different options of how to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
Continue ReadingIn this post I introduce some of the core capabilities of Azure Synapse Analytics and when they are used. I present from the perspective of data engineer but it should…
Continue ReadingFor production uses of Azure Synapse there are benefits to implementing Continuous Integration (CI) and Continuous Deployment (CD). Implementing CI/CD includes the need to deploy the Azure infrastructure in an automated way. In this post, I share things I learned that may be helpful for you. I also have a few links to other content that was helpful for me to get an environment setup.
Continue ReadingWhen working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure…
Continue ReadingWhen working with an Apache Spark environment you may need to install third party libraries or custom packages. In this post I share the steps for installing Java or Scala…
Continue ReadingReal-time data processing is becoming more common in companies of all sizes. The use cases range from simple stream ingestion to complex machine learning pipelines. If you need to get started with streaming in Azure, Stream Analytics gives you a simple way to get up and running. Most of my streaming projects involve Apache Kafka and Spark which can take a lot of setup (or at least involving additional vendors to simplify the experience). Those technologies are great especially for challenging streaming pipelines, but if your data platform is within Azure you should consider if Stream Analytics will meet your needs.
Continue ReadingIntro Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and…
Continue ReadingLog Analytics provides a way to easily query Spark logs and setup alerts in Azure. This provides a huge help when monitoring Apache Spark. In this video I walk through…
Continue ReadingIn this series I share about monitoring Apache Spark with Azure Databricks. Most of the content is relevant even if using open source Apache Spark or any other managed Spark…
Continue ReadingThe question is raised often, “What programming language should we choose for our Apache Spark project?” The short answer I give is to choose between Scala or Python. I admit,…
Continue ReadingIn this video, I share with you about Apache Spark using the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an…
Continue ReadingIn this video, I share with you about Apache Spark using the Scala language. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated platform for analytics within…
Continue ReadingSpark .NET is the C# API for Apache Spark - a popular platform for big data processing. This demo is for you if you are curious to see a sample Spark .NET program in action or are interested in seeing Azure Synapse serverless Apache Spark notebooks. This demo includes guidance of how you can follow along to build a Spark .NET data load that reads linked sample data, transforms data, joins to a lookup table, and saves as a Delta Lake file to your Azure Data Lake Storage Gen2 account.
Continue ReadingAs a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It's time to learn how to building streaming data pipelines. For many data engineers, Apache Kafka is the go to platform for enabling real-time data pipelines. Let's quickly cover why and how to get started.
Continue ReadingDustin Vannoy is a consultant in data analytics and engineering. His specialties are modern data pipelines, data lakes, and data warehouses. He loves to share knowledge with the data science community.
This site is a resource for you to learn about modern data technologies and practices, from kickstart tutorials to blog posts about the latest tips, tricks, and trends. If you are new to data engineering or data science check out the Data Kickstart tutorials.