If we want to kick off a single Apache Spark notebook to process a list of tables, we can write the code easily. The simple approach of looping through the list ends up processing one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load them concurrently (in parallel) using threads. There are a few different ways to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
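Here is a minimal sketch of the pattern using Python's ThreadPoolExecutor from a notebook; the table names, source path, and load_table helper are placeholders for illustration, so adjust them for your environment:

```python
from concurrent.futures import ThreadPoolExecutor

tables = ["customers", "orders", "products"]  # hypothetical table list

def load_table(table_name):
    # Each thread submits its own Spark job; the notebook's SparkSession can be shared across threads.
    df = spark.read.parquet(f"/mnt/raw/{table_name}")  # placeholder source path
    df.write.mode("overwrite").saveAsTable(table_name)
    return table_name

# Run up to three loads at the same time instead of one after another.
with ThreadPoolExecutor(max_workers=3) as executor:
    for finished in executor.map(load_table, tables):
        print(f"Loaded {finished}")
```

Because the loop mostly waits on the cluster, the threads let Spark schedule the table loads side by side instead of one at a time.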
Spark Streaming Join to Slowly Changing Data
If you are building data pipelines for a video streaming site, you would need to consume analytics about video views in real time. Assume you also need to look up additional user attributes like the subscription level; that information changes very infrequently. However, once a change happens, it's important to start tying usage to the correct subscription right away, so you need to find the best way to look up that info in Apache Spark. With the Delta Lake format, the batch data frame will update in memory without restarting the stream. The video in this post shows an example of this in action. Delta Lake supports updates via the MERGE statement, so you keep the data up to date in your file system and Spark will also update its in-memory data frame.
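As a rough sketch of the pattern, here is a stream-to-static join where the static side is a Delta table that some other job keeps current with MERGE; the paths and column names are assumptions for illustration:

```python
# Static (slowly changing) user attributes, maintained with Delta MERGE by another job.
users = spark.read.format("delta").load("/mnt/delta/users")  # placeholder path

# Streaming video-view events.
views = spark.readStream.format("delta").load("/mnt/delta/video_views")  # placeholder path

# Each micro-batch joins against the latest snapshot of the Delta table,
# so subscription changes show up without restarting the stream.
enriched = views.join(
    users.select("user_id", "subscription_level"), "user_id", "left")

query = (enriched.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/enriched_views")
         .start("/mnt/delta/enriched_views"))
```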
Best Language for Apache Spark
The question comes up often: "What programming language should we choose for our Apache Spark project?" The short answer I give is to choose between Scala or Python. I admit, this is only slightly more helpful than saying "it depends," which I try to avoid. The real question is what are the tradeoffs between the… Continue Reading
Spark Summit Takeaways
I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value. Here are my quick takeaways to try and save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If… Continue Reading
Azure Synapse Analytics: What the WHAT?
Azure Synapse Analytics just went into public preview, so now you can access all kinds of new capabilities. Here is a quick introduction to what it is and why it matters.
Journey of a Data Engineer: From BI Developer to Data Engineer
This is part 2 of my Journey of a Data Engineer series which all started from the question “What’s the best path to be a great data engineer?” Check out Part 1: From College to BI Developer for the path from college through my first role as a BI consultant. In this post I’ll cover the steps… Continue Reading
Data Lake Introduction
Hearing a lot of mention of Data Lakes but still not sure what they are or why anyone cares? This video gives a brief introduction to what a Data Lake is and why so many organizations are adding one to their analytics ecosystem. To show what interacting with a data lake may look like for a typical data analyst, I included a demo of how you would use Spark SQL to query the data lake from Azure Databricks.
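For a feel of what that kind of demo looks like, here is a small sketch of querying raw files in a data lake with Spark SQL; the mount path, view name, and columns are made up for illustration:

```python
# Read raw Parquet files straight from the data lake and expose them to SQL.
page_views = spark.read.parquet("/mnt/datalake/raw/page_views")  # placeholder path
page_views.createOrReplaceTempView("page_views")

# An analyst can now query the lake with plain SQL.
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```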
Spark Streaming with Azure Databricks
In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size.
This presentation covers why we need to shift some of our workloads from batch data jobs to streaming in real time. We dive into how Spark Structured Streaming in Azure Databricks enables this, along with streaming data systems such as Kafka and EventHub. We will discuss the concepts, show how Azure Databricks enables stream processing, and review code examples on a sample data set.
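As a taste of the kind of code the session reviews, here is a sketch of reading a Kafka topic with Structured Streaming and landing it in the lake; the broker address, topic, and paths are assumptions rather than anything from the talk:

```python
# Subscribe to a (hypothetical) Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "video_events")                # placeholder topic
          .load())

# Kafka delivers the payload as bytes; cast it to a string for downstream parsing.
parsed = events.selectExpr("CAST(value AS STRING) AS json_payload", "timestamp")

# Continuously append the raw events to a Delta table, tracking progress in a checkpoint.
query = (parsed.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/video_events")
         .start("/mnt/delta/bronze/video_events"))
```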
Delta Lake on Azure Databricks
With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance into our data lakes. It sits on top of existing storage, and its API is very similar to how you already read and write files from Spark. This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks.
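To show how close the API is to plain file reads and writes, here is a small sketch; the paths and column names are placeholders, not anything from the session itself:

```python
from delta.tables import DeltaTable

# Writing Delta looks like writing Parquet with a different format string.
df = spark.range(0, 5).withColumnRenamed("id", "customer_id")
df.write.format("delta").mode("overwrite").save("/mnt/delta/customers")  # placeholder path

# Reading it back is just as familiar.
customers = spark.read.format("delta").load("/mnt/delta/customers")

# The reliability features, such as MERGE, come from the DeltaTable API (or SQL).
target = DeltaTable.forPath(spark, "/mnt/delta/customers")
updates = spark.range(3, 8).withColumnRenamed("id", "customer_id")
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```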
Azure Databricks from PyCharm IDE
I am pleased to share with you a new, improved way of developing for Azure Databricks from your IDE – Databricks Connect! Databricks Connect is a client library that runs large scale Spark jobs on your Databricks cluster from anywhere you can import the library (Python, R, Scala, Java). It allows you to develop from your computer with your normal IDE features like autocomplete, linting, and debugging. You can work in an IDE you are familiar with but have the Spark actions sent out to the cluster, with no need to install Spark locally.
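As a rough sketch of what that feels like in practice (assuming you have installed the databricks-connect package and run its configure step for your workspace), a local script builds a SparkSession the usual way and the heavy lifting happens remotely:

```python
from pyspark.sql import SparkSession

# With Databricks Connect configured, this session points at the remote cluster.
spark = SparkSession.builder.getOrCreate()

# The count executes on the Databricks cluster, not on the laptop running PyCharm.
df = spark.range(0, 1_000_000)
print(df.count())
```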