Apache Spark | DUSTIN VANNOY

Azure Databricks with Log Analytics – Updated for DBR 11.3+

By Dustin Vannoy Jan 7, 2024 / 1 Comment

This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the past, but these instructions are valid for Databricks Runtimes 11.3+. Log Analytics provides a way to collect and query logs in Azure. For teams that… Continue Reading

Data + AI Summit 2023 – Data Engineer key takeaways

By Dustin Vannoy Jun 30, 2023 / Leave a comment

Data + AI Summit 2023 has just completed with many announcements and deep dives. I attended virtually this year but was just as excited as the in-person attendees for some of the new capabilities that were shared. After watching the keynote presentations and tracking additional posts about new features, I want to summarize the top… Continue Reading

Apache Spark DataKickstart: Read and Write with PySpark

By Dustin Vannoy Jun 21, 2023 / 1 Comment

Every Spark pipeline involves reading data from a data source or table. As data engineers, we usually end our pipelines by writing the transformed data. In this tutorial we walk through some of the most common formats and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake… Continue Reading

Apache Spark DataKickstart: First Spark SQL Application

By Dustin Vannoy May 18, 2023 / 1 Comment

Get hands-on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache… Continue Reading

Apache Spark DataKickstart: First PySpark Application

By Dustin Vannoy May 1, 2023 / 1 Comment

Get hands-on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset, which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed… Continue Reading

Apache Spark DataKickstart – Introduction

By Dustin Vannoy May 1, 2023 / Leave a comment

In this video I provide an introduction to Apache Spark as part of my YouTube course Apache Spark DataKickstart. This video covers why Spark is popular, what it really is, and a bit about ways to run Apache Spark. Please check out other videos in this series by selecting the relevant playlist or subscribe and turn… Continue Reading

Questions Answered: Parallel Load in Spark Notebook

By Dustin Vannoy Jan 9, 2023 / 1 Comment

I received many questions on my tutorial Ingest tables in parallel with an Apache Spark notebook using multithreading. In this video and post I address some of the questions that I couldn’t just answer in the YouTube comments. Watch the video for more complete answers, but here are quick responses with links to examples where… Continue Reading

Getting Started with Spark Structured Streaming – Current 22

By Dustin Vannoy Oct 5, 2022 / Leave a comment

I am honored to speak at Current 22. The example notebook that I walk through towards the end is available at https://github.com/datakickstart/datakickstart-databricks-workspace/blob/main/stackoverflow/stackoverflow_streaming.py.

Ingest tables in parallel with an Apache Spark notebook using multithreading

By Dustin Vannoy May 6, 2022 / 2 Comments

If we want to kick off a single Apache Spark notebook to process a list of tables, the code is easy to write. A simple loop through the list of tables runs one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
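To make the idea concrete, here is a minimal sketch of loading a list of tables concurrently with Python's standard `concurrent.futures` thread pool. It is one common way to do this and not necessarily the exact code from the full post. The `load_table` function, the table names, and the `max_workers` value are hypothetical placeholders; in a real notebook `load_table` would use the existing Spark session to read and write each table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    """Hypothetical placeholder for the per-table work.

    In a real notebook this would use the existing Spark session to read
    from a source and write to a target. Here it only returns a message
    so the sketch runs without a Spark cluster.
    """
    return f"loaded {table_name}"


def load_tables_in_parallel(table_names, max_workers=4):
    """Run load_table for each table on a thread pool and collect outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per table and remember which future maps to which table.
        futures = {executor.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure instead of stopping the other loads.
                results[name] = f"failed: {exc}"
    return results


# Assumed example table names, for illustration only.
tables = ["customers", "orders", "products"]
results = load_tables_in_parallel(tables, max_workers=3)
```

Threads fit here because the driver mostly waits on Spark jobs rather than doing heavy Python work, so several table loads can be in flight at once. Catching exceptions per table keeps one failed load from hiding the results of the others.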

Azure Synapse Spark: External Python Packages

By Dustin Vannoy Jan 20, 2022 / 1 Comment

When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure Synapse serverless Apache Spark pools. For Python code the libraries are packaged as wheel (.whl) files. You can also install Python packages that are available… Continue Reading

