Data + AI Summit 2023 has just completed with many announcements and deep dives. I attended virtually this year but was just as excited as the in-person attendees for some of the new capabilities that were shared. After watching the keynote presentations and tracking additional posts about new features, I want to summarize the top… Continue Reading
Apache Spark DataKickstart: Read and Write with PySpark
Every Spark pipeline involves reading data from a data source or table. As data engineers, we usually end our pipelines by writing the transformed data. In this tutorial we walk through some of the most common formats and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake… Continue Reading
Apache Spark DataKickstart: First Spark SQL Application
Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache… Continue Reading
Apache Spark DataKickstart: First PySpark Application
Get hands on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed… Continue Reading
Apache Spark DataKickstart – Introduction
In this video I provide an introduction to Apache Spark as part of my YouTube course Apache Spark DataKickstart. This video covers why Spark is popular, what it really is, and a bit about ways to run Apache Spark. Please check out other videos in this series by selecting the relevant playlist or subscribe and turn… Continue Reading
Questions Answered: Parallel Load in Spark Notebook
I received many questions on my tutorial Ingest tables in parallel with an Apache Spark notebook using multithreading. In this video and post I address some of the questions that I couldn’t just answer in the YouTube comments. Watch the video for more complete answers but here are quick responses with links to examples where… Continue Reading
Getting Started with Spark Structured Streaming – Current 22
I am honored to speak at Current 22. The example notebook that I walk through towards the end is available at https://github.com/datakickstart/datakickstart-databricks-workspace/blob/main/stackoverflow/stackoverflow_streaming.py.
Monitor Synapse Spark with Log Analytics
Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through the setup steps and a quick demo of this capability for the Azure Synapse Spark log4j output. I include written instructions and troubleshooting guidance in this… Continue Reading
Ingest tables in parallel with an Apache Spark notebook using multithreading
If we want to kick off a single Apache Spark notebook to process a list of tables, the code is easy to write. A simple loop through the list of tables, however, runs one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
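As a rough illustration of the idea, here is a minimal sketch of loading a list of tables with threads. It assumes Python's standard `concurrent.futures.ThreadPoolExecutor`, which may not be the exact approach used in the full post. The `load_table` function, the table names, and the `max_workers` value are all hypothetical placeholders; in a real notebook `load_table` would hold the Spark read and write for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    """Hypothetical stand-in for the per-table work.

    In a notebook, the Spark read/write for one table would go here.
    This version only returns a status string so the sketch runs without Spark.
    """
    return f"loaded {table_name}"


def load_tables_in_parallel(table_names, max_workers=4):
    """Submit one task per table to a thread pool and collect the outcomes."""
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its table name so failures can be reported.
        futures = {pool.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed table should not stop the others from loading.
                errors[name] = exc
    return results, errors


# Hypothetical table names, for illustration only.
tables = ["customers", "orders", "products"]
results, errors = load_tables_in_parallel(tables)
```

Threads suit this case because each task spends most of its time waiting on Spark jobs rather than running Python code, so several tables can be in flight at once. Keeping `max_workers` modest avoids submitting more concurrent jobs than the cluster can usefully run.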
Azure Synapse Spark: External Python Packages
When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure Synapse serverless Apache Spark pools. For Python code the libraries are packaged as wheel (.whl) files. You can also install Python packages that are available… Continue Reading