This is an updated video and writeup on setting up and using Log Analytics with your Azure Databricks logs. Some of the content overlaps with what I shared in the past, but these instructions are valid for Databricks Runtimes 11.3+. Log Analytics provides a way to collect and query logs in Azure. For teams that… Continue Reading
Apache Spark DataKickstart: Read and Write with PySpark
Every Spark pipeline involves reading data from a data source or table. As data engineers, we usually end our pipelines by writing the transformed data. In this tutorial we walk through some of the most common formats and cloud storage locations for reading and writing with Spark. We’ll save some of the advanced Delta Lake… Continue Reading
Ingest tables in parallel with an Apache Spark notebook using multithreading
If we want to kick off a single Apache Spark notebook to process a list of tables, the code is easy to write. A simple loop over the list of tables, however, runs one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load the tables concurrently (in parallel) using threads. There are several ways to do this, but I am sharing the easiest one I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
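As a rough illustration of the threading idea, here is a minimal sketch using Python's standard `concurrent.futures` module. It is not necessarily the approach used in the post. The `load_table` function and the table names are hypothetical placeholders; in a real notebook that function would hold the Spark read and write for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name: str) -> str:
    """Hypothetical stand-in for the per-table work.

    In a notebook this is where the Spark read and write for one table
    would go; here it only returns a status string so the sketch runs
    without a Spark session.
    """
    return f"loaded {table_name}"


def load_tables_in_parallel(table_names, max_workers=4):
    """Run load_table for every table on a pool of threads.

    Returns two dicts keyed by table name: successful results and
    captured exceptions, so one failing table does not stop the rest.
    """
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per table and remember which future maps to which table.
        futures = {pool.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors


# Hypothetical table list, for illustration only.
tables = ["customers", "orders", "products"]
results, errors = load_tables_in_parallel(tables, max_workers=3)
```

Threads suit this case because each task mostly waits on the cluster to finish its job, so the driver can submit several jobs at once. The `max_workers` value is a tuning knob; a reasonable starting point depends on cluster size and table sizes.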
Azure Synapse Spark: External Python Packages
When working with an Apache Spark environment you may need to install external libraries or custom packages. In this post I share the steps for installing Python packages to Azure Synapse serverless Apache Spark pools. For Python code the libraries are packaged as wheel (.whl) files. You can also install Python packages that are available… Continue Reading
Azure Synapse Spark: Add Scala/Java Libraries
When working with an Apache Spark environment you may need to install third party libraries or custom packages. In this post I share the steps for installing Java or Scala libraries to Azure Synapse serverless Apache Spark pools. For Java or Scala code the libraries are packaged as JAR files that you add to the… Continue Reading
Spark Monitoring video series
In this series I share how to monitor Apache Spark with Azure Databricks. Most of the content is relevant even if you use open source Apache Spark or any other managed Spark service. I will be adding to this playlist and would love suggestions on what questions you still have about monitoring your Apache Spark workloads.
Spark Streaming join to Slow Changing Data
If you are building data pipelines for a video streaming site, you would need to consume analytics about video views in real time. Assume you need to look up additional user attributes, such as the subscription level, that change very infrequently. However, once a change happens, it's important to start tying usage to the correct subscription right away. So you need to find the best way to look up that info in Apache Spark. With Delta Lake format, the batch data frame will update in memory without restarting the stream. The video in this post shows an example of this in action. Delta Lake supports updates via the merge statement, so you keep the data up to date in your file system and Spark will also update its in-memory data frame.
Best Language for Apache Spark
The question is raised often, “What programming language should we choose for our Apache Spark project?” The short answer I give is to choose between Scala and Python. I admit, this is only slightly more helpful than saying it depends, which I try to avoid. The real question is what are the tradeoffs between the… Continue Reading
Spark Summit Takeaways
I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value. Here are my quick takeaways to try and save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If… Continue Reading
Journey of a Data Engineer: From BI Developer to Data Engineer
This is part 2 of my Journey of a Data Engineer series which all started from the question “What’s the best path to be a great data engineer?” Check out Part 1: From College to BI Developer for the path from college through my first role as a BI consultant. In this post I’ll cover the steps… Continue Reading
