I am honored to speak at Current 22. The example notebook that I walk through towards the end is available at https://github.com/datakickstart/datakickstart-databricks-workspace/blob/main/stackoverflow/stackoverflow_streaming.py.
If we want to kick off a single Apache Spark notebook to process a list of tables we can write the code easily. The simple code to loop through the list of tables ends up running one table after another (sequentially). If none of these tables are very big, it is quicker to have Spark load tables concurrently (in parallel) using threads. There are some different options of how to do this, but I am sharing the easiest way I have found when working with a notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
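As a rough illustration of the threading idea described above, here is a minimal sketch using Python's standard `concurrent.futures` module. The table names and the `load_table` and `load_tables_concurrently` helpers are hypothetical placeholders, not code from the full post; in a real notebook the body of `load_table` would use the shared `spark` session to read and write each table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical table names, for illustration only.
tables = ["customers", "orders", "products"]


def load_table(table_name):
    """Stand-in for the per-table work.

    Assumption: in a notebook, this is where the shared `spark` session
    would read the source table and write it to its destination.
    """
    return f"loaded {table_name}"


def load_tables_concurrently(table_names, max_workers=4):
    """Submit one task per table to a thread pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its table name so results can be labeled.
        futures = {executor.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            # result() re-raises any exception raised inside the worker thread.
            results[name] = future.result()
    return results


results = load_tables_concurrently(tables)
```

Threads fit this case because the notebook's driver mostly waits on Spark to finish each job, so several small loads can be in flight at once. The `max_workers` value of 4 is an arbitrary choice for the sketch and would need tuning to the cluster.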
Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and keep reading for more code examples and written steps to run queries. This covers a few basics as well as a complex query used to…
Log Analytics provides a way to easily query Spark logs and set up alerts in Azure. This is a huge help when monitoring Apache Spark. In this video I walk through the setup steps and a quick demo of this capability for the Azure Databricks log4j output and the Spark metrics. I include written instructions and troubleshooting…
In this series I share how to monitor Apache Spark with Azure Databricks. Most of the content is relevant even if you use open source Apache Spark or any other managed Spark service. I will be adding to this playlist and would love suggestions on what questions you still have about monitoring your Apache Spark workloads.
I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value. Here are my quick takeaways to try and save you time. To keep it real, some sessions were a big miss for me either due to too much detail or not enough focus, but some were awesome. If…
Hearing a lot of mention of Data Lakes but still not sure what that means or why anyone cares? This video will cover a brief introduction to what a Data Lake is and why so many organizations are adding them to their analytics ecosystem. To show what interacting with a data lake may look like for a typical data analyst, I included a demo of how you would use Spark SQL to query the data lake from Azure Databricks.
When getting started with Azure Databricks for data processing and analytics, you need to create at least one cluster. Check out the video for a quick overview of how to do this from the Azure Portal. I include a quick description of the options you have and an overview of what cluster…
In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size.
This presentation covers why we need to shift some of our workloads from batch data jobs to real-time streaming. We dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance into our data lakes. It sits on top of existing storage, and its API is very similar to the one already used for reading and writing files from Spark. This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks.