Spark Summit Takeaways

I am wrapping up my attendance at Spark + AI Summit 2020, and I found a lot of value in it. Here are my quick takeaways to try to save you time. To keep it real, some sessions were a big miss for me, either due to too much detail or not enough focus, but some were awesome. If… Continue Reading


Data Lake Introduction

Are you hearing a lot about data lakes but still not sure what the term means or why anyone cares? This video gives a brief introduction to what a data lake is and why so many organizations are adding one to their analytics ecosystem. To show what interacting with a data lake may look like for a typical data analyst, I included a demo of how you would use Spark SQL to query the data lake from Azure Databricks.
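
As a rough idea of what that kind of query can look like, here is a minimal sketch. It is not the code from the video. The function name, the view name, and the example path are all hypothetical, and it assumes the files are stored as Parquet and that you already have a SparkSession (Databricks notebooks provide one named spark).

```python
def preview_lake_table(spark, path, view_name="lake_events", limit=10):
    """Expose Parquet files in the lake as a temp view and query it with Spark SQL.

    `spark` is an existing SparkSession. `path`, `view_name`, and `limit`
    are illustrative (hypothetical) values, not taken from the video.
    """
    # Read the files where they sit in the lake; nothing is copied.
    spark.read.parquet(path).createOrReplaceTempView(view_name)
    # From here on an analyst can work in plain SQL.
    return spark.sql(f"SELECT * FROM {view_name} LIMIT {limit}")
```

In a notebook you might call preview_lake_table(spark, "/mnt/datalake/events").show(), where the mount path is only a placeholder for wherever your lake is mounted.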


Spark Streaming with Azure Databricks

In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size.

This presentation covers why we need to shift some of our workloads from batch data jobs to streaming in real time. We dive into how Spark Structured Streaming in Azure Databricks enables this, along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, show how Azure Databricks enables stream processing, and review code examples on a sample data set.
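
To give a feel for the shape of a Structured Streaming job, here is a minimal sketch. It is not the code from the presentation. The function name, broker address, topic, and query name are hypothetical, and it assumes an existing SparkSession with the Kafka connector available on the cluster.

```python
def start_kafka_stream(spark, bootstrap_servers, topic, query_name="raw_events"):
    """Start a streaming query that copies Kafka message bodies into an in-memory table.

    `spark` is an existing SparkSession. The broker, topic, and query name
    are hypothetical values chosen for illustration.
    """
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers the payload as bytes, so cast it to a string first.
    bodies = events.selectExpr("CAST(value AS STRING) AS body")
    # The memory sink is handy for demos; query it by name with spark.sql.
    return (
        bodies.writeStream.format("memory")
        .queryName(query_name)
        .outputMode("append")
        .start()
    )
```

The read side looks almost the same as a batch read, with readStream in place of read. Once the query is running, spark.sql("SELECT * FROM raw_events") shows whatever has arrived so far.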


Delta Lake on Azure Databricks

With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance into our data lakes. It sits on top of existing storage, and its API is very similar to the one you already use to read and write files from Spark. This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks.
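
To illustrate how close the API is to ordinary file reads and writes, here is a minimal sketch. It is not taken from the session. The function name and path are hypothetical, and it assumes an existing SparkSession on a cluster where the delta format is available, as it is on Azure Databricks.

```python
def write_and_read(spark, path, fmt="delta"):
    """Write a tiny DataFrame to `path` and read it back in the given format.

    `spark` is an existing SparkSession. `path` is a hypothetical location.
    Swapping fmt between "parquet" and "delta" is the only change needed.
    """
    df = spark.range(5)  # five rows with a single `id` column
    df.write.format(fmt).mode("overwrite").save(path)
    return spark.read.format(fmt).load(path)
```

Calling write_and_read(spark, "/tmp/demo_table") writes a Delta table, while passing fmt="parquet" writes plain Parquet files through the same calls.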


Azure Databricks from PyCharm IDE

I am pleased to share with you a new, improved way of developing for Azure Databricks from your IDE: Databricks Connect! Databricks Connect is a client library that runs large-scale Spark jobs on your Databricks cluster from anywhere you can import the library (Python, R, Scala, Java). It allows you to develop from your computer with your normal IDE features like autocomplete, linting, and debugging. You can work in an IDE you are familiar with and have the Spark actions sent out to the cluster, with no need to install Spark locally.


Apache Spark Introduction

In this video we will quickly cover Apache Spark. The goal is to cover why you would use Spark and where it fits in the data ecosystem. If you just want to get hands-on with Spark, check out one of my next videos on Spark and Databricks. Watch the video to get my overview of Spark and… Continue Reading