In this video, I introduce Apache Spark using the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated analytics platform within the Microsoft Azure cloud. This short demo is meant for those who are curious about PySpark or just want to get…
In this video, I introduce Apache Spark using the Scala language. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated analytics platform within the Microsoft Azure cloud. This short demo is meant for those who are curious about Spark with Scala or just want to get a peek at…
Spark .NET is the C# API for Apache Spark, a popular platform for big data processing. This demo is for you if you are curious to see a sample Spark .NET program in action or are interested in seeing Azure Synapse serverless Apache Spark notebooks. It includes guidance on how you can follow along to build a Spark .NET data load that reads linked sample data, transforms it, joins it to a lookup table, and saves the result as a Delta Lake file in your Azure Data Lake Storage Gen2 account.
Azure Synapse Analytics just went into public preview, so you can now try out its capabilities for yourself. Here is a quick introduction to what it is and why it matters.
Hearing a lot about data lakes but still not sure what the term means or why anyone cares? This video gives a brief introduction to what a data lake is and why so many organizations are adding one to their analytics ecosystem. To show what interacting with a data lake may look like for a typical data analyst, I included a demo of how you would use Spark SQL to query the data lake from Azure Databricks.
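As a rough idea of what querying a data lake with Spark SQL can look like, here is a minimal PySpark sketch. It is not the code from the video. The view name `sales`, the `category` column, and the assumption that the files are Parquet are all hypothetical, and `spark` is assumed to be the SparkSession that a Databricks notebook already provides.

```python
def build_query(view_name, limit=10):
    """Build a Spark SQL statement that counts rows per category.

    The `category` column is a hypothetical example column.
    """
    return (
        f"SELECT category, COUNT(*) AS order_count "
        f"FROM {view_name} GROUP BY category "
        f"ORDER BY order_count DESC LIMIT {limit}"
    )


def query_lake(spark, path, view_name="sales"):
    """Expose Parquet files in the lake as a temp view, then query it with SQL.

    `spark` is an existing SparkSession; `path` points at a folder of
    Parquet files (assumed format) in the data lake.
    """
    spark.read.parquet(path).createOrReplaceTempView(view_name)
    return spark.sql(build_query(view_name))


# Example use inside a notebook (the path is a placeholder):
# query_lake(spark, "abfss://<container>@<account>.dfs.core.windows.net/sales").show()
```

Registering the files as a temporary view is one common way to let analysts write plain SQL against files without first loading them into a database.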
If you are working with Azure Databricks (or many other Azure resources), you may come across the need for a Service Principal in order to configure access to different resources. The steps are fairly straightforward, but the terminology is not consistent, so this video walks through the steps and describes where to find the values to use when you authenticate.
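To make the terminology concrete, here is a hedged sketch of where the Service Principal values typically end up when the resource being accessed is Azure Data Lake Storage Gen2 (an assumption; the video may cover other resources). The helper names are hypothetical. The configuration keys are the Hadoop ABFS OAuth settings, where the "Application (client) ID" becomes the client id and the "Directory (tenant) ID" goes into the token endpoint.

```python
def service_principal_conf(storage_account, client_id, client_secret, tenant_id):
    """Return Spark settings for OAuth access to ADLS Gen2 with a Service Principal.

    client_id     -> "Application (client) ID" in the Azure portal
    tenant_id     -> "Directory (tenant) ID" in the Azure portal
    client_secret -> the secret value created for the app registration
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark, conf):
    """Apply each setting to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In practice the client secret would come from a secret store rather than being typed into a notebook.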
When getting started with Azure Databricks for data processing and analytics, you need to create at least one cluster. Check out the video for a quick overview of how to do this from the Azure Portal. I include a quick description of the options you have and an overview of what cluster…
In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough anymore. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size.
This presentation covers why we need to shift some of our workloads from batch data jobs to real-time streaming. We dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and Event Hubs. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
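For readers who have not seen Structured Streaming code, here is a minimal PySpark sketch of reading from Kafka and counting events per one-minute window. It is not taken from the presentation. The function names, the view name `events`, and the console sink are assumptions chosen for illustration, and `spark` is assumed to be an existing SparkSession with the Kafka connector available.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source; the values passed in are placeholders."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_windowed_counts(spark, options, view_name="events"):
    """Start a streaming query that counts events per one-minute window."""
    reader = spark.readStream.format("kafka")
    for key, value in options.items():
        reader = reader.option(key, value)
    raw = reader.load()

    # The Kafka source exposes binary `value` and a `timestamp` column.
    raw.selectExpr("CAST(value AS STRING) AS body", "timestamp") \
        .createOrReplaceTempView(view_name)

    counts = spark.sql(
        f"SELECT window(timestamp, '1 minute') AS time_window, "
        f"COUNT(*) AS event_count FROM {view_name} "
        f"GROUP BY window(timestamp, '1 minute')"
    )
    # Console sink is for demos only; "complete" re-emits all windows each trigger.
    return counts.writeStream.outputMode("complete").format("console").start()
```

The same batch-style SQL runs continuously here because the view is backed by a streaming DataFrame, which is much of what keeps the move from batch to streaming manageable.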
With the shift to data lakes that use distributed file storage as the foundation, we have been missing the reliability that relational databases provide. Databricks Delta is a data management system focused on bringing more reliability and performance to our data lakes. It sits on top of existing storage, and its API is very similar to the one you already use to read and write files from Spark. This session will present an overview of Delta Lake, why it may be a better option than standard data lake storage, and how you can use it from Azure Databricks.
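To illustrate how close the Delta API is to ordinary file reads and writes, here is a small PySpark sketch. The helper names are hypothetical, and it assumes the Delta Lake library is available on the cluster, as it is on Azure Databricks. The only difference from a Parquet read or write is the format string.

```python
def write_table(df, path, fmt="delta"):
    """Write a DataFrame to storage; pass fmt="parquet" for plain files."""
    df.write.format(fmt).mode("overwrite").save(path)


def read_table(spark, path, fmt="delta"):
    """Read a folder back into a DataFrame using the same format string."""
    return spark.read.format(fmt).load(path)


# Example use inside a notebook (the path is a placeholder):
# write_table(df, "/mnt/lake/events")            # Delta table
# write_table(df, "/mnt/lake/raw", "parquet")    # plain Parquet files
# events = read_table(spark, "/mnt/lake/events")
```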
Slides from my PASS Summit presentation: https://www.slideshare.net/DustinVannoy/passsummit2019azurestorageoptionsforanalytics