Why Apache Kafka?

As a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It's time to learn how to build streaming data pipelines. For many data engineers, Apache Kafka is the go-to platform for real-time data pipelines. Let's quickly cover why, and how to get started.



Spark Streaming with Azure Databricks

In the world of data science we often default to processing in nightly or hourly batches, but that pattern is no longer enough. Our customers and business leaders see that information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size.

This presentation covers why we need to shift some of our workloads from batch jobs to real-time streaming. We dive into how Spark Structured Streaming in Azure Databricks enables this alongside streaming data systems such as Kafka and Event Hubs. We will discuss the concepts and review code examples on a sample data set.