My talk for PASS Summit 2023 covers how to load data incrementally, for example from a Delta table's Change Data Feed or from a streamed log of events. Below are some additional thoughts and links to resources for easy reference.
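As a quick illustration of the Change Data Feed idea, here is a minimal PySpark sketch of reading row-level changes from a Delta table. The table name and starting version are placeholders I've assumed for the example, and the table would need change data feed enabled (delta.enableChangeDataFeed = true) for this to return anything.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the row-level changes committed to the table after version 5.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")   # ask Delta for the change feed instead of the current snapshot
    .option("startingVersion", 5)       # placeholder version; could also use startingTimestamp
    .table("my_catalog.my_schema.orders")  # hypothetical table name
)

# Each row carries _change_type (insert, update_preimage, update_postimage, delete),
# _commit_version, and _commit_timestamp alongside the table's own columns.
changes.filter("_change_type != 'update_preimage'").show()
```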
Presentation description:
There has been an increasing push to load data incrementally throughout the day, or even within minutes. Apache Spark and Delta Lake are a great option for doing this at large scale, and they integrate well with other Azure data platform capabilities. Using Azure Databricks for this type of processing gives us the power of Apache Spark and Delta Lake, plus added benefits like Auto Loader and Delta Live Tables. In this session you will learn best practices for incremental data processing and see several techniques for building these data pipelines using Azure Databricks.
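To make the Auto Loader piece concrete, here is a hedged sketch of a stream that incrementally ingests only new files from cloud storage into a Delta table. Auto Loader is Databricks-specific, and the source path, checkpoint locations, and target table name below are placeholders, not values from the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader ("cloudFiles") tracks which files it has already seen,
# so each run only processes files that arrived since the last run.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                        # format of the landed source files
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")  # where inferred schema is stored
    .load("/mnt/raw/orders/")                                   # hypothetical landing path
)

# Append the new records into a bronze Delta table, then stop once the backlog is processed.
(
    stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")    # tracks ingestion progress
    .trigger(availableNow=True)                                 # batch-style run of a streaming source
    .toTable("bronze.orders")                                   # hypothetical target table
)
```

The same readStream definition can be dropped into a Delta Live Tables pipeline; DLT then manages the checkpointing and table creation for you.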
Links mentioned in the talk
Integrating Azure Cosmos DB with Azure Databricks
Detecting deletes with PySpark (a rough sketch of this pattern follows below)
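As a rough illustration of the delete-detection link above, here is one common pattern when the source only provides full snapshots: anti-join the current target keys against the latest extract, and remove whatever is no longer present. The table names and key column ("order_id") are hypothetical, and this is a sketch of the general idea rather than the exact approach in the linked post.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest full extract from the source system (hypothetical staging table).
source_keys = spark.table("staging.orders_latest").select("order_id")

# Target Delta table we keep in sync (hypothetical name).
target = DeltaTable.forName(spark, "silver.orders")

# Keys present in the target but missing from the new extract were deleted upstream.
deleted_keys = (
    target.toDF()
    .select("order_id")
    .join(source_keys, "order_id", "left_anti")
)

# Remove those rows from the target via a Delta MERGE.
(
    target.alias("t")
    .merge(deleted_keys.alias("d"), "t.order_id = d.order_id")
    .whenMatchedDelete()
    .execute()
)
```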
Additional training
https://www.databricks.com/learn
https://learn.microsoft.com/en-us/training/paths/data-engineer-azure-databricks/
https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables
https://learn.microsoft.com/en-us/training/paths/get-started-data-engineering/