Incremental Data Loading with Azure Databricks

My talk for PASS Summit 2023 covers how to load data incrementally, for example from a Delta Change Data Feed or by streaming a log of events. Below are some additional thoughts and links to resources for easy reference.
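For reference, the basic pattern for reading a Delta table's Change Data Feed incrementally looks roughly like the sketch below. The table name and starting version are placeholders, and the source table is assumed to have change data feed enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the rows that changed since the last version we processed.
# "sales.orders_raw" and startingVersion 5 are placeholder values.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("sales.orders_raw")
)

# Each row carries _change_type (insert / update_preimage / update_postimage / delete)
# plus _commit_version and _commit_timestamp metadata columns.
changes.select("order_id", "_change_type", "_commit_version").show()
```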

Presentation description:

There has been an increasing push to load data incrementally throughout the day or even within minutes. Apache Spark and Delta Lake are a great option for doing this at scale, and they integrate well with other Azure data platform capabilities. Using Azure Databricks for this type of processing gives us the power of Apache Spark and Delta Lake, plus added benefits like Auto Loader and Delta Live Tables. In this session you will learn best practices for incremental data processing and see several techniques for building these data pipelines using Azure Databricks.
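As a rough sketch of the Auto Loader pattern, incrementally ingesting new files as they land in cloud storage looks something like this. The paths, file format, and target table name below are placeholders, not the actual demo values.

```python
# Auto Loader ("cloudFiles") incrementally discovers and ingests new files.
# Paths and the target table name are placeholders.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders")
)

(
    raw_stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
    .trigger(availableNow=True)  # process everything available, then stop
    .toTable("bronze.orders")
)
```

Running with `availableNow=True` gives batch-style scheduling while still processing only files that have not been seen before.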

Links mentioned in the talk

Code repository – https://github.com/datakickstart/datakickstart-databricks-workspace/tree/pass_summit_2023/pass_summit_2023

Auto Loader blog post

Cosmos DB integration with Databricks

Optimize Delta MERGE

Detecting deletes with PySpark
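The "Optimize Delta MERGE" and "Detecting deletes with PySpark" posts above cover those topics in depth. The sketch below only shows the general shape of an incremental upsert that also applies deletes, assuming a `changes` DataFrame like the Change Data Feed read shown earlier; the table and column names are placeholders.

```python
from delta.tables import DeltaTable

# Apply a batch of incremental changes to the target table. In practice you
# would first reduce `changes` to the latest change per key (and drop
# update_preimage rows if it came from Change Data Feed).
target = DeltaTable.forName(spark, "silver.orders")

(
    target.alias("t")
    .merge(
        changes.alias("s"),
        # Including a date/partition column in the join condition helps Delta
        # prune files, a common MERGE optimization.
        "t.order_id = s.order_id AND t.order_date = s.order_date",
    )
    .whenMatchedDelete(condition="s._change_type = 'delete'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll(condition="s._change_type != 'delete'")
    .execute()
)
```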

Additional training

https://www.databricks.com/learn

https://learn.microsoft.com/en-us/training/paths/data-engineer-azure-databricks/

https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables

https://learn.microsoft.com/en-us/training/paths/get-started-data-engineering/

Design a Data Warehouse Schema
