Databricks CI/CD: Intro to Asset Bundles (DABs)

Databricks Asset Bundles provides a way to version and deploy Databricks assets such as notebooks, workflows, and Delta Live Tables pipelines. It is a great option for data teams that want to set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.
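
If you just want to see the basic flow, here is a minimal sketch of the CLI commands used to validate and deploy a bundle. This assumes a recent Databricks CLI (version 0.205 or later, which includes the bundle commands) and that authentication to your workspace is already configured.

# Run from the bundle root (the folder containing databricks.yml)
databricks bundle validate

# Deploy to the default target defined in databricks.yml
databricks bundle deploy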

Databricks Asset Bundles full example

A full repo with examples is available here: https://github.com/datakickstart/datakickstart_dabs

After you read (or watch) the intro material, go check out my advanced Databricks Asset Bundles post for more patterns and examples.

Let’s start by looking at the bundle file, databricks.yml, which defines some base settings and a few target environments, and specifies other resources to include.

# yaml-language-server: $schema=bundle_config_schema.json
# This is a Databricks asset bundle definition for datakickstart_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: datakickstart_dabs

include:
  - resources/*.yml

targets:
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net

  staging:
    # For staging deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      user_name: training@dustinvannoy.com

  # The 'prod' target, used for production deployment.
  prod:
    # For production deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      # This runs as training@dustinvannoy.com in production. Alternatively,
      # a service principal could be used here using service_principal_name
      # (see Databricks documentation).
      user_name: training@dustinvannoy.com
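
Since the dev target sets default: true, running databricks bundle deploy with no target flag deploys to dev. The staging and prod targets are selected explicitly with the -t flag; a quick sketch:

# Deploy to the staging target defined above
databricks bundle deploy -t staging

# Deploy to the prod target
databricks bundle deploy -t prod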

In the include section we pointed to a folder with multiple .yml files that define resources to include when deploying the bundle. First is a multi-step workflow defined in resources/datakickstart_dabs_jobs.yml.

# The main job for datakickstart_dabs
resources:
  jobs:
    datakickstart_dabs_job:
      name: datakickstart_dabs_job_${bundle.target}

      schedule:
        quartz_cron_expression: '0 30 19 * * ?'
        timezone_id: America/Los_Angeles

      email_notifications:
        on_failure:
          - training@dustinvannoy.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          libraries:
            - pypi:
                package: pytest
          max_retries: 0

        - task_key: refresh_pipeline
          depends_on:
            - task_key: notebook_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.datakickstart_dabs_pipeline.id}
          max_retries: 0

        - task_key: main_task
          depends_on:
            - task_key: refresh_pipeline
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
          max_retries: 0

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2
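
After a deploy, the job can be triggered from the CLI by its resource key (datakickstart_dabs_job). A sketch, assuming the dev target from databricks.yml:

# Trigger a run of the job defined above by its resource key
databricks bundle run datakickstart_dabs_job -t dev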

The second step of the workflow runs a Delta Live Tables pipeline, which is defined in the file resources/datakickstart_dabs_pipeline.yml.

# The main pipeline for datakickstart_dabs
resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      clusters:
        - label: "default"
          num_workers: 2
      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src
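
The refresh_pipeline task triggers this pipeline as part of the job, but you can also start a pipeline update on its own by passing the pipeline's resource key to bundle run. A sketch, again assuming the dev target:

# Start an update of the pipeline by its resource key
databricks bundle run datakickstart_dabs_pipeline -t dev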

Additional examples

Add libraries

The example below attaches multiple libraries to the notebook task: two PyPI packages, a Maven coordinate, and a Python wheel from the bundle's dist folder.

tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook.ipynb
    libraries:
      - pypi:
          package: pytest
      - pypi:
          package: requests
      - maven:
          coordinates: com.azure:azure-messaging-eventhubs:5.16.0
      - whl: ../dist/*.whl
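
After editing a task like this, redeploy the bundle for the change to take effect. Since the whl entry points at ../dist/*.whl, the wheel has to exist locally before deploying (unless the bundle builds it through an artifacts section). A sketch, assuming the wheel is built with the standard Python build module:

# Build the wheel into dist/ (requires the 'build' package: pip install build)
python -m build --wheel

# Redeploy so the updated libraries are attached to the task
databricks bundle deploy -t dev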

References

Data & AI Summit Presentation

Data & AI Summit Repo

Add existing job to bundle
