Databricks CI/CD: Intro to Asset Bundles (DABs)

Databricks Asset Bundles provide a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, and more. This is a great option for data teams that want to set up CI/CD (Continuous Integration / Continuous Deployment). Some common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles are a good choice for many teams and to see a demo of using them from your local environment or in a CI/CD pipeline.
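
If you want a feel for how this looks in a CI/CD pipeline, here is a minimal sketch of a GitHub Actions workflow that validates and deploys a bundle to the staging target defined later in databricks.yml. This is an illustration only, not part of the example repo: it assumes the databricks/setup-cli action and a DATABRICKS_TOKEN repository secret for authentication.

name: deploy-bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the Databricks CLI (assumes the databricks/setup-cli action)
      - uses: databricks/setup-cli@main
      - name: Validate and deploy the bundle
        env:
          # Hypothetical secret name; the workspace host comes from databricks.yml
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks bundle validate -t staging
          databricks bundle deploy -t staging

Locally, the same two commands (databricks bundle validate and databricks bundle deploy -t staging) do the work; the CI job just automates them.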

Databricks Asset Bundles full example

A full repo with examples is available here: https://github.com/datakickstart/datakickstart_dabs

Let’s start by looking at the bundle file, which defines some base settings and a few target environments, and specifies other resources to include.

# yaml-language-server: $schema=bundle_config_schema.json
# This is a Databricks asset bundle definition for datakickstart_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: datakickstart_dabs

include:
  - resources/*.yml

targets:
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net

  staging:
    # For staging deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      user_name: training@dustinvannoy.com

  # The 'prod' target, used for production deployment.
  prod:
    # For production deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      # This runs as training@dustinvannoy.com in production. Alternatively,
      # a service principal could be used here using service_principal_name
      # (see Databricks documentation).
      user_name: training@dustinvannoy.com
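
The targets above lean on built-in substitutions such as ${bundle.target} and ${workspace.current_user.userName}. When a value needs to change per target, bundles also support custom variables. Here is a minimal sketch, not from the example repo, using a hypothetical node_type variable with a production override:

variables:
  node_type:
    description: Cluster node type to use for job clusters
    default: Standard_D3_v2

targets:
  prod:
    variables:
      node_type: Standard_D4s_v3

Anywhere else in the bundle, the value is then referenced as ${var.node_type}.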

In the include section we point to a folder that holds multiple .yml files defining the resources to include when deploying the bundle. First is a multi-step workflow defined in resources/datakickstart_dabs_jobs.yml.

# The main job for datakickstart_dabs
resources:
  jobs:
    datakickstart_dabs_job:
      name: datakickstart_dabs_job_${bundle.target}

      schedule:
        quartz_cron_expression: '0 30 19 * * ?'
        timezone_id: America/Los_Angeles

      email_notifications:
        on_failure:
          - training@dustinvannoy.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          libraries:
            - pypi:
                package: pytest
          max_retries: 0

        - task_key: refresh_pipeline
          depends_on:
            - task_key: notebook_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.datakickstart_dabs_pipeline.id}
          max_retries: 0

        - task_key: main_task
          depends_on:
            - task_key: refresh_pipeline
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
          max_retries: 0

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2
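
One common tweak, shown here as a sketch rather than something from the example repo: if the notebook task needs to know which target it is running against, the target name can be passed in as a notebook parameter (the env parameter name is hypothetical).

notebook_task:
  notebook_path: ../src/notebook.ipynb
  base_parameters:
    env: ${bundle.target}

Inside the notebook, the value is read through a widget with the same name.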

The second step of the workflow refreshes a Delta Live Tables pipeline, which is defined in the file resources/datakickstart_dabs_pipeline.yml.

# The main pipeline for datakickstart_dabs
resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      clusters:
        - label: "default"
          num_workers: 2
      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src
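
This pipeline publishes tables to the schema named by target. If your workspace uses Unity Catalog, a pipeline definition can also set a catalog, in which case target becomes the schema inside that catalog. A minimal sketch, where the catalog name main is an assumption and not from the example repo:

resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      catalog: main   # hypothetical Unity Catalog name
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb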

Additional examples

Add libraries

tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook.ipynb
    libraries:
      - pypi:
          package: pytest
      - pypi:
          package: requests
      - maven:
          coordinates: com.azure:azure-messaging-eventhubs:5.16.0
      - whl: ../dist/*.whl
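
The whl entry points at ../dist/*.whl, which assumes a wheel exists before the job runs. A bundle can build that wheel for you during databricks bundle deploy if you declare an artifacts section in databricks.yml. A minimal sketch, assuming the project builds with the standard Python build package:

artifacts:
  default:
    type: whl
    build: python3 -m build
    path: .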

References

Data & AI Summit Presentation

Data & AI Summit Repo

Add existing job to bundle
