Databricks CI/CD: Intro to Asset Bundles (DABs)

Databricks Asset Bundles provides a way to version and deploy Databricks assets such as notebooks, workflows, and Delta Live Tables pipelines. It is a great option for data teams that want to set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.
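
If you just want to see the basic flow, here is a minimal sketch of the CLI commands used to validate and deploy a bundle. This assumes a recent Databricks CLI (version 0.205 or later, which includes the bundle commands) and that authentication to your workspace is already configured.

# Run from the bundle root (the folder containing databricks.yml)
databricks bundle validate

# Deploy to the default target defined in databricks.yml
databricks bundle deploy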

Databricks Asset Bundles full example

A full repo with examples is available here: https://github.com/datakickstart/datakickstart_dabs

After you read (or watch) the intro material, go check out my advanced Databricks Asset Bundles post for more patterns and examples.

Let’s start by looking at the bundle file, databricks.yml, which defines some base settings and a few target environments, and specifies other resources to include.

# yaml-language-server: $schema=bundle_config_schema.json
# This is a Databricks asset bundle definition for datakickstart_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: datakickstart_dabs

include:
  - resources/*.yml

targets:
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net

  staging:
    # For staging deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      user_name: training@dustinvannoy.com

  # The 'prod' target, used for production deployment.
  prod:
    # For production deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      # This runs as training@dustinvannoy.com in production. Alternatively,
      # a service principal could be used here using service_principal_name
      # (see Databricks documentation).
      user_name: training@dustinvannoy.com
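
Since the dev target sets default: true, running databricks bundle deploy with no target flag deploys to dev. The staging and prod targets are selected explicitly with the -t flag; a quick sketch:

# Deploy to the staging target defined above
databricks bundle deploy -t staging

# Deploy to the prod target
databricks bundle deploy -t prod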

In the include section we pointed to a folder with multiple .yml files that define resources to include when deploying the bundle. First is a multi-step workflow defined in resources/datakickstart_dabs_jobs.yml.

# The main job for datakickstart_dabs
resources:
  jobs:
    datakickstart_dabs_job:
      name: datakickstart_dabs_job_${bundle.target}

      schedule:
        quartz_cron_expression: '0 30 19 * * ?'
        timezone_id: America/Los_Angeles

      email_notifications:
        on_failure:
          - training@dustinvannoy.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          libraries:
            - pypi:
                package: pytest
          max_retries: 0

        - task_key: refresh_pipeline
          depends_on:
            - task_key: notebook_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.datakickstart_dabs_pipeline.id}
          max_retries: 0

        - task_key: main_task
          depends_on:
            - task_key: refresh_pipeline
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
          max_retries: 0

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2
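
After a deploy, the job can be triggered from the CLI by its resource key (datakickstart_dabs_job). A sketch, assuming the dev target from databricks.yml:

# Trigger a run of the job defined above by its resource key
databricks bundle run datakickstart_dabs_job -t dev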

The second step of the workflow runs a Delta Live Tables pipeline, which is defined in the file resources/datakickstart_dabs_pipeline.yml.

# The main pipeline for datakickstart_dabs
resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      clusters:
        - label: "default"
          num_workers: 2
      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src
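
The refresh_pipeline task triggers this pipeline as part of the job, but you can also start a pipeline update on its own by passing the pipeline's resource key to bundle run. A sketch, again assuming the dev target:

# Start an update of the pipeline by its resource key
databricks bundle run datakickstart_dabs_pipeline -t dev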

Additional examples

Add libraries

The example below attaches multiple libraries to the notebook task: two PyPI packages, a Maven coordinate, and a Python wheel from the bundle's dist folder.

tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook.ipynb
    libraries:
      - pypi:
          package: pytest
      - pypi:
          package: requests
      - maven:
          coordinates: com.azure:azure-messaging-eventhubs:5.16.0
      - whl: ../dist/*.whl
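
After editing a task like this, redeploy the bundle for the change to take effect. Since the whl entry points at ../dist/*.whl, the wheel has to exist locally before deploying (unless the bundle builds it through an artifacts section). A sketch, assuming the wheel is built with the standard Python build module:

# Build the wheel into dist/ (requires the 'build' package: pip install build)
python -m build --wheel

# Redeploy so the updated libraries are attached to the task
databricks bundle deploy -t dev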

References

Data & AI Summit Presentation

Data & AI Summit Repo

Add existing job to bundle
