Databricks Asset Bundles provide a way to version and deploy Databricks assets such as notebooks, workflows, and Delta Live Tables pipelines. This is a great option to let data teams set up CI/CD (Continuous Integration / Continuous Deployment). Common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), and dbx. You can watch this video to hear why I think Databricks Asset Bundles are a good choice for many teams and see a demo of using them from your local environment or in your CI/CD pipeline.
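For CI/CD, the same bundle commands you run locally can be run from an automated pipeline. Below is a minimal sketch of a GitHub Actions workflow that validates and deploys the bundle to the staging target; the workflow name, trigger, and secret names are placeholders you would adjust for your own setup.

# .github/workflows/deploy_staging.yml (illustrative sketch only)
name: deploy-bundle-staging

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Installs the Databricks CLI used by the bundle commands
      - uses: databricks/setup-cli@main

      # Validate the bundle configuration, then deploy to the staging target
      - run: databricks bundle validate -t staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

      - run: databricks bundle deploy -t staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}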
Databricks Asset Bundles full example
A full repo with examples is available here: https://github.com/datakickstart/datakickstart_dabs
Let’s start by looking at the bundle file (typically named databricks.yml), which defines some base settings and a few target environments, and specifies the other resource files to include.
# yaml-language-server: $schema=bundle_config_schema.json

# This is a Databricks asset bundle definition for datakickstart_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: datakickstart_dabs

include:
  - resources/*.yml

targets:
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net

  staging:
    # For staging deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      user_name: training@dustinvannoy.com

  # The 'prod' target, used for production deployment.
  prod:
    # For production deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      # This runs as training@dustinvannoy.com in production. Alternatively,
      # a service principal could be used here using service_principal_name
      # (see Databricks documentation).
      user_name: training@dustinvannoy.com
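As the comments note, the staging and prod targets run as a named user. If you would rather run production deployments as a service principal, the run_as block can reference one instead. A minimal sketch, assuming you already have a service principal with access to the workspace (the application ID below is a placeholder):

    run_as:
      # Placeholder application ID of a service principal with workspace access
      service_principal_name: "00000000-0000-0000-0000-000000000000"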
In the include section we pointed to a folder containing multiple .yml files that define the resources included when deploying the bundle. The first defines a multi-step workflow and is named resources/datakickstart_dabs_jobs.yml.
# The main job for datakickstart_dabs
resources:
  jobs:
    datakickstart_dabs_job:
      name: datakickstart_dabs_job_${bundle.target}

      schedule:
        quartz_cron_expression: '0 30 19 * * ?'
        timezone_id: America/Los_Angeles

      email_notifications:
        on_failure:
          - training@dustinvannoy.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          libraries:
            - pypi:
                package: pytest
          max_retries: 0

        - task_key: refresh_pipeline
          depends_on:
            - task_key: notebook_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.datakickstart_dabs_pipeline.id}
          max_retries: 0

        - task_key: main_task
          depends_on:
            - task_key: refresh_pipeline
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
          max_retries: 0

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2
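If the wheel's main entry point expects command line arguments, the Python wheel task also accepts a parameters list. A small sketch, assuming a hypothetical --env argument that the entry point parses:

        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
            # Hypothetical arguments passed through to the entry point
            parameters: ["--env", "${bundle.target}"]
          libraries:
            - whl: ../dist/*.whl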
The second step of the workflow refreshes a Delta Live Tables pipeline, which is defined in the file resources/datakickstart_dabs_pipeline.yml.
# The main pipeline for datakickstart_dabs
resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      clusters:
        - label: "default"
          num_workers: 2
      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src
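If the pipeline should write to Unity Catalog rather than the Hive metastore, the pipeline definition can also set a catalog, in which case target is used as the schema name. A minimal sketch, assuming a Unity Catalog catalog named main (a placeholder):

resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      # Placeholder catalog; with Unity Catalog, 'target' acts as the schema
      catalog: main
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb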
Additional examples
Add libraries
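The snippet below extends the notebook task so its cluster installs additional libraries: two PyPI packages, a Maven coordinate, and the project's wheel file.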
tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook.ipynb
    libraries:
      - pypi:
          package: pytest
      - pypi:
          package: requests
      - maven:
          coordinates: com.azure:azure-messaging-eventhubs:5.16.0
      - whl: ../dist/*.whl