Databricks Asset Bundles provides a way to version and deploy Databricks assets such as notebooks, workflows, and Delta Live Tables pipelines. This makes it a great option for data teams that want to set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, the REST API, the Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and to see a demo of using it from your local environment or in your CI/CD pipeline.
Databricks Asset Bundles full example
A full repo with examples is available here: https://github.com/datakickstart/datakickstart_dabs
After you read (or watch) the intro material, go check out my advanced Databricks Asset Bundles post for more patterns and examples.
Let’s start by looking at the bundle file, which defines some base settings, declares a few target environments, and specifies the other resource files to include.
# yaml-language-server: $schema=bundle_config_schema.json

# This is a Databricks asset bundle definition for datakickstart_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: datakickstart_dabs

include:
  - resources/*.yml

targets:
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net

  staging:
    # For staging deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      user_name: training@dustinvannoy.com

  # The 'prod' target, used for production deployment.
  prod:
    # For production deployments, we only have a single copy, so we override the
    # workspace.root_path default of
    # /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
    # to a path that is not specific to the current user.
    mode: production
    workspace:
      host: https://adb-7923111111111114.14.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.target}/${bundle.name}
    run_as:
      # This runs as training@dustinvannoy.com in production. Alternatively,
      # a service principal could be used here using service_principal_name
      # (see Databricks documentation).
      user_name: training@dustinvannoy.com
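As the comments in the staging and prod targets point out, the run_as identity can be a service principal rather than a named user. A minimal sketch of that alternative (the application ID below is a placeholder, not a real principal):

run_as:
  # Placeholder application ID for a hypothetical service principal.
  service_principal_name: "00000000-0000-0000-0000-000000000000"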
In the include section we pointed to a folder containing multiple .yml files that define the resources to include when deploying the bundle. First is a multi-step workflow defined in resources/datakickstart_dabs_jobs.yml.
# The main job for datakickstart_dabs
resources:
  jobs:
    datakickstart_dabs_job:
      name: datakickstart_dabs_job_${bundle.target}

      schedule:
        quartz_cron_expression: '0 30 19 * * ?'
        timezone_id: America/Los_Angeles

      email_notifications:
        on_failure:
          - training@dustinvannoy.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          libraries:
            - pypi:
                package: pytest
          max_retries: 0

        - task_key: refresh_pipeline
          depends_on:
            - task_key: notebook_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.datakickstart_dabs_pipeline.id}
          max_retries: 0

        - task_key: main_task
          depends_on:
            - task_key: refresh_pipeline
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
          max_retries: 0

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2
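The main_task above installs the project wheel from ../dist, so the .whl needs to exist at deploy time. One option is to build it yourself before deploying; another is to let the bundle build it through an artifacts section in databricks.yml. A minimal sketch (the artifact name, build command, and path are assumptions, not taken from the example repo):

# Hypothetical artifacts block for databricks.yml; adjust the build command and path to your project.
artifacts:
  datakickstart_dabs:
    type: whl
    # Assumes a standard Python package with pyproject.toml or setup.py at the repo root.
    build: python -m build --wheel
    path: .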
The second step of the workflow is a Delta Live Tables pipeline, which is defined in the file resources/datakickstart_dabs_pipeline.yml.
# The main pipeline for datakickstart_dabs
resources:
  pipelines:
    datakickstart_dabs_pipeline:
      name: datakickstart_dabs_pipeline_${bundle.target}
      target: datakickstart_dabs_${bundle.target}
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb

      clusters:
        - label: "default"
          num_workers: 2

      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src
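Since all targets share these resource definitions, target-specific tweaks belong back in databricks.yml rather than in copies of this file. For example, a prod target could override the pipeline's cluster size; the worker count below is illustrative and not part of the example repo:

# Hypothetical override inside databricks.yml; values are illustrative.
targets:
  prod:
    resources:
      pipelines:
        datakickstart_dabs_pipeline:
          clusters:
            - label: "default"
              num_workers: 4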
Additional examples
Add libraries
tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../src/notebook.ipynb
    libraries:
      - pypi:
          package: pytest
      - pypi:
          package: requests
      - maven:
          coordinates: com.azure:azure-messaging-eventhubs:5.16.0
      - whl: ../dist/*.whl
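Deploy from a CI/CD pipeline
The same commands used locally (databricks bundle validate, databricks bundle deploy, and databricks bundle run) also work from a CI/CD pipeline. Below is a rough GitHub Actions sketch, not part of the example repo; the setup-cli action reference, secret name, and trigger are assumptions you will need to adapt to your environment:

# Hypothetical GitHub Actions workflow; adjust triggers, secrets, and targets for your setup.
name: deploy-bundle

on:
  push:
    branches: [main]

jobs:
  deploy_staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Installs the Databricks CLI (assumes the databricks/setup-cli action).
      - uses: databricks/setup-cli@main

      - name: Validate and deploy to the staging target
        run: |
          databricks bundle validate -t staging
          databricks bundle deploy -t staging
          # Optionally trigger the workflow after deploying:
          # databricks bundle run datakickstart_dabs_job -t staging
        env:
          # Personal access token stored as a repository secret; the workspace host
          # comes from workspace.host in databricks.yml.
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}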
