This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more of an introduction please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first place updated with new examples, but I hope to keep adding to the examples in these posts plus additional videos.
Monorepo Setup
The simplest setup for managing your bundle and development workflow is to have a single bundle per repo and deploy all the code artifacts together. However, there are a few reasons why that isn't always the best fit for an organization. If each project needs to be deployed separately, you can have a separate repository for each and keep to one bundle per repository. That adds complexity when code needs to be reused between projects, though, since you now have to keep each repository on the correct version and coordinate deployments. The more small repositories you have, the more overhead there is in switching from one to another, plus multiple commits and PRs for a single feature update. Putting things into the same repository often makes versioning and code reviews a bit simpler, while the folder structure can still separate the work owned by different teams and keep the option to deploy one project/folder at a time. In the monorepo setup, you would have a bundle definition (databricks.yml) in each project/folder within the repo. See the code repository used in the video for examples, specifically the simple_project and complex_project folders.
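As a rough sketch (the folder names come from the example repo, but the bundle name, targets, and workspace host here are placeholders), each project folder carries its own databricks.yml along these lines:

# Repo layout (sketch): each project folder has its own bundle definition
#   simple_project/databricks.yml
#   complex_project/databricks.yml
#   complex_project/resources/   (job definitions included below)
# complex_project/databricks.yml might look roughly like:
bundle:
  name: complex_project

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
    workspace:
      host: https://<your-workspace-url>

You then deploy one project at a time by changing into its folder and running databricks bundle deploy -t dev (or whichever target you want).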
YAML Anchors for code re-use
One thing developers look for is the ability to modularize code for re-use so they can avoid typing the same thing several times and having to change it in more than one place. There are a few ways to help with re-use. One is variables, which can be defined once in the bundle configuration, set differently per target environment, or overridden during bundle deploy. For a more complex config, such as a job cluster definition, you can use YAML anchors. Anchors only work within the same YAML file, so this may lead to grouping more resources together rather than having each in its own file in the included folder.
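For reference, here is a minimal sketch of defining variables and overriding one per target (the variable names match the job example below; the default values are placeholders you would replace with your own Spark version and node type):

variables:
  cluster_spark_version:
    description: Spark runtime version for job clusters
    default: 15.4.x-scala2.12
  cluster_node_type:
    description: Node type for job clusters
    default: Standard_DS3_v2

targets:
  prod:
    variables:
      cluster_node_type: Standard_DS4_v2

You can also override a variable at deploy time, for example with databricks bundle deploy --var="cluster_node_type=Standard_DS5_v2".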
Here is an example that uses YAML anchors to define a job cluster and a set of tags that are shared across jobs:
definitions:
  job_clusters: &mycluster
    - job_cluster_key: my_job_cluster
      new_cluster:
        spark_version: ${var.cluster_spark_version}
        node_type_id: ${var.cluster_node_type}
        autoscale:
          min_workers: 1
          max_workers: 3
  tags_configuration: &tags_configuration
    group: "group1"
    product: "product1"
    owner: "me"
    environment: ${bundle.target}

resources:
  jobs:
    job1:
      name: complex_proj_job1_${bundle.target}
      tasks:
        - task_key: notebook_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
        - task_key: notebook_task2
          depends_on:
            - task_key: notebook_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters: *mycluster

    job2:
      name: complex_proj_job2_${bundle.target}
      tasks:
        - task_key: job2_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters: *mycluster

    job3:
      name: complex_proj_job3_${bundle.target}
      tasks:
        - task_key: job3_task
          job_cluster_key: my_new_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters:
        - job_cluster_key: my_new_job_cluster
          new_cluster:
            spark_version: ${var.cluster_spark_version}
            node_type_id: ${var.cluster_node_type}
            custom_tags:
              <<: *tags_configuration
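If you want to confirm that the anchors and variables resolve the way you expect, running databricks bundle validate from the project folder is a quick way to check the configuration before deploying.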
Shared Python Package (externally built wheel)
For shared library code that you import into notebooks or your main Python scripts, you often want to build it into a deployable package. Databricks Asset Bundles can build your Python package into a wheel and install it on the cluster when your job runs. But if the library needs to be used across multiple bundles, it can be easier to build the wheel yourself and upload it to a Databricks workspace folder or volume.
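For comparison, when a single bundle owns the library you can let the bundle build and attach the wheel itself. A minimal sketch, assuming a setup.py at the bundle root:

artifacts:
  default:
    type: whl
    build: python3 setup.py bdist_wheel
    path: .

A task in that bundle would then point its libraries entry at the built wheel with a relative path such as ../dist/*.whl. The rest of this section sticks with the externally built wheel.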
To build the wheel file, you can use a Python environment with wheel installed and run the following, which builds the package and then uses the Databricks CLI to upload it to a shared workspace folder:
python3 setup.py bdist_wheel
databricks workspace mkdirs /Shared/code
databricks workspace import --overwrite --format "AUTO" --file dist/datakickstart_dabs-0.0.1.20240319.2-py3-none-any.whl /Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
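Note that the upload copies the versioned file from dist/ (your build's version suffix will differ) to a stable file name in /Shared/code, so the job definitions that reference it do not need to change every time a new build is uploaded.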
When referencing a package that is stored in your workspace, you just specify the full path, using the /Workspace prefix as shown in the libraries section below.
resources:
  jobs:
    datakickstart_shared_lib_job:
      name: datakickstart_shared_lib_job_${bundle.target}
      tasks:
        - task_key: shared_lib_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: /Workspace/Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
          max_retries: 0
Serverless cluster configs
Serverless compute for jobs is now an option within Databricks. It can be a good way to avoid decisions about node type, minimum and maximum worker count, and Spark configurations, and just let Databricks figure those out for you. It also has enhanced autoscaling, which should allow it to scale more aggressively and intelligently for your workload.
resources:
  jobs:
    datakickstart_serverless_shared_lib:
      name: "datakickstart_shared_lib_serverless_${bundle.target}"
      tasks:
        - task_key: shared_lib_task
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          environment_key: serverless_demo_env
      tags:
        dev: training
      environments:
        - environment_key: serverless_demo_env
          spec:
            client: "1"
            dependencies:
              - /Workspace/Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
              - pytest

    datakickstart_serverless_notebook:
      name: Serverless_notebook_${bundle.target}
      tasks:
        - task_key: serverless_notebook_task1
          notebook_task:
            notebook_path: ../src/dbconnect_examples_standalone.ipynb
            source: WORKSPACE
      queue:
        enabled: true
Conclusion
Databricks Asset Bundles help you follow good software engineering practices and easily deploy your code across isolated environments. They are ready for use in your automated production Databricks deployments. Hopefully these examples (and others included in the video) set you up to get your DABs working for your specific needs.
