This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more of an introduction please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first place updated with new examples, but I hope to keep adding to the examples in these posts plus additional videos.
Monorepo Setup
The simplest setup for managing your bundle and development workflow is to have a single bundle per repo and deploy all the code artifacts together. However, there are a few reasons why that isn't always the best fit for an organization. If each project needs to be deployed separately, you can have a separate repository for each and keep to one bundle per repository. That adds complexity when code needs to be reused between projects, though, since you now have to keep each repository on the correct version and coordinate deployments. The more small repositories you have, the more overhead there is in switching from one to another, plus multiple commits and PRs for a single feature update. Putting things into the same repository often makes versioning and code reviews a bit simpler, while the folder structure can still separate the work owned by different teams and keep the option to deploy one project/folder at a time. In the monorepo setup, you would have a bundle definition (databricks.yml) in each project/folder within the repo. See the code repository used in the video for examples, specifically the simple_project and complex_project folders.
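As a rough sketch (the folder names come from the example repo, but the bundle name, targets, and workspace host here are placeholders), each project folder carries its own databricks.yml along these lines:

# Repo layout (sketch): each project folder has its own bundle definition
#   simple_project/databricks.yml
#   complex_project/databricks.yml
#   complex_project/resources/   (job definitions included below)
# complex_project/databricks.yml might look roughly like:
bundle:
  name: complex_project

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
    workspace:
      host: https://<your-workspace-url>

You then deploy one project at a time by changing into its folder and running databricks bundle deploy -t dev (or whichever target you want).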
YAML Anchors for code re-use
One thing developers look for is the ability to modularize code for re-use so they can avoid typing the same thing several times and having to change it in more than one place. There are a few ways to help with re-use. One is variables, which can be defined once in the bundle configuration, set differently per target environment, or overridden during bundle deploy. For a more complex config, such as a job cluster definition, you can use YAML anchors. Anchors only work within the same YAML file, so this may lead to grouping more resources together rather than having each in its own file in the included folder.
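For reference, here is a minimal sketch of defining variables and overriding one per target (the variable names match the job example below; the default values are placeholders you would replace with your own Spark version and node type):

variables:
  cluster_spark_version:
    description: Spark runtime version for job clusters
    default: 15.4.x-scala2.12
  cluster_node_type:
    description: Node type for job clusters
    default: Standard_DS3_v2

targets:
  prod:
    variables:
      cluster_node_type: Standard_DS4_v2

You can also override a variable at deploy time, for example with databricks bundle deploy --var="cluster_node_type=Standard_DS5_v2".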
Here is an example that uses YAML anchors to define a job cluster and a set of tags that are shared across jobs:
definitions:
  job_clusters: &mycluster
    - job_cluster_key: my_job_cluster
      new_cluster:
        spark_version: ${var.cluster_spark_version}
        node_type_id: ${var.cluster_node_type}
        autoscale:
          min_workers: 1
          max_workers: 3
  tags_configuration: &tags_configuration
    group: "group1"
    product: "product1"
    owner: "me"
    environment: ${bundle.target}

resources:
  jobs:
    job1:
      name: complex_proj_job1_${bundle.target}
      tasks:
        - task_key: notebook_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
        - task_key: notebook_task2
          depends_on:
            - task_key: notebook_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters: *mycluster

    job2:
      name: complex_proj_job2_${bundle.target}
      tasks:
        - task_key: job2_task
          job_cluster_key: my_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters: *mycluster

    job3:
      name: complex_proj_job3_${bundle.target}
      tasks:
        - task_key: job3_task
          job_cluster_key: my_new_job_cluster
          notebook_task:
            notebook_path: ../src/notebook.ipynb
          max_retries: 0
      job_clusters:
        - job_cluster_key: my_new_job_cluster
          new_cluster:
            spark_version: ${var.cluster_spark_version}
            node_type_id: ${var.cluster_node_type}
            custom_tags:
              <<: *tags_configuration
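If you want to confirm that the anchors and variables resolve the way you expect, running databricks bundle validate from the project folder is a quick way to check the configuration before deploying.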
Shared Python Package (externally built wheel)
For shared library code that you import into notebooks or your main Python scripts, you often want to build it into a deployable package. Databricks Asset Bundles can build your Python package into a wheel and install it on the cluster when your job runs. But if the library needs to be used across multiple bundles, it can be easier to build the wheel yourself and upload it to a Databricks workspace folder or volume.
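For comparison, when a single bundle owns the library you can let the bundle build and attach the wheel itself. A minimal sketch, assuming a setup.py at the bundle root:

artifacts:
  default:
    type: whl
    build: python3 setup.py bdist_wheel
    path: .

A task in that bundle would then point its libraries entry at the built wheel with a relative path such as ../dist/*.whl. The rest of this section sticks with the externally built wheel.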
To build the wheel file, you can use a Python environment with wheel installed and run the following, which builds the package and then uses the Databricks CLI to upload it to a shared workspace folder:
python3 setup.py bdist_wheel
databricks workspace mkdirs /Shared/code
databricks workspace import --overwrite --format "AUTO" --file dist/datakickstart_dabs-0.0.1.20240319.2-py3-none-any.whl /Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
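Note that the upload copies the versioned file from dist/ (your build's version suffix will differ) to a stable file name in /Shared/code, so the job definitions that reference it do not need to change every time a new build is uploaded.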
When referencing a package that is stored in your workspace, you just specify the full path, using the /Workspace prefix as shown in the libraries section below.
resources:
  jobs:
    datakickstart_shared_lib_job:
      name: datakickstart_shared_lib_job_${bundle.target}
      tasks:
        - task_key: shared_lib_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          libraries:
            - whl: /Workspace/Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
          max_retries: 0
Serverless cluster configs
Serverless compute for jobs is now an option within Databricks. It can be a good way to avoid decisions about node type, minimum and maximum worker count, and Spark configurations, and just let Databricks figure those out for you. It also has enhanced autoscaling, which should allow it to scale more aggressively and intelligently for your workload.
resources:
  jobs:
    datakickstart_serverless_shared_lib:
      name: "datakickstart_shared_lib_serverless_${bundle.target}"
      tasks:
        - task_key: shared_lib_task
          python_wheel_task:
            package_name: datakickstart_dabs
            entry_point: main
          environment_key: serverless_demo_env
      tags:
        dev: training
      environments:
        - environment_key: serverless_demo_env
          spec:
            client: "1"
            dependencies:
              - /Workspace/Shared/code/datakickstart_dabs-0.0.1-py3-none-any.whl
              - pytest

    datakickstart_serverless_notebook:
      name: Serverless_notebook_${bundle.target}
      tasks:
        - task_key: serverless_notebook_task1
          notebook_task:
            notebook_path: ../src/dbconnect_examples_standalone.ipynb
            source: WORKSPACE
      queue:
        enabled: true
Conclusion
Databricks Asset Bundles help you follow good software engineering practices and easily deploy your code across isolated environments. They are ready for use in your automated production Databricks deployments. Hopefully these examples (and others included in the video) set you up to get your DABs working for your specific needs.
