Data engineers and data scientists benefit from best practices learned through years of software development, and data teams should apply these concepts regardless of which platform they use. The videos and write-up below are meant to provide value to any data professional, but the videos focus on how to implement these practices on Databricks. Why am I focused on Databricks? Because I have built this out with many different platform providers and found that Databricks offers a lot of valuable integrations that make following best practices easier. For a more SQL-based ELT approach I have some other recommendations, with or without Databricks as your platform, which I will try to cover in a future video.
In the first video I share why developer experience and best practices matter and why I think Databricks offers the best developer experience for a data platform. I cover the high-level developer lifecycle and 7 ways to improve your team’s development process, with the goal of better quality and reliability.
The next video walks through 3 of the most important practices for building quality analytics solutions. It is meant as an overview of what following these practices looks like for a Databricks developer.
This video covers:
- Version control basics and a demo of Git integration with the Databricks workspace
- Automated tests with pytest for unit testing and Databricks Workflows for integration testing
- CI/CD (including running tests) with GitHub Actions
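To make the unit-testing item above concrete, here is a minimal sketch of the pytest pattern: isolate business logic in a small, easily testable function, then write `test_*` functions that pytest discovers and runs automatically. The function `flight_delay_category` is a hypothetical example, not from the demo repository; the same structure applies to PySpark transformations tested against a local SparkSession.

```python
# Hypothetical transformation: bucket a flight's departure delay
# into a reporting category. Keeping logic in a plain function
# makes it trivial to unit test without a cluster.
def flight_delay_category(delay_minutes: int) -> str:
    """Return a delay category for a given departure delay in minutes."""
    if delay_minutes <= 0:
        return "on_time"
    if delay_minutes <= 15:
        return "minor"
    return "major"


# pytest discovers functions named test_* and reports each failure.
def test_on_time():
    assert flight_delay_category(-5) == "on_time"


def test_minor_delay():
    assert flight_delay_category(10) == "minor"


def test_major_delay():
    assert flight_delay_category(45) == "major"
```

Running `pytest` in the repository root will collect and execute these tests; the same command can later be wired into a GitHub Actions workflow so every pull request is tested automatically.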
Why development process matters
When building any type of application, whether traditional software or a data application, you follow a development process, also known as the software development life cycle (SDLC). It’s important to think about this process and how to streamline each stage: an efficient process leads to high-quality data projects, even in modern environments like Databricks. Here’s a simplified overview of the development process and key best practices for successful projects.
The development process typically involves these stages:
- Planning
- Development
- Testing
- Release

However, in reality this process is often more iterative. When testing reveals issues, developers return to the development stage, and even after release, user feedback can send you back to the planning stage for improvements.
Implementing Best Practices in Databricks
The question development teams often face when they want to improve their process is, “Where do we start?” Below is my list of 7 recommended best practices in order of priority. Smaller teams will often stop at a certain point, such as item 3 or 4. The more an organization values stability and high-quality deliverables, the more effort it will put into automating these practices. The trade-off is that the extra effort will initially slow down building new features; given some time, though, I think these practices will help you consistently deliver new features quickly.
| Recommendation | Description |
|---|---|
| Implement Version Control | Ensure the correct code version is in production, track changes over time, and easily roll back when needed. |
| Run Automated Code Tests | Implement automated testing for code (including PySpark in notebooks and SQL) to catch issues early in the development process. |
| Deploy to Isolated Environments | Use separate environments to test and ensure high-quality, working systems before deploying to production. |
| Run System Health Checks | Implement automated monitoring to check that the deployed system is functioning correctly in production. |
| Perform Data Quality Tests | Verify that the data produced in production meets expectations and quality standards. |
| Automate Data Schema Changes | Streamline the process of modifying table structures and deploying schema changes. |
| Implement Automated Rollback | Develop a rollback plan that will allow you to quickly revert to a previous working version when significant issues are detected after deployment. |
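As a sketch of the “Perform Data Quality Tests” recommendation in the table above, the function below runs a few common checks over a batch of records and returns the names of any failures. The function name, check names, and plain-Python records are hypothetical illustrations; in production the same assertions translate directly into PySpark filters or a framework-level expectation.

```python
def run_quality_checks(rows, key_column, min_rows):
    """Return a list of failed data quality check names for a batch.

    rows:       list of dict-like records produced by a pipeline run
    key_column: name of the column expected to be a unique, non-null key
    min_rows:   minimum number of rows the load is expected to produce
    """
    failures = []
    # Completeness: did the load produce enough data?
    if len(rows) < min_rows:
        failures.append("row_count_below_minimum")
    # Validity: the primary key should never be null.
    if any(r.get(key_column) is None for r in rows):
        failures.append("null_primary_key")
    # Uniqueness: duplicate keys usually indicate a bad join or re-load.
    if len(rows) != len({r.get(key_column) for r in rows}):
        failures.append("duplicate_primary_key")
    return failures
```

An empty result means the batch passed; a non-empty list can be used to fail the job, trigger an alert, or block a downstream task in a Databricks Workflow.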
Resource links
- Repository used for demo – datakickstart/flights-e2e-azure
- youtube.com/DustinVannoy – CICD Playlist
- Best Practices for Unit Testing PySpark
- Testing and DevOps Best Practices for Delta Live Tables
- Develop and Deploy Code Easily With IDEs
- How to Get the Most Out of Databricks Notebooks
- Databricks Asset Bundles: A Unifying Tool for Deployment on Databricks
