Data + AI Summit 2023 has just completed with many announcements and deep dives. I attended virtually this year but was just as excited as the in-person attendees for some of the new capabilities that were shared. After watching the keynote presentations and tracking additional posts about new features, I want to summarize the top takeaways from my point of view as a Data Engineer.
Disclaimer: I am employed by Databricks, but any opinions in this post are my own. I am summarizing a portion of the talks and will include references to the real announcement or related details.
Let’s get into it and keep this short(ish)!
I created a video to cover this also if you prefer that: https://youtube.com/watch?v=32qa64reZ0I
With Databricks Unity Catalog, you will be able to manage and work with data from your other data warehouse systems right from Databricks Lakehouse. Basically, you connect data sources like Snowflake or SQL Server, and those sources just show up as a catalog. You can add more metadata and these DWs will show up within lineage. You can even set access controls for those accessing your external DWs from Unity Catalog.
This is a big deal because now you can explore data across different warehouses or relational databases from within Databricks SQL. This should enable some of the data exploration use cases and maybe even basic reporting across systems without having to import data. When it comes time to import data to your Lakehouse, this may be a way to easily connect that data. I still need to explore this before I can say for sure if this would be good for automated jobs, but I am very excited about how it helps with ad-hoc analysis and exploratory data analysis.
This one sounds like it’s only meant for AI/ML engineers or business users, but our worlds are colliding. I think this is important for how Data Engineers will enable the users of their data platforms. We’ve been in a world where people learn a development language in order to build data tools and dashboards, usually by writing Python or SQL. With Large Language Models (LLMs), we can enable these capabilities for the whole enterprise. Many attempts at this have failed because they lack the context to get you to the right answers. With Lakehouse IQ, Databricks has a knowledge engine that will drive various automated assistant capabilities. Think of GitHub Copilot, or even ChatGPT: you want help in similar ways, but customized to understand your own metadata, jargon, and organization structure. If you watched the Microsoft Build keynotes, they talked a lot about AI tools “grounded” with different context and capabilities. In those terms, Lakehouse IQ allows for assistants that are grounded in your organization’s data via Unity Catalog when working within the Databricks Lakehouse.
But wait, there’s more… this is the really exciting part to me. It will be available via API, so you can use it from your own apps. If you are building apps with LangChain, you can use a DatabricksAgent as part of your application to get better accuracy when asking questions about your organization’s data. See the part of the talk by Matei Zaharia and the demo by Weston Hutchins to catch up on these capabilities.
Partitioning is a very important part of designing your data lake or lakehouse. You pick one column (or a few related columns) whose values group your data. This happens by creating a subfolder for each value, so once you write the data it can be a pain to change your mind (since that causes a full rewrite).
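To make the subfolder-per-value layout concrete, here is a tiny illustrative sketch (plain Python, not Spark itself) of how Hive-style partitioning maps a column value to a directory. The column name and file contents are made up for the example.

```python
import os
import tempfile

# Illustrative sketch: Hive-style partitioning writes one subfolder per
# partition value, e.g. <root>/country=US/part-0.txt.
rows = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": 20},
    {"country": "DE", "amount": 30},
]

root = tempfile.mkdtemp()
for i, row in enumerate(rows):
    # The partition column's value becomes a directory name.
    part_dir = os.path.join(root, f"country={row['country']}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, f"part-{i}.txt"), "w") as f:
        f.write(str(row["amount"]))

print(sorted(os.listdir(root)))  # -> ['country=DE', 'country=US']
```

Notice why changing your mind is painful: repartitioning by a different column means every file has to move to a new folder, i.e. a full rewrite of the data.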
Liquid clustering replaces partitioning with a less rigid option. It groups data together on clustering columns and intelligently determines file size based on the actual data, so the data-skew problems typical of partitioning don’t apply. It allows for faster writes, self-tuning to avoid over- and under-partitioning, and the ability to do partial clustering of new data. Databricks announced it is committed to adding this to OSS Delta Lake (after the Delta 3.0 release).
Why is this exciting? Because partitioning is often a scary decision and can have unintended consequences if you choose poorly. This is meant to resolve that issue while also being more efficient and performant.
So let’s be real: some data engineers will love this one, and others couldn’t care less. If you have been all in on Delta, you can probably skip to the next section. The quick background is that Parquet is a powerful file format for storing data in a portable way, much better than a format like CSV or JSON. However, it has its downsides: frequent rewrites of data, no isolation when multiple jobs write, etc.
The limitations of Parquet have been solved in different ways by different newer formats, most notably Delta Lake, Iceberg, and Hudi. Each of these adds metadata on top of the underlying Parquet files in order to offer new benefits. Different data systems have chosen different formats to support well, which leaves data engineers trying to decide which format to use, and at times writing to more than one format just to meet all the downstream use cases.
Databricks announced UniForm, an option to write data as Parquet with metadata for Delta, Iceberg, and Hudi. Now you can write in the Delta Lake format and still use the data from systems that only support Iceberg or Hudi. Super exciting if you need to support multiple formats!
Check out the demo and documentation for how to use it. As of Databricks Runtime 13.2, it looks like you enable it with TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg').
Databricks SQL Performance Improvements
Reynold told a great story in his keynote, so let’s just make a long story short (and include a link to the long one).
Predictive IO uses AI to learn about workloads to give better read performance. According to Reynold’s slides, it can “Triangulate where your data is without having to do manual tuning”. There is also a capability called Deletion Vectors, which speeds up deletes and updates by not rewriting the full file for every change.
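The deletion vector idea is easy to see in miniature. This is a conceptual sketch only (not Delta Lake’s real on-disk format): the data file stays immutable, and deletes are recorded as row positions in a small side structure that readers skip.

```python
# Conceptual sketch of a deletion vector: instead of rewriting a big file
# to drop a row, keep the file immutable and record deleted row positions
# in a tiny side structure.
data_file = ["alice", "bob", "carol", "dave"]  # stands in for a Parquet file
deletion_vector = set()                        # positions marked as deleted

def delete_row(position):
    # A delete only touches the small vector, never the big file.
    deletion_vector.add(position)

def read_rows():
    # Readers filter out any position present in the deletion vector.
    return [row for i, row in enumerate(data_file) if i not in deletion_vector]

delete_row(1)  # "delete" bob without rewriting the file
print(read_rows())  # -> ['alice', 'carol', 'dave']
```

The win is that a one-row delete costs a tiny metadata write instead of rewriting an entire file; a later OPTIMIZE can physically compact the deletions away.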
Predictive Optimization will automatically set your data layout by choosing file size, clustering, and running OPTIMIZE, VACUUM, ANALYZE, and CLUSTERING commands for you.
Intelligent Workload Management is a capability where AI models will continuously learn from your workload history to better know whether to spin up additional compute or run a query immediately based on predicted workload size and timing.
Why is this exciting? We get better performance for BI queries like what Power BI and Tableau will use when connecting to Databricks. We get that performance with less manual tuning. And if we can make Databricks SQL Warehouse fast enough (which is actually being done by real customers), then we don’t have to copy data. I think every data engineer would prefer to make fewer copies of the data. If you don’t feel this way, we need to have a chat.
While this isn’t my specialty (yet), the capabilities announced to help on AI/ML workloads are going to be helpful for a lot of teams out there so I’ll give them a quick mention.
Vector Search is a way to store embedded documents so you can easily find similar docs when you get new input to an LLM-based application. So let’s pretend you have an application leveraging an LLM to help with query tuning in your system. If someone asks “How do I improve performance for my last query?”, you can look for similar questions and related answers to home in on what to recommend.
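To make the similarity-lookup idea concrete, here is a minimal pure-Python sketch of cosine similarity over toy embeddings. This is the concept only, not the Databricks Vector Search API; the documents and vectors are invented for illustration.

```python
import math

# Cosine similarity: how aligned two embedding vectors are (1.0 = identical
# direction). Real systems use learned embeddings with hundreds of dimensions.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend these are embeddings of previously answered tuning questions.
docs = {
    "add an index to the join column": [0.9, 0.1, 0.0],
    "increase cluster size": [0.1, 0.9, 0.1],
}
query_embedding = [0.85, 0.2, 0.05]  # embedding of the new question

# Retrieve the most similar past question to ground the LLM's answer.
best = max(docs, key=lambda d: cosine(query_embedding, docs[d]))
print(best)  # -> 'add an index to the join column'
```

A vector store does exactly this at scale: embed, index, and return the nearest neighbors as context for the model.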
In addition, the Feature Serving capability stores the latest features, which can be looked up in real time to add more context. In this case, it needs to know the user’s last query, plus probably some features that capture metrics about the longest-running operations in that query and maybe the configuration of the tables involved.
Now how does all this get used? A common approach is to use LangChain, a popular open source library that ties together everything needed to make an LLM-driven application work well. And by work well, I mean with proper context and up-to-date information.
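Conceptually, what a chain does is: retrieve similar context, look up fresh features, and assemble a grounded prompt for the model. Here is a plain-Python sketch of that loop; every function name here is a stand-in I made up, not a LangChain or Databricks API.

```python
# Plain-Python sketch of the chaining pattern: retrieve context, look up
# up-to-date features, and assemble a grounded prompt. All functions are
# illustrative stand-ins, not real LangChain or Databricks calls.
def retrieve_similar(question):
    # Stand-in for a vector-search lookup of similar past questions.
    return "Similar past Q: 'query slow after join' -> A: 'add a cluster key'"

def lookup_features(user_id):
    # Stand-in for a real-time Feature Serving lookup.
    return {"last_query_runtime_s": 42, "largest_table_rows": 1_000_000}

def build_prompt(question, user_id):
    context = retrieve_similar(question)
    features = lookup_features(user_id)
    return (
        f"Context: {context}\n"
        f"User features: {features}\n"
        f"Question: {question}\n"
        "Answer using the context and features above."
    )

prompt = build_prompt("How do I improve my last query?", user_id="u123")
print(prompt)
```

The assembled prompt is what finally goes to the LLM; the library’s value is wiring these steps together so the model always sees current, relevant context.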
Databricks also announced an OSS Model Library to provide a set of open source models optimized for model serving within Databricks. Looking forward, there will be more features to customize these OSS models much more easily than you can today.
An exciting announcement on the AI/ML side is MLflow Evaluation. This lets you do A/B testing on models and easily evaluate which is performing better.
In addition, Databricks is adding MLflow AI Gateway, a way to set up credentialed access to models from Databricks, including rate limiting.
Go watch the demo by Kasey Uhlenhuth to get a clear picture of how many of the newer AI/ML capabilities can fit together for a use case: Vector Search and Model Evaluation demo
Spark Improvements for Python
The final thing I’ll share is that more Python support is expected, including the ability to extend Spark using Python, such as writing custom data sources. This is actually quite nice. I don’t mind writing Scala code, but having to package it up and set up a CI/CD process is not a fun use of time if all your other work is in Python.
A flashier announcement is the English SDK for Apache Spark! I have no idea if I will use this to write production code in the future, but it is impressive. You can use the pyspark_ai library to call df.ai.transform() with a string of plain English describing what you would like to do. You can also plot with English using df.ai.plot(). Several other types of operations are supported as well. Check it out at http://pyspark.ai.
A lot more things were shared at the conference which will be exciting for some and not others. As sessions get released on the Databricks YouTube channel I encourage you to take a look. I have some ideas of topics I can make videos about as more things become public preview or GA, so stay tuned and feel free to comment on what interests you the most.