Databricks Monitoring with System Tables

Monitoring is important, so I’ve covered the topic a few times in the past. I’ve talked about collecting your Spark application logs and Spark metrics, which are a good way to track what is happening and what is going wrong as your code runs. In the video related to this post I focus on a different side of monitoring: the evolving capabilities offered by Databricks System Tables. I have some sample queries and links to help you get started and begin to get value from system tables. This post will need to be updated (I’ll try) as new tables reach public preview status. So let’s discuss the questions I had when I first started researching this feature:
1) What do the Databricks system tables offer me for monitoring?
2) How much does this overlap with the application logs and metrics?

Enabling system tables

This topic has been covered well in a few places including other YouTube channels. I suggest you take a look at which of these two options seems easiest to you and then follow the instructions. If you get stuck, search for a video that walks you through it.

Option 1 – use HTTP calls via the curl command. You will need your Databricks personal access token and your metastore ID (a rough sketch of these calls appears after the two options below).

Option 2 (recommended) – use dbdemos, which comes with an enablement notebook plus sample queries and a dashboard. Just make sure to run the notebook within a UC-enabled workspace, and it will set the tables up against the default metastore. The exact notebook that enables the system tables is _enable_system_tables.
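
If you go with Option 1, here is a rough sketch of the enablement calls, assuming the Unity Catalog system schemas REST API. The workspace host, token, and metastore ID placeholders are yours to fill in, and it’s worth double-checking the endpoint and schema names against the current Databricks docs.

# List the system schemas available for your metastore
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<workspace-host>/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas"

# Enable one schema at a time, e.g. access (repeat for billing, etc.)
curl -s -X PUT -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<workspace-host>/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access"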

What do the Databricks system tables offer me for monitoring?

The detailed application logs and Spark metrics are not covered by system tables, but other useful details are made available. Two really important monitoring capabilities are usage cost tracking and access auditing. The system tables are easy to query: just turn on a small serverless warehouse and start exploring.
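
Before running anything fancier, a quick sanity check like the following will show what you actually have enabled (a minimal sketch; the schemas and tables you see depend on your enablement status):

show schemas in system;
show tables in system.billing;
show tables in system.access;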

Here is my example query for viewing usage and costs (based on list price). The join to system.billing.list_prices matches each usage record to the price that was in effect during that usage window, since list prices can change over time.

with usage_costs as (
  select
    -- u.account_id,
    u.workspace_id,
    u.sku_name,
    u.usage_start_time,
    u.usage_end_time,
    u.usage_date,
    date_format(u.usage_date, 'yyyy-MM') as YearMonth,
    u.usage_unit,
    u.usage_quantity,
    lp.pricing.default as list_price,
    lp.pricing.default * u.usage_quantity as list_cost,
    case
      when u.usage_metadata.job_id is not null then 'job'
      when u.usage_metadata.dlt_pipeline_id is not null then 'dlt'
      when u.usage_metadata.warehouse_id is not null then 'sql'
      when u.usage_metadata.notebook_id is not null then 'notebook'
      else 'other'
    end as usage_type,
    coalesce(
      u.usage_metadata.job_id,
      u.usage_metadata.dlt_pipeline_id,
      u.usage_metadata.warehouse_id,
      u.usage_metadata.notebook_id
    ) as resource_id,
    u.usage_metadata.*
  from system.billing.usage u
    inner join system.billing.list_prices lp
      on u.cloud = lp.cloud
      and u.sku_name = lp.sku_name
      and u.usage_start_time >= lp.price_start_time
      and (u.usage_end_time <= lp.price_end_time or lp.price_end_time is null)
  where u.usage_start_time >= '2024-02-01'
)

select
  usage_type,
  resource_id,
  sku_name,
  cast(usage_date as date) as usage_date,
  sum(usage_quantity) as quantity,
  sum(list_cost) as cost
from usage_costs
group by all
order by cost desc
limit 20

To read more and find additional examples, check out this blog by Ryan Chynoweth.

When it comes to the audit table within the access schema, many of the details exposed line up with what you would collect if you enabled Diagnostic Logging. I haven’t done a full comparison of the two, and it’s important to note that system tables aren’t updated in real time. Below are my example queries for reading some key events from the audit table. I encourage you to check out the list of events in the docs and explore which ones are most important to you.

select
  action_name,
  event_time,
  session_id,
  user_identity.email,
  service_name,
  request_params,
  response,
  event_id
from system.access.audit
where action_name in (
    'executeAdhocQuery', 'submitRun',
    'create', 'createTable', 'delete', 'update', 'edit', 'attachNotebook', 'detachNotebook', 'fileCreate', 'startUpdate',
    'createResult', 'createQuery', 'runStart', 'executeSavedQuery', 'runSucceeded', 'updateQuery', 'runTriggered', 'runNow', 'createNotebook', 'restartResult', 'startResult'
  )
  and event_time between '2024-02-03' and '2024-02-05'
order by event_time desc
limit 40;


select
  action_name,
  user_identity.email,
  service_name,
  response.status_code,
  max(event_time) as last_event,
  count(1) as row_count
from system.access.audit
where action_name in (
    'executeAdhocQuery', 'submitRun', 'create', 'createTable', 'delete', 'update', 'edit', 'attachNotebook', 'detachNotebook', 'fileCreate', 'startUpdate',
    'createResult', 'createQuery', 'runStart', 'executeSavedQuery', 'runSucceeded', 'updateQuery', 'runTriggered', 'runNow', 'createNotebook', 'restartResult', 'startResult'
  )
group by all
order by row_count desc
limit 40;

For a detailed blog about how this can help from a security monitoring perspective, check out this Databricks blog – Improve Lakehouse Security Monitoring using System Tables in Databricks Unity Catalog.

How much does this overlap with the application logs and metrics?

There is a high level of detail in the application logs and Spark metrics, so there are good reasons why these tables don’t collect all of that same information. The Spark metrics side, which shows cluster resource usage and Spark application metrics, doesn’t appear to be covered by system tables right now. Where the Log4j output is useful, I think some use cases can be replaced by system tables as long as you don’t need answers in near real time. If you want to track job runs, the audit table has many events around triggered jobs, and the billing usage table tracks the job ID, pipeline ID, and related costs. However, it may be hard to piece together which jobs failed, what error messages were raised, and the timing of the various tasks involved. If you add custom log messages to your Spark notebooks, you can track exactly what you need and aggregate it later.
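
To make that concrete, here is a minimal sketch of tracking job runs from the audit table. I’m assuming the jobs service emits action names like runStart, runSucceeded, and runFailed, and that request_params (a string map in the audit table) carries keys like jobId and runId; verify the exact names against the audit log event reference before relying on them.

select
  event_time,
  action_name,
  request_params['jobId'] as job_id,  -- assumed key name, verify in the docs
  request_params['runId'] as run_id,  -- assumed key name, verify in the docs
  response.status_code
from system.access.audit
where service_name = 'jobs'
  and action_name in ('runStart', 'runSucceeded', 'runFailed')  -- runFailed assumed
order by event_time desc
limit 40;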

Summary

You can expect more system tables in the future, so keep an eye out for what else becomes available. There is a lot that can be monitored today, but I expect even more capabilities over time. Of course, you can still turn on the option to save cluster logs to DBFS or add an integration like Log Analytics to track even more information about your Spark clusters.
