Let’s walk through the fundamentals of using Kusto Query Language (KQL) to query your logs in Azure Log Analytics. Check out the video to see it in action and keep reading for more code examples and written steps to run queries. This covers a few basics as well as a complex query used to parse JSON when monitoring Spark Structured Streaming.
Azure Monitor is a suite of tools in Azure to cover your monitoring needs. One facet of this is Log Analytics — a place to collect and query logs with a SQL-like language called Kusto Query Language (KQL). I use this mostly with my Spark logs from Azure Databricks but these concepts can be applied to other types of logs as well.
Getting Started in the UI
To start, go to your Log Analytics Workspace then select Logs from the left pane. Then you can browse which tables are available and select one to start a new query. Since this tutorial is part of my Spark Monitoring series I will demonstrate commands with the custom logs from Azure Databricks. If you want to configure this in your environment, check out my video Azure Databricks Monitoring with Log Analytics which explains more and includes steps to set it up.
Basics of KQL
You will start your query with the log name on the first line. For this example use SparkLoggingEvent_CL which contains all the Log4j output from the Databricks cluster. Then you can add a project statement that indicates which field names you want to use. This is not required but will save you time scrolling through all the columns in the log which are not helpful in the query result. The | (pipe) character is used to separate parts of your KQL statement.
SparkLoggingEvent_CL | project TimeGenerated, clusterId_s, Message, Level, logger_name_s, clusterNames_s
The next important part of a KQL statement is the filter, which you add using the where clause. You can filter using simple conditions such as equals or with more complex operations like has.
SparkLoggingEvent_CL | where Level == "ERROR" or logger_name_s == 'pyspark_logger' | project TimeGenerated, clusterId_s, Message, Level, logger_name_s, clusterNames_s
Another common feature of KQL is the order by clause for sorting the output.
SparkLoggingEvent_CL | where Level == "ERROR" or logger_name_s == 'pyspark_logger' | project TimeGenerated, clusterId_s, Message, Level, logger_name_s, clusterNames_s | order by TimeGenerated
This covers some of the basics that you will use. Keep reading for a more advanced example for which I use when monitoring Spark Structured Streaming.
Logging to Log Analytics from PySpark
To write your own application log messages when using PySpark you need to retrieve the Log4j log manager instead of using the normal python logging library. You can do that with the code example below.
For a more complex example of how you may use the Log4j logger, the example below includes custom error logging in the except clause of your Python code. One of the few times that catching the base exception makes sense is when you want to be sure to log the errors to Log Analytics.
Advanced Example: Parse Query Progress Event
KQL has many functions and capabilities to support complex use cases. One example that is helpful for monitoring Apache Spark is to parse the Structured Streaming events for the Query Progress details. This requires using two functions on the Message field: replace and parse_json. Add a new column to your query with the following syntax.
MessageJson=parse_json(replace('Streaming query made progress: ', '', Message))
To filter down to only the relevant messages you also add a where clause.
| where Message has "Streaming query" and logger_name_s == "org.apache.spark.sql.execution.streaming.MicroBatchExecution"
Then you can select certain attributes from the JSON object by adding another project statement.
| project TimeGenerated, MessageJson.batchId, MessageJson.numInputRows, MessageJson.inputRowsPerSecond, MessageJson.processedRowsPerSecond
The complete query to view these results would be:
SparkLoggingEvent_CL | project TimeGenerated, clusterId_s, MessageJson=parse_json(replace('Streaming query made progress: ', '', Message)), Message, Level, logger_name_s, clusterName_s | where Message has "Streaming query" and logger_name_s == "org.apache.spark.sql.execution.streaming.MicroBatchExecution" | project TimeGenerated, MessageJson.batchId, MessageJson.numInputRows, MessageJson.inputRowsPerSecond, MessageJson.processedRowsPerSecond | order by TimeGenerated
Alerts on Log Analytics Queries
Log analytics queries and KQL are useful for more than just on-demand queries of your logs. You can setup alerts from your query by selecting +New alert rule.
Within the alert page you can select the condition and fine tune your query and settings.
You can choose for the alert to be Based on either Number of results or Metric measurement. The Number of results can be Greater than, Less than, Equal to, or similar option. In the Threshold value you choose what number to use in the condition. In this example you can set Number of results to be Greater than 10.
In addition the Evaluate based on section allows you to configure the Period (time window for the query) and the Frequency (how often to check if the threshold is met).
The simplest action to configure is email, but you can add other Actions by selecting Add action group. See the image below for which actions are available.
Finally you alert details then select Create alert rule.
This post and video includes many of the components that I use frequently as a Data Engineer. I recommend starting with a question about your application or what is included in the logs then trying to build your own query from there. When you need to find new functions or commands to answer your question, the KQL reference guide is a great resource. If you want to go through more KQL commands I recommend the KQL Tutorial as the next step.