Intro to Azure Stream Analytics

Why Azure Stream Analytics?

Real-time data processing is becoming more common in companies of all sizes. Use cases range from simple stream ingestion to complex machine learning pipelines. If you need to get started with streaming in Azure, Stream Analytics gives you a simple way to get up and running. Most of my streaming projects involve Apache Kafka and Spark, which can take a lot of setup (or at least bringing in additional vendors to simplify the experience). Those technologies are great, especially for challenging streaming pipelines, but if your data platform is within Azure you should consider whether Stream Analytics will meet your needs.

Fully managed streaming

A Stream Analytics Job can be created easily through the UI or by using the APIs. You do not need to provision a cluster or make many configuration decisions just to run a stream processing job. A Stream Analytics Job is fully managed and highly reliable. You simply choose the number of Streaming Units to set the amount of CPU and memory used by your job.

SQL-like language

If you are familiar with SQL, you can jump into creating the core of a Stream Analytics Job without much extra effort. You will need to have configured an Input, which you specify in the FROM clause. You will also want an Output, which you specify in the INTO clause.

SELECT
    AVG(temp) AS avgTemp
INTO [datalakeoutput]
FROM [stream-demo-1]
GROUP BY
    TumblingWindow(Duration(minute, 2))
HAVING AVG(temp) > 26

With streaming aggregations it’s typical to specify a window to limit the set of data used in the calculation. The basic idea is represented well by a Tumbling Window: you specify a duration, and at the end of that period of time the calculation happens for any events in that segment. For example, at the end of a 2 minute window the average is taken over only the events within those 2 minutes and the result is output. Then after another 2 minutes the average is output for the next 2 minute segment, without any overlap with the prior window. Often with streaming calculations you want a rolling aggregate, so you would choose an alternate window type such as a hopping window. When you get started with streaming aggregations, I recommend reading up on all the windowing options so you can make the right decision.
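As a sketch of the rolling-aggregate alternative, the same average could be computed with a hopping window. This assumes the same input, output, and temp field as the earlier query; treat it as illustrative rather than a tested job:

```sql
-- Sketch: a rolling 2-minute average emitted every 1 minute.
-- HoppingWindow(timeunit, windowsize, hopsize); consecutive windows overlap,
-- so each event can contribute to more than one result.
SELECT
    AVG(temp) AS avgTemp
INTO [datalakeoutput]
FROM [stream-demo-1]
GROUP BY
    HoppingWindow(minute, 2, 1)
```

Unlike the tumbling window above, results arrive every minute instead of every two, and adjacent windows share events.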

Easy to scale

While Streaming Units are a bit abstract, they also make scaling simple. Start small, then increase as needed. Though it’s not a simple checkbox, there is an option to autoscale using Azure Automation.

When is it a good choice?

Source is Event Hub, IoT Hub, or Azure Storage

Azure Stream Analytics has limited source options, but the sources are the most common ways to natively stream data in Azure. When you have data flowing into an Event Hub or IoT Hub and you need to go to a supported output, Stream Analytics gives you an easy option.

List of inputs: Event Hub, IoT Hub, Blob Storage / ADLS Gen2

What do you do if you want to stream data from a SQL Server database? You would need a process that sends the new and changed events to Event Hubs (or perhaps Azure Storage). This should be possible with existing tools like Apache Kafka Connect.
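As a rough sketch of that approach, a Debezium SQL Server connector running on Kafka Connect (pointed at the Event Hubs Kafka-compatible endpoint) could capture the changes. The property names below follow recent Debezium releases, the values are placeholders, and the config is abridged (schema history settings are omitted), so check the Debezium docs before using it:

```json
{
  "name": "sqlserver-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "my-sql-server",
    "database.port": "1433",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.names": "SalesDb",
    "table.include.list": "dbo.Orders",
    "topic.prefix": "sales"
  }
}
```

Each topic the connector produces would map to an Event Hub that Stream Analytics can then read as an input.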

Basic transformations

Stream Analytics does have some powerful functions, but since the job is written entirely in SQL, I prefer it for fairly basic transformations. If multiple steps are needed, I recommend separate Stream Analytics jobs with Event Hubs used to pass the data through. The main reason is that once you have completed a major step that adds significant value to the data, some other job or service will often want to consume the results of that step before additional transformation happens. When you need to join in additional data, the reference input is an option, but it only supports a database or storage.
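To illustrate the reference-input option, here is a hedged sketch of enriching the stream with slow-changing lookup data. The input names and fields ([device-ref], deviceId, location) are hypothetical:

```sql
-- Sketch: join streaming events to a reference input.
-- [device-ref] would be a reference input backed by blob storage or
-- a SQL Database; reference joins do not require a time bound the way
-- stream-to-stream joins do.
SELECT
    s.deviceId,
    r.location,
    s.temp
INTO [datalakeoutput]
FROM [stream-demo-1] s
JOIN [device-ref] r
    ON s.deviceId = r.deviceId
```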

Or integrated function

However, it can handle some advanced use cases with its custom transformations or outputs. There are some powerful options to add your own functions to the mix. I have not explored these enough to provide guidance, but just want to call out that there are options to integrate more complex processing logic.

Supported functions: Azure ML Service, Javascript UDF, Javascript UDA, Azure ML Studio
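To give a feel for the JavaScript UDF option, here is a minimal sketch. In Stream Analytics the function body must be named main; the registered function name (here "toFahrenheit") and the field names are my own hypothetical examples:

```javascript
// Sketch of a Stream Analytics JavaScript UDF.
// The function must be named main; it is registered in the job under a
// chosen alias and invoked from the query via the udf. prefix.
function main(celsius) {
    // Pass nulls through rather than producing NaN.
    if (celsius === null) {
        return null;
    }
    return celsius * 9 / 5 + 32;
}
```

In the job query it would then be called like `SELECT udf.toFahrenheit(temp) AS tempF FROM [stream-demo-1]`.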

Output is supported

Perhaps it is obvious, but you want the destination to be supported. If you are focused on building in Azure, the list of outputs is actually pretty good. The ability to call an Azure Function is especially interesting for more complicated use cases.

List of supported outputs

Power BI real-time dashboard

The ability to stream to Power BI is something unique that Stream Analytics enables. Check out this video by Curbal to see it in action and for quick tips on the limitations, which are covered in the docs.

What’s the catch?

So what is the downside of Azure Stream Analytics? Why am I not throwing Spark Structured Streaming out the door and only using this?

Personally, I need to work with it more to determine if I like developing these types of solutions. My experience is that the simple things work fine, but if a join or some setup is not right, it is difficult to understand the cause of the problem. The development environment on the Azure Portal doesn’t seem to support all the auto-complete capabilities you would want if working in this every day. You can develop with Visual Studio instead, though. Others may have different experiences, so let me call out two less subjective points.

Limited sources

The main downside I see is the limited sources and lack of ability to join streaming datasets together. If you don’t have other reasons to use one of the supported sources, I would probably look at other options.

Azure only

Related to limited sources, this is built to run in Azure and does not integrate directly with other systems. Most likely that indicates you won’t be using this for every streaming use case if you have a large platform. However, there are teams and even entire companies that only leverage Azure.

Wrap it up

If you want to hear a bit more and see it hands-on, watch the video included at the top of this post. Azure Stream Analytics is a useful service and is easy to get started with. If it supports your use case, then you can get things running quickly and scale up or down as needed. If you think I got it wrong, please leave a comment. I have a lot to learn on Stream Analytics since I currently only use it when it’s the perfect fit for the use case.

Don’t take just my word for it; a few references (though based on an older version of Azure Stream Analytics):

  1. Hey Dustin,

    Thanks a lot for having taken the time to write this post. It’s really appreciated by the team.
    It’s a very good summary of our service, and you make a lot of fair comments that we are working to fix.

    Now I have some corrections to make, and obviously I will be biased 😉

    We don’t offer a SQL-like language. We do offer SQL. This is definitely pedantic, but also close to my heart. SQL has flavors and we offer our own, that’s fair. But it is SQL. What it is though, is a T-SQL like language. We try to emulate T-SQL as much as possible, but it’s not T-SQL (yet).

    For CDC on top of SQL, I would recommend Debezium. Davide has a good write-up on the topic:

    I disagree that you should limit yourself to simple transformations in ASA. You can do very complex ones as SQL is very expressive for those, all the more with CTE/WITH query steps. I’m amazed by what our users manage to compute. We have some examples here:

    And no need to split your job into multiple ones: you can output any intermediary step in a single job.

    So complex (business-wise) analytics queries: yes (stream processing, real time dashboarding, stream analytics, real time operational reporting…). Where we’re less comfortable will be around enterprise ETL scenarios, where you need code re-use, metadata-driven development, central repositories, and things like that. You can build all that on top of ASA, but it’s not in the box. But streaming ETL is a big yes, as long as you have narrow pipelines on a specific domain and where code re-use is not that necessary.

    In the supported output we just added PostgreSQL and ADX, just FYI 😉

    For the catch: I personally recommend developing in VSCode always, potentially fully local at first. In my opinion the portal is not the best developer experience. Plus you get unit testing with the npm package.

    Thanks again for your article!

    • Thanks Florian for all this helpful information and perspective. The unit testing ability in VS Code is a great point. The code reuse and testability is what led me to suggest less complex transformations.
