Azure Synapse Analytics: What the WHAT?

Azure Synapse Analytics just went into public preview, so you can now try its capabilities in your own subscription. Here is a quick introduction to what it is and why it matters.

The problem

Cloud analytics architectures are complicated. To build a modern analytics environment with a data lake and a data warehouse, you need multiple services, each accessed from a different place in the portal. Let’s take a look at those services:

Azure Analytics services

  • Databricks or HDInsight
  • Data Lake Storage (Gen2) or Blob Storage
  • Data Factory
  • Azure SQL
  • Azure SQL DW

AWS Analytics Services

  • Databricks or Elastic Map Reduce (EMR)
  • S3
  • Glue
  • Athena
  • Aurora
  • Redshift

Selecting and configuring the right services for your workload is confusing, and getting one service to connect to another is a challenge, though great solutions can be built by using these tools well. The ultimate problem: this is not a unified solution that lets a data scientist or data engineer be productive!

Azure’s solution

Synapse Analytics

Azure Synapse Analytics brings together the core capabilities into a single experience. The Synapse platform includes many of these capabilities directly while others are well integrated so the user experience is much nicer than the alternatives. According to Microsoft, Synapse offers some ideal traits for an analytics platform: limitless scale, unified offering, integrated security, and serverless data lake querying.

Key Features
  • Spark for data processing and data lake queries (plus .NET Spark)
  • Serverless SQL for easily querying data lake storage
  • SQL DW for high-performance analytic queries using an MPP database
  • Notebooks as a lightweight, easy-to-share development UI
  • Synapse Pipelines for no-code or low-code data ingestion
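
To give a feel for the serverless SQL bullet above, here is a minimal sketch of the kind of query it runs against files in the lake. It is a small Python helper that only builds the T-SQL text; the function name and the storage URL are hypothetical placeholders, and the sketch assumes Parquet files read through OPENROWSET.

```python
def build_openrowset_query(storage_url: str, limit: int = 10) -> str:
    """Build a T-SQL query that samples Parquet files in the data lake.

    The helper name and defaults are illustrative only; the query text
    relies on OPENROWSET with BULK and FORMAT = 'PARQUET'.
    """
    if "'" in storage_url:
        # Refuse quotes so the URL cannot break out of the string literal.
        raise ValueError("storage_url must not contain single quotes")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        f"SELECT TOP {int(limit)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result]"
    )


# Hypothetical storage account, container, and path; replace with your own.
EXAMPLE_URL = "https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet"

if __name__ == "__main__":
    print(build_openrowset_query(EXAMPLE_URL))
```

The generated text can be pasted into a SQL script in Synapse Studio. Nothing is provisioned ahead of time, which is what makes the serverless option convenient for ad hoc exploration.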

The most interesting change from my perspective is the ability to query the data lake without running a full-blown Azure Databricks environment separate from the DW. For heavy-lifting Spark workloads I expect to keep using Azure Databricks for a while, but I will be watching Synapse Spark pools, since that is where I expect Microsoft to build even tighter integration. Serverless SQL, also called SQL on Demand, gives us an alternative to Spark: we can query Azure Storage and pay per query based on the amount of data processed (similar to the Athena or BigQuery pricing model).
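
The pay-per-query model is easy to reason about with a little arithmetic: cost is roughly the data processed times a rate per terabyte. The sketch below is illustrative only. The function name is made up, the rate is a parameter you would fill in from the current pricing page, and the assumption that a terabyte means 1024^4 bytes may not match how your bill is calculated.

```python
BYTES_PER_TB = 1024 ** 4  # Assumption: binary terabytes; check your billing terms.


def estimate_query_cost(bytes_processed: int, price_per_tb: float) -> float:
    """Estimate the charge for one query under a pay-per-data-processed model.

    price_per_tb is a caller-supplied rate, not an official price.
    """
    if bytes_processed < 0 or price_per_tb < 0:
        raise ValueError("bytes_processed and price_per_tb must be non-negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    half_tb = 512 * 1024 ** 3  # a query that scans 512 GiB
    # 5.0 is a placeholder rate used only to show the arithmetic.
    print(estimate_query_cost(half_tb, price_per_tb=5.0))
```

The practical takeaway is that scanning less data costs less, so columnar formats and partitioned folders pay for themselves quickly under this model.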

There is a lot more information out there, and you can get hands-on with Synapse in your own Azure account (but watch the costs and pricing closely). The Azure Synapse Analytics documentation covers the details.

Synapse Link

Azure Synapse Link is now available for real-time analytics. Synapse Link is “a hybrid transactional and analytical processing (HTAP) capability using Cosmos DB and Azure Synapse”. What that means is that Synapse Link enables near real-time analytics (within 30 seconds) on your Cosmos DB data without you writing ETL. To get started you can configure this from your Cosmos DB database. Once enabled, you will see a Cosmos DB section within the “Data” pane of Synapse Studio. You can query the latest data via a notebook or visualize it in Power BI. This is a really cool feature that you can read more about on the Microsoft Dev Blog.
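
As a rough sketch of the notebook route, the helper below reads a Cosmos DB container's analytical store into a Spark DataFrame. It assumes the Synapse Spark connector's "cosmos.olap" format and its linked-service and container options; the function names, the linked service name, and the container name are hypothetical. It only runs end to end inside a Synapse notebook, where a `spark` session is already defined.

```python
from typing import Any, Dict


def cosmos_olap_options(linked_service: str, container: str) -> Dict[str, str]:
    """Collect the reader options for the Cosmos DB analytical store."""
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }


def read_analytical_store(spark: Any, linked_service: str, container: str) -> Any:
    """Load the analytical store of a Cosmos DB container as a DataFrame.

    `spark` is the session a Synapse notebook provides; it is passed in
    so the helper has no hidden dependency on a global.
    """
    return (
        spark.read.format("cosmos.olap")
        .options(**cosmos_olap_options(linked_service, container))
        .load()
    )


# In a Synapse notebook (names below are placeholders):
# df = read_analytical_store(spark, "CosmosDbLinkedService", "orders")
# df.limit(10).show()
```

Because the read goes against the analytical store rather than the transactional store, a query like this should not compete with the application's own requests, which is the point of the HTAP design.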

    • It will be good to see how well it lives up to the challenge. I have worked with it enough to be confident it will become a strong competitor for some production workloads, but it is still in the early stages. Granted, the last time I used Athena and Glue, they both still felt immature.
