Spark Structured Streaming is very powerful for streaming data pipelines, but it can get complicated for certain use cases. One of those use cases is joining streaming data to other data that changes over time. Joining two streaming datasets works well if you can set watermarks and time constraints to limit how much data is kept in state. If one side of the join doesn't change often, you can instead use a stream-static (stream-to-batch) join, which doesn't require a Spark state store at all.
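As a point of reference, here is a minimal sketch of the stream-stream variant with watermarks and a time constraint. The rate source, column names, and intervals are illustrative, not part of the original pipeline:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

# Two hypothetical event streams; the rate source just stands in for real inputs.
views = (spark.readStream.format("rate").load()
         .select(F.col("timestamp").alias("view_time"),
                 (F.col("value") % 100).alias("user_id")))

clicks = (spark.readStream.format("rate").load()
          .select(F.col("timestamp").alias("click_time"),
                  (F.col("value") % 100).alias("user_id")))

# Watermarks plus a time constraint in the join condition let Spark discard
# old rows from the state store instead of keeping them forever.
joined = (views.withWatermark("view_time", "10 minutes").alias("v")
          .join(clicks.withWatermark("click_time", "20 minutes").alias("c"),
                F.expr("""
                    v.user_id = c.user_id AND
                    c.click_time BETWEEN v.view_time
                                     AND v.view_time + interval 15 minutes
                """)))
```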
Use case
If you are building data pipelines for a video streaming site, you need to consume analytics about video views in real time. Assume you also need to look up additional user attributes, such as the subscription level. That information changes very infrequently, but once a change happens it's important to start tying usage to the correct subscription right away. So you need to find the best way to look up that information in Apache Spark.
The problem
When joining a Spark stream to a batch dataset for a lookup, most batch sources are read once and do not refresh while the query runs. If you update the batch data once a night, you need to restart the streaming query for the lookup data (the static dataset) to be picked up. That isn't too bad if it happens once a day, but when updates trickle in throughout the day we need a better solution.
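Here is a rough sketch of that stream-static join with a plain Parquet lookup. The Kafka topic, paths, and schema are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Streaming video-view events (topic and parsing are simplified; real events
# would typically be JSON that you parse into columns).
views = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "video_views")
         .load()
         .selectExpr("CAST(value AS STRING) AS user_id"))

# Static lookup read from Parquet. This snapshot is not refreshed while the
# query runs, so nightly updates to these files require a query restart.
users = spark.read.parquet("/data/users")  # user_id, subscription_level

enriched = views.join(users, "user_id", "left")

query = (enriched.writeStream
         .format("console")
         .outputMode("append")
         .start())
```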
The solution
With the Delta Lake format, the batch side of the join is refreshed without restarting the stream. The video in this post shows an example of this in action. Delta Lake supports updates via the MERGE statement, so you keep the data up to date in your file system, and Spark also picks up the new version of the table for the static DataFrame used in the join.
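A minimal sketch of that pattern using the delta-spark Python API, assuming the lookup lives at a Delta path and the incoming subscription changes arrive as JSON files (paths and column names are illustrative):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-lookup-merge").getOrCreate()

# Read the lookup as a static Delta DataFrame and use it as the batch side
# of the stream-static join; new versions of the table are picked up without
# restarting the streaming query.
users = spark.read.format("delta").load("/data/users_delta")

# Incoming subscription changes (hypothetical landing path and schema).
updates = spark.read.json("/data/incoming/subscription_changes")

# Upsert the changes into the lookup table with MERGE.
target = DeltaTable.forPath(spark, "/data/users_delta")
(target.alias("t")
 .merge(updates.alias("s"), "t.user_id = s.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```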
The remaining problem
The downside of this solution is that if the data to update your Delta Lake table arrives a little late, your streaming query has already produced records with the old lookup values. If you need very precise joins, you will need to correct these infrequent mismatches somehow, for example with a separate cleanup job.
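One way such a cleanup job could look, assuming the enriched output is itself a Delta table with hypothetical view_id, event_date, and subscription_level columns:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lookup-correction").getOrCreate()

# Current (correct) lookup values and the previously written output.
users = spark.read.format("delta").load("/data/users_delta")
output = DeltaTable.forPath(spark, "/data/enriched_views")

# Recompute the expected subscription level for recent output rows.
recent = (output.toDF()
          .filter(F.col("event_date") >= F.date_sub(F.current_date(), 1))
          .drop("subscription_level")
          .join(users.select("user_id", "subscription_level"),
                "user_id", "left"))

# Overwrite any recent rows where the stream joined against a stale value.
(output.alias("t")
 .merge(recent.alias("s"),
        "t.view_id = s.view_id AND "
        "t.event_date >= date_sub(current_date(), 1)")
 .whenMatchedUpdate(set={"subscription_level": "s.subscription_level"})
 .execute())
```

Scoping the MERGE to a recent window keeps the correction cheap, since only the rows the late lookup data could have affected are rewritten.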