Spark Structured Streaming is very powerful for streaming data pipelines, but it can get complicated for certain use cases. One of those use cases is joining streaming data to other data that changes over time. Joining two streaming datasets works well if you can set watermarks and time constraints to limit how much data is kept in state. If one side of the join doesn't change often, you can instead use a stream-static (stream-to-batch) join, which doesn't require a Spark state store at all.
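As a point of reference, here is a minimal sketch of the stream-stream variant with watermarks and a time constraint. The rate source, column names, and intervals are illustrative, not part of the original pipeline:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

# Two hypothetical event streams; the rate source just stands in for real inputs.
views = (spark.readStream.format("rate").load()
         .select(F.col("timestamp").alias("view_time"),
                 (F.col("value") % 100).alias("user_id")))

clicks = (spark.readStream.format("rate").load()
          .select(F.col("timestamp").alias("click_time"),
                  (F.col("value") % 100).alias("user_id")))

# Watermarks plus a time constraint in the join condition let Spark discard
# old rows from the state store instead of keeping them forever.
joined = (views.withWatermark("view_time", "10 minutes").alias("v")
          .join(clicks.withWatermark("click_time", "20 minutes").alias("c"),
                F.expr("""
                    v.user_id = c.user_id AND
                    c.click_time BETWEEN v.view_time
                                     AND v.view_time + interval 15 minutes
                """)))
```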
Use case
If you are building data pipelines for a video streaming site, you need to consume analytics about video views in real time. Assume you also need to look up additional user attributes, such as the subscription level. That information changes very infrequently, but once a change happens it's important to start tying usage to the correct subscription right away. So you need to find the best way to look up that information in Apache Spark.
The problem
When joining a Spark stream to a batch dataset for a lookup, most batch sources are read once and do not refresh while the query runs. If you update the batch data once a night, you need to restart the streaming query for the lookup data (the static dataset) to be picked up. That isn't too bad if it happens once a day, but when updates trickle in throughout the day we need a better solution.
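Here is a rough sketch of that stream-static join with a plain Parquet lookup. The Kafka topic, paths, and schema are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Streaming video-view events (topic and parsing are simplified; real events
# would typically be JSON that you parse into columns).
views = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "video_views")
         .load()
         .selectExpr("CAST(value AS STRING) AS user_id"))

# Static lookup read from Parquet. This snapshot is not refreshed while the
# query runs, so nightly updates to these files require a query restart.
users = spark.read.parquet("/data/users")  # user_id, subscription_level

enriched = views.join(users, "user_id", "left")

query = (enriched.writeStream
         .format("console")
         .outputMode("append")
         .start())
```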
The solution
With the Delta Lake format, the batch side of the join is refreshed without restarting the stream. The video in this post shows an example of this in action. Delta Lake supports updates via the MERGE statement, so you keep the data up to date in your file system, and Spark also picks up the new version of the table for the static DataFrame used in the join.
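A minimal sketch of that pattern using the delta-spark Python API, assuming the lookup lives at a Delta path and the incoming subscription changes arrive as JSON files (paths and column names are illustrative):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-lookup-merge").getOrCreate()

# Read the lookup as a static Delta DataFrame and use it as the batch side
# of the stream-static join; new versions of the table are picked up without
# restarting the streaming query.
users = spark.read.format("delta").load("/data/users_delta")

# Incoming subscription changes (hypothetical landing path and schema).
updates = spark.read.json("/data/incoming/subscription_changes")

# Upsert the changes into the lookup table with MERGE.
target = DeltaTable.forPath(spark, "/data/users_delta")
(target.alias("t")
 .merge(updates.alias("s"), "t.user_id = s.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```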
The remaining problem
The downside of this solution is that if the data to update your Delta Lake table arrives a little late, your streaming query has already produced records with the old lookup values. If you need very precise joins, you will need to correct these infrequent mismatches somehow, for example with a separate cleanup job.
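One way such a cleanup job could look, assuming the enriched output is itself a Delta table with hypothetical view_id, event_date, and subscription_level columns:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lookup-correction").getOrCreate()

# Current (correct) lookup values and the previously written output.
users = spark.read.format("delta").load("/data/users_delta")
output = DeltaTable.forPath(spark, "/data/enriched_views")

# Recompute the expected subscription level for recent output rows.
recent = (output.toDF()
          .filter(F.col("event_date") >= F.date_sub(F.current_date(), 1))
          .drop("subscription_level")
          .join(users.select("user_id", "subscription_level"),
                "user_id", "left"))

# Overwrite any recent rows where the stream joined against a stale value.
(output.alias("t")
 .merge(recent.alias("s"),
        "t.view_id = s.view_id AND "
        "t.event_date >= date_sub(current_date(), 1)")
 .whenMatchedUpdate(set={"subscription_level": "s.subscription_level"})
 .execute())
```

Scoping the MERGE to a recent window keeps the correction cheap, since only the rows the late lookup data could have affected are rewritten.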