In this video, I introduce Apache Spark with the Python language, often referred to as PySpark. We’ll walk through a quick demo on Azure Synapse Analytics, an integrated analytics platform within the Microsoft Azure cloud. This short demo is meant for those who are curious about PySpark or just want a peek at Spark in Azure Synapse. If you are new to Apache Spark, just know that it is a popular framework among data engineers and can be run in a variety of environments. It is popular because it enables distributed data processing with a relatively simple API. If you want to see examples in Scala or C#, you can check out one of my other videos where I walk through a similar demo.
You can follow along to build a Spark data load that reads linked sample data, transforms it, joins it to a lookup table, and saves the result in Delta Lake format to your Azure Data Lake Storage Gen2 account. Please be aware that you will incur costs by following this example. To keep costs minimal, make the Spark pool small and keep the default 15-minute auto-terminate setting.
Demo code breakdown
The video includes commentary and shows the notebook in action. For those interested, here is a breakdown of each section of the notebook with a few written comments.
First, import the required libraries and set the path variables. Line 4 should work if you have linked the dataset (see video for how to do that). The other paths should be modified to your own linked ADLS Gen2 account. The pattern is abfss://<container>@<storage_account>.dfs.core.windows.net/<your/custom/path>. Finally, lines 14-20 define a schema for the Spark DataFrame that will be created when reading in the lookup data.
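As a rough sketch of the path setup described above (not the notebook's exact code), the helper below assembles an abfss path from its parts. The container, storage account, and folder names are hypothetical placeholders, and the lookup schema is written as a DDL string using the column names I expect in the Taxi Zone Lookup file; adjust both to match your own account and data.

```python
def abfss_path(container: str, storage_account: str, path: str) -> str:
    """Build an ADLS Gen2 URI: abfss://<container>@<storage_account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical names; replace with your own container, account, and folders.
lookup_path = abfss_path("demo", "mydatalake", "lookup/taxi_zone_lookup.csv")
output_path = abfss_path("demo", "mydatalake", "/delta/taxi_trips")

# Assumed lookup columns, expressed as a DDL string.
# spark.read.schema(...) accepts either a DDL string like this or a StructType.
zone_schema_ddl = "LocationID INT, Borough STRING, Zone STRING, service_zone STRING"

print(lookup_path)
print(output_path)
```

Defining the schema up front means Spark does not have to scan the file to infer column types.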
Next, use the SparkSession, which is automatically available in the notebook as the variable “spark”. This statement shows how to read a CSV (comma-separated values) file. This file is not available in the linked dataset. To run this part, download the Taxi Zone Lookup Table file from the NYC Trip Data site and upload it to your own storage account.
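A minimal sketch of a CSV read with an explicit schema might look like the following. The function name read_zone_lookup is my own for illustration, and it assumes the file has a header row; the notebook's statement may be shaped differently.

```python
def read_zone_lookup(spark, path, schema):
    """Read the lookup CSV with an explicit schema instead of inferring types."""
    return (
        spark.read
        .option("header", "true")  # assumes the first row holds column names
        .schema(schema)            # a DDL string or a StructType
        .csv(path)
    )


# In a Synapse notebook, where `spark` already exists, a call could look like:
# zones_df = read_zone_lookup(
#     spark,
#     "abfss://<container>@<storage_account>.dfs.core.windows.net/<your/custom/path>",
#     "LocationID INT, Borough STRING, Zone STRING, service_zone STRING",
# )
```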
For the main load, you read Parquet files. Parquet is a columnar format that is well suited to big data processing. In this case, you are reading a portion of the data from the linked blob storage into your own Azure Data Lake Storage Gen2 (ADLS) account. This code shows a couple of options for applying transformations. Option B performs better than Option A, but for most cases I recommend choosing whichever syntax you prefer. I typically go with Option A for better readability.
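To give a feel for what "a couple of options" can mean in PySpark, here is a sketch of the same read-filter-select logic written two ways: chained DataFrame methods and a Spark SQL query over a temporary view. These are illustrative styles and are not necessarily the Option A and Option B from the notebook. The column names (puYear, puMonth, tpepPickupDateTime, and so on) and the filter values are assumptions about the taxi sample data, so check them against your linked dataset.

```python
def load_trips_dataframe_api(spark, source_path):
    """Style 1: chain DataFrame methods, using SQL expression strings."""
    return (
        spark.read.parquet(source_path)
        # Read only a portion of the data (assumed partition columns).
        .filter("puYear = 2019 AND puMonth = 1")
        # Keep and rename a few assumed columns.
        .selectExpr(
            "tpepPickupDateTime AS pickup_time",
            "puLocationId AS pickup_location_id",
            "tripDistance AS trip_distance",
            "totalAmount AS total_amount",
        )
    )


def load_trips_sql(spark, source_path):
    """Style 2: register a temp view and express the same logic in Spark SQL."""
    spark.read.parquet(source_path).createOrReplaceTempView("trips_raw")
    return spark.sql(
        """
        SELECT tpepPickupDateTime AS pickup_time,
               puLocationId       AS pickup_location_id,
               tripDistance       AS trip_distance,
               totalAmount        AS total_amount
        FROM trips_raw
        WHERE puYear = 2019 AND puMonth = 1
        """
    )
```

Both functions take the notebook's spark session as an argument, so either one can be called as load_trips_dataframe_api(spark, path) or load_trips_sql(spark, path).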
Then, in lines 25 and 26, we join the two datasets together. After a few column renames, we write out the data in Delta format. Delta Lake is a newer storage format for use with Apache Spark and other big data systems that adds a transaction log on top of Parquet files. It is well supported on Azure Databricks and Azure Synapse Analytics.
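The join, rename, and write steps could be sketched roughly as below. The function name, the join key pickup_location_id, the renamed column names, the left join, and the overwrite mode are all assumptions for illustration rather than the notebook's exact choices.

```python
def join_rename_write(trips_df, zones_df, output_path):
    """Join trips to the zone lookup, tidy column names, and write Delta output."""
    joined = (
        trips_df.join(
            zones_df,
            trips_df["pickup_location_id"] == zones_df["LocationID"],
            "left",  # keep trips even when no matching zone exists
        )
        .drop("LocationID")  # redundant after the join
        .withColumnRenamed("Borough", "pickup_borough")
        .withColumnRenamed("Zone", "pickup_zone")
    )
    # "overwrite" replaces any existing data at the path; use "append" to add to it.
    joined.write.format("delta").mode("overwrite").save(output_path)
    return joined
```

A left join is used here so that trips with an unknown location are kept with null zone columns instead of being dropped.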
The final step is just a quick read to confirm the write worked. You can select the top 20 rows and show them to confirm the data looks valid (at least for a small sample of it).
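A quick check along those lines might look like this sketch. The function name preview_delta is hypothetical, and the path is whatever location you wrote the Delta output to.

```python
def preview_delta(spark, path, n=20):
    """Read the Delta output back and print the first n rows."""
    df = spark.read.format("delta").load(path)
    df.show(n, truncate=False)  # truncate=False prints full column values
    return df


# In the notebook: preview_delta(spark, "<your Delta output path>")
```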
Please leave a comment if you have any questions. I have also made this demo available in Scala and C# for those interested in the syntax differences when using the other languages.