I’ve presented on building data lakes in Azure many times now. The focus is on Azure Databricks and Azure Synapse serverless (Spark or SQL). Here are some of the questions I have been asked, along with my answers.
For data lakes in Azure, will Synapse Analytics replace Azure Databricks?
That is the question! At this point the answer is probably not yet; a year from now this may be a more interesting decision. Azure Synapse is in preview and not yet generally available, which means you will encounter issues and limitations while Microsoft works toward the first full release. The experience in Synapse serverless Apache Spark is not as smooth as in Databricks. If you are using Apache Spark for a lot of transformations (any compute-heavy workloads), then Databricks is probably going to work better for you, and if you are doing streaming then you should be using Databricks in most cases. However, if you are querying the data lake, then Synapse serverless SQL and serverless Apache Spark are nice options that are integrated into the same portal as other key analytics services.
If source data is in a relational database and the destination is relational, should we load that data to the Data Lake?
You may choose not to, but I often find there is value in copying most of that data to the data lake as well. Let’s assume the destination is an Azure Synapse dedicated SQL pool (formerly SQL DW). You could land the raw data in the data lake and use the COPY INTO command to load it from the lake into a managed table in your SQL pool. COPY INTO should perform very well for bulk loads, and you now have data that is easy to query from Spark notebooks and serverless SQL scripts. Assuming you have other data sources in your data lake, you can do joins and other integration there without having to move everything into the more expensive dedicated SQL pool just to join your data together.
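To make the two-step pattern concrete, here is a minimal pure-Python sketch of the first step: landing raw relational rows as a file in the lake. An in-memory SQLite table and a local temp folder stand in for the real source database and ADLS container; the table name and paths are made up for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical source: an in-memory SQLite table stands in for the
# relational source (in practice, e.g. an Azure SQL DB or on-prem server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

# "Lake" destination: a local folder stands in for an ADLS container path.
lake_dir = Path(tempfile.mkdtemp()) / "raw" / "sales"
lake_dir.mkdir(parents=True)
out_file = lake_dir / "sales_000.csv"

# Extract the raw rows and land them as a file in the lake.
with out_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerows(conn.execute("SELECT id, amount FROM sales"))

print(out_file)  # landed file, ready for the second step
```

The second step would then run in the dedicated SQL pool: a COPY INTO statement pointed at that ADLS path bulk loads the file into a managed table.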
If your destination is Azure SQL DB, then the reasons you may still choose to land data in a data lake are to store all history cheaply, to run batch analytics alongside other data sources, or to offload ETL work from the SQL instances, which is a common practice once data sizes get large enough.
To manage the size of your data, you can do incremental pulls (changes only) throughout the day, or even stream the data changes in real time, though streaming adds complexity to the picture.
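A common way to implement incremental pulls is a high-watermark pattern: track the latest modified timestamp you have already loaded, and each run pulls only rows changed since then. Here is a minimal sketch with plain Python lists standing in for the source query; the column names and timestamps are made up.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp column; in a
# real pipeline this would be a query against the source database.
source_rows = [
    {"id": 1, "modified": datetime(2021, 1, 1, 8, 0)},
    {"id": 2, "modified": datetime(2021, 1, 1, 12, 30)},
    {"id": 3, "modified": datetime(2021, 1, 1, 17, 45)},
]

def incremental_pull(rows, watermark):
    """Return only rows changed since the last successful pull,
    plus the new watermark to persist for the next run."""
    changed = [r for r in rows if r["modified"] > watermark]
    new_watermark = max((r["modified"] for r in changed), default=watermark)
    return changed, new_watermark

# This run picks up only rows modified after the stored watermark.
changed, wm = incremental_pull(source_rows, datetime(2021, 1, 1, 10, 0))
print(len(changed), wm)  # 2 changed rows; watermark advances to 17:45
```

Persisting the returned watermark (in a control table or file) is what makes the next run incremental rather than a full reload.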
Where are Delta Lake tables stored?
Delta is a file format, built on top of Parquet, that we can use in Spark to get good performance for analytics. Under the covers, the data is stored at the path you specify, which will often be Azure Data Lake Storage Gen2 (ADLS). If you save it as a managed table within Databricks, it will be stored in an Azure Blob Storage account that is a managed resource tied to your Databricks workspace; if you ever delete the workspace, you would need to move that data to another location first. That is why we often point tables at a separate ADLS container we have configured for access from Databricks. Either location performs well, so I haven’t actually found a performance reason to prefer one over the other.
Should I use Spark or Azure Data Factory for data pipelines (ETL)?
This depends on the scenario and on personal/team preference. To start with my preference: Spark is my go-to engine for data pipelines over a data lake. I say this because Spark gives you a lot of power and control over what you are doing. Since you are writing code, you will want to set up version control and likely develop from an IDE such as PyCharm or VS Code instead of working only in notebooks. By following this pattern, you can easily reuse code by creating functions and/or classes, which makes changes that affect your whole ETL codebase much easier to implement. My concern with Azure Data Factory or Synapse Pipelines is maintainability: debugging is not fast, table schemas often get saved with the code and are hard to change without modifying JSON directly, and you can’t export the whole thing as easy-to-read code. If I’m wrong about any of these things, please leave a comment.
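The reuse argument can be sketched in a few lines: one shared transform function applied across many table loads, so a change lands in one place. Plain Python dicts stand in for Spark DataFrames here, and the table and function names are made up.

```python
# Shared transform reused by every table's load: lowercase column names
# and strip whitespace from string values.
def standardize_columns(row):
    return {
        k.lower(): v.strip() if isinstance(v, str) else v
        for k, v in row.items()
    }

def load_table(rows, transforms):
    """Generic loader: apply each shared transform in order."""
    for t in transforms:
        rows = [t(r) for r in rows]
    return rows

customers = [{"Name": "  Ada ", "ID": 1}]
orders = [{"OrderID": 10, "Status": "OPEN "}]

# Both pipelines share the same transform list; fixing a cleaning rule
# in standardize_columns fixes it for every table at once.
shared = [standardize_columns]
print(load_table(customers, shared))  # [{'name': 'Ada', 'id': 1}]
print(load_table(orders, shared))     # [{'orderid': 10, 'status': 'OPEN'}]
```

In a real Spark codebase the same idea applies with functions that take and return DataFrames, kept in a versioned package rather than copied between notebooks.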
The positives of Azure Data Factory or Synapse Pipelines, from my perspective, are simple copy tasks and orchestration. You may need to run many different Spark jobs in order, and only if their dependencies were successful; Data Factory is a good solution for that. It is similar to how I would use Apache Airflow if implementing this on-premises.
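To show what "run jobs in order, only if dependencies succeeded" means, here is a toy dependency-aware runner in pure Python. The job names, the dependency graph, and the stub callables are invented; in Data Factory or Airflow the tool provides this logic for you.

```python
def run_pipeline(jobs, deps):
    """jobs: name -> callable returning True on success.
    deps: name -> list of upstream job names.
    Runs a job once all upstreams succeeded; skips it if any failed."""
    status = {}
    remaining = set(jobs)
    while remaining:
        progressed = False
        for name in sorted(remaining):
            upstream = deps.get(name, [])
            if any(status.get(u) in ("failed", "skipped") for u in upstream):
                status[name] = "skipped"      # upstream broke: do not run
                remaining.discard(name)
                progressed = True
            elif all(status.get(u) == "succeeded" for u in upstream):
                status[name] = "succeeded" if jobs[name]() else "failed"
                remaining.discard(name)
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle detected")
    return status

jobs = {
    "copy_raw": lambda: True,    # e.g. a copy activity that succeeds
    "transform": lambda: False,  # e.g. a Spark job that fails
    "publish": lambda: True,     # never runs: its upstream failed
}
deps = {"transform": ["copy_raw"], "publish": ["transform"]}
print(run_pipeline(jobs, deps))
```

Expected behavior: copy_raw succeeds, transform fails, and publish is skipped because its dependency did not succeed.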
How do I make sure files are not too small (ideally at least around 128 MB)?
First, 128 MB isn’t a hard number these days, but you do want to avoid many small files when you could do with fewer large files. Start by choosing partitions (subfolders) that will not lead to small file sizes. If small files are the result of streaming data or small batch loads, then you will want to compact the files on some sort of maintenance schedule. For the Delta format in Databricks, there is an OPTIMIZE command you can run in a scheduled script against any tables that need it. For Parquet files or other formats, you need to build something yourself.
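Here is a rough sketch of what such a compaction job does (Delta’s OPTIMIZE handles this for you): merge many small files in a folder into fewer files near a target size, then remove the originals. The file naming, the local temp folder, and the tiny 30-byte target are illustrative so the example runs instantly; a real job would target something like 128 MB on ADLS.

```python
import tempfile
from pathlib import Path

def compact(folder, target_bytes):
    """Concatenate small files into new files of roughly target_bytes,
    then delete the originals."""
    files = sorted(p for p in folder.iterdir() if p.name.startswith("part-"))
    batch, size, out_index = [], 0, 0

    def flush():
        nonlocal out_index
        if batch:
            merged = b"".join(p.read_bytes() for p in batch)
            (folder / f"compacted-{out_index:03d}.dat").write_bytes(merged)
            for p in batch:
                p.unlink()  # remove the small source files
            out_index += 1

    for p in files:
        batch.append(p)
        size += p.stat().st_size
        if size >= target_bytes:
            flush()
            batch, size = [], 0
    flush()  # write out any remainder smaller than the target

folder = Path(tempfile.mkdtemp())
for i in range(6):  # six tiny "small files", e.g. from micro-batches
    (folder / f"part-{i}.dat").write_bytes(b"x" * 10)
compact(folder, target_bytes=30)
print(sorted(p.name for p in folder.iterdir()))  # two compacted files remain
```

A production version for Parquet would read and rewrite the data through Spark rather than concatenating bytes, since Parquet files cannot simply be appended together, but the scheduling and sizing logic is the same.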
Because streaming produces small files, we often do not write streams directly to files but instead write to the final destination database or to Event Hubs or Apache Kafka. A job can then run every 10 minutes or every hour to save the accumulated data to files. If that still creates files that are too small, the compaction guidance above applies.
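The buffer-then-flush pattern looks like this in miniature: events accumulate in a queue (standing in for Event Hubs or Kafka), and the job that would run every 10 minutes drains the queue into one larger file per run. The in-memory deque, local temp folder, and file naming are all stand-ins for illustration.

```python
import json
import tempfile
from collections import deque
from pathlib import Path

buffer = deque()           # stand-in for an Event Hub / Kafka topic
lake = Path(tempfile.mkdtemp())  # stand-in for an ADLS folder

def produce(event):
    buffer.append(event)   # streaming producers append continuously

def flush_to_lake(run_id):
    """The scheduled job: drain everything buffered so far into one file."""
    events = []
    while buffer:
        events.append(buffer.popleft())
    out = lake / f"batch-{run_id}.json"
    out.write_text(json.dumps(events))
    return out, len(events)

for i in range(100):       # many tiny events arrive between runs
    produce({"id": i})
path, count = flush_to_lake(run_id=1)
print(count, path.name)    # one file with 100 events instead of 100 files
```

The key trade-off is latency: data is only queryable in the lake after the next flush, which is usually acceptable for batch analytics.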
Which language should we use to write Spark code?
Scala and Python are both really good options. Scala has some great features that work well with Spark, and since it runs natively on the Java Virtual Machine, you sometimes see performance benefits. However, I would not choose Scala for that reason alone. Python is an easier language for many developers to read and learn, and it has really powerful libraries you can easily install and use. Python was also easier for me to learn with Spark after using mostly SQL for the first part of my career.
C# (via .NET for Apache Spark) has now hit its 1.0 release, but I am not sure how easy it will be to build high-performing Spark jobs with it right now. It does not come set up on Azure Databricks, which is one drawback, and you are also unlikely to find many code examples on Stack Overflow or other sites.