I hear questions quite frequently about what options are best for data pipelines? Should we write code using Pandas or Spark? Should we use AWS Glue or Azure Data Factory? Or maybe SSIS? Where do Airflow and Luigi fit?
I plan to dive into these technologies and provide more clarity into the options we have today, so stay tuned to this site for tutorials and other writeups. But given my experience in this space I want to share my current thinking which I expect will evolve as I dive deeper into the newest cloud offerings.
I used to work with and at times help sell an ETL tool that had a graphical drag and drop interface. I really did like the tool because with a little training you could quickly build a basic ETL job. These types of ETL tools took me far in my career and I still like them if pulling data from a database that has a static or slow changing data model. They make it simple to do simple things once we learn the basics. But there are limits to what the built-in connectors and transformation options can do, so we still end up writing logic in SQL or a programming language to cover additional cases. One important note, the applications I used did not have a way to convert the jobs built with a visual interface into code, which is something some of the newer technologies provide. The main downside with the tools I used is that the basic techniques expected in software development projects just didn’t fit well with the ETL tools I tried.
About 4 years ago I suggested we are better off without a graphical tool that abstracts a lot of the work. It was not because the ETL tools we could have chosen weren’t better for certain tasks, but overall we preferred Python and SQL to move and process our data. The primary reasons we went down this path were for increased flexibility, portability, and maintainability.
One of my top regrets leading a Data Warehousing team that used an ETL tool is that we felt limited by what the tool was capable of doing. Elements of ETL that were not as important when the team started were not easily supported by the tool. The best example of this was reading from a REST API. Another was working with JSON data as a source. I’m sure we could find a tool that can do this for us now, but what else will we encounter in the future? Most of the data we want to consume comes from cloud vendors or messaging systems such as Kafka. So can we find a tool that integrates well with everything we use now and in the future? If we are using core Python, Pandas, or Spark we have no limits on what is possible for us to build. There are many libraries that are already built which we can leverage, and we can modify our libraries as new ideas come up rather than being stuck with what a tool provides out of the box. In many cases we trade off having a longer ramp up period to get our first build working in order to have more flexibility and control down the road, but it cuts down the amount of frustrating rework when systems change.
I am sure there are plenty of different tools out there that do everything you could want to do (at least according to their sales team), but I love the flexibility, control, and maintainability of writing our own applications to move data. With the improvements in cloud data pipeline services such as AWS Glue and Azure Data Factory, I think it is important to explore how much of the downsides of ETL tools still exist and how much of the custom code challenges they have overcome. I welcome your comments on what option for ETL you prefer, and stay tuned to hear more about my journey in investigating some of the top options available today.
Hey dustin – thanks for the blog post as my company is going through this decision right now. Has your thought process evolved over the past 6 months in this area?
Hey Ricky, I still would hesitate to go all in on an ETL tool unless you can easily customize the underlying code, but a hybrid approach seems more appealing to me than when I first wrote this. I’ve seen great examples of using Azure Data Factory mostly for control flow or basic data copies then calling Spark notebooks from those flows for the more complex work. As long as you don’t let the ETL tool limit what the team can do, and how easily you can adapt to new data types and schemas, then a hybrid is a fine option. Personally I like something like Apache Airflow with Databricks but the main goal is that the development team can affordably use the right tool for the job and not be overwhelmed with too much complexity or too little control.
Hi Dustin, I agree with most of what you said. Hybrid does seem to be the best of both worlds. That being said, I can think of three advantages of an ETL tool.
1. ETL tools is generally better in data lineage and impact analysis in case of source change.
2. Pipelines built in ETL tools are easier to monitor and support by entry level IT personnel such as service desk. Most ETL tools come with a nice GUI for monitoring. Logging is usually pretty comprehensive and meaningful. We have a 24*7 service desk and they do all the monitoring and 1st level troubleshooting by following knowledgebase articles/instructions on common failures. Data engineers are only on 2nd/3rd level support. This operating model proves to be very cost effective.
3. Most of the ETL tools I have worked with have better error handling mechanisms. Surely you can write your own, but these little things could add up pretty quickly.