I frequently hear questions about which options are best for data pipelines. Should we write code using Pandas or Spark? Should we use AWS Glue or Azure Data Factory? Or maybe SSIS? Where do Airflow and Luigi fit?
I plan to dive into these technologies and provide more clarity on the options we have today, so stay tuned to this site for tutorials and other writeups. In the meantime, given my experience in this space, I want to share my current thinking, which I expect will evolve as I dig deeper into the newest cloud offerings.
I used to work with, and at times help sell, an ETL tool that had a graphical drag-and-drop interface. I really did like the tool because with a little training you could quickly build a basic ETL job. These types of ETL tools took me far in my career, and I still like them for pulling data from a database that has a static or slowly changing data model. They make it simple to do simple things once we learn the basics. But there are limits to what the built-in connectors and transformation options can do, so we still end up writing logic in SQL or a programming language to cover additional cases. One important note: the applications I used did not have a way to convert jobs built with the visual interface into code, which is something some of the newer technologies provide. The main downside of the tools I used is that the basic techniques expected in software development projects just didn’t fit well with them.
About 4 years ago I suggested we were better off without a graphical tool that abstracts away a lot of the work. The ETL tools we could have chosen may well have been better for certain tasks, but overall we preferred Python and SQL to move and process our data. The primary reasons we went down this path were increased flexibility, portability, and maintainability.
One of my top regrets from leading a Data Warehousing team that used an ETL tool is that we felt limited by what the tool was capable of doing. Elements of ETL that were not as important when the team started were not easily supported by the tool. The best example of this was reading from a REST API. Another was working with JSON data as a source. I’m sure we could find a tool that can do this for us now, but what else will we encounter in the future? Most of the data we want to consume comes from cloud vendors or messaging systems such as Kafka. So can we find a tool that integrates well with everything we use now and everything we will use in the future? If we are using core Python, Pandas, or Spark, we are not limited to what a vendor has decided to support. There are many existing libraries we can leverage, and we can modify our own libraries as new ideas come up rather than being stuck with what a tool provides out of the box. In many cases we accept a longer ramp-up period to get our first build working in exchange for more flexibility and control down the road, and that trade cuts down on frustrating rework when systems change.
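To make the REST API and JSON cases concrete, here is a minimal sketch of what such a job can look like in core Python. Everything specific in it is an assumption made for illustration: the sample payload, the `orders` table, and the function names are hypothetical and do not come from a real project. It uses only the standard library and a canned JSON string in place of a live API call.

```python
import json
import sqlite3

# Hypothetical payload standing in for the body of a REST API response.
SAMPLE_RESPONSE = """
{
  "orders": [
    {"id": 1, "customer": {"name": "Ada", "country": "UK"}, "total": 25.5},
    {"id": 2, "customer": {"name": "Lin", "country": "SG"}, "total": 40.0}
  ]
}
"""


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def extract(raw_json):
    """Parse the raw JSON text and return the list of order records."""
    return json.loads(raw_json)["orders"]


def transform(records):
    """Turn each nested record into a flat row suitable for a table."""
    return [flatten(record) for record in records]


def load(rows, conn):
    """Create the (hypothetical) target table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, customer_name TEXT, "
        "customer_country TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES "
        "(:id, :customer_name, :customer_country, :total)",
        rows,
    )
    conn.commit()


def run_pipeline(raw_json, conn):
    """Run extract, transform, and load; return the number of rows loaded."""
    rows = transform(extract(raw_json))
    load(rows, conn)
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(SAMPLE_RESPONSE, connection), "rows loaded")
```

In a real job the JSON text would come from an HTTP request rather than a constant, and the target would be a warehouse rather than an in-memory SQLite database. The point of the sketch is only that each step is a small function we own, so when a source changes shape we edit `flatten` or `transform` instead of waiting for a connector to catch up.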
I am sure there are plenty of tools out there that do everything you could want (at least according to their sales teams), but I love the flexibility, control, and maintainability of writing our own applications to move data. With the improvements in cloud data pipeline services such as AWS Glue and Azure Data Factory, I think it is important to explore how many of the downsides of ETL tools still exist and how many of the challenges of custom code these services have overcome. I welcome your comments on which ETL option you prefer, and stay tuned to hear more about my journey investigating some of the top options available today.