This is part 2 of my Journey of a Data Engineer series, which all started from the question “What’s the best path to be a great data engineer?” Check out Part 1: From College to BI Developer for the path from college through my first role as a BI consultant. In this post I’ll cover the steps that took me from more traditional BI work (SQL, SSIS, SSRS, Tableau) to building scalable cloud-based data platforms. In my previous post I called out that there is more than one path to success. I’ll take it a step further and say that being a “great” data engineer is a whole separate topic that I may not be able to answer. So I’ll continue my story of how I progressed on the path to data engineer and what it looked like once I became one. I moved forward with no clear finish line, but I knew I was on the path to becoming a data engineer, so I kept going. I hope my story provides some insight that’s helpful wherever your journey takes you.
BI / ETL Developer to Director
When I started my role in San Diego as a BI Developer using the Microsoft BI suite, I had no clue what the next years would bring. I took the experience I had gained consulting and applied a similar mindset. My team’s focus was on designing and building a critical data warehouse. We used SQL Server Integration Services (SSIS) to build out ETL that loaded data marts on a nightly basis. We had a small but talented team, and I learned a lot from them while constantly debating my views on designing a good data warehouse. I had a lot to learn about building data warehouses with the SQL Server stack, so over the first few years I learned from resources like SQL Server Central, the Kimball Group, and TDWI. I also relied heavily on various blogs I found when I had specific challenges to solve in SQL Server or SSIS. The opportunity to improve skills on the job was always my best method of learning, but it did require putting in extra time as I came up to speed.
One important part of my role was understanding the key business rules and representing the technical side of user acceptance testing. This role was never actually assigned to me, but it was clear that to be successful in our efforts we needed to get the business users comfortable with the logic. A lot of my contributions came from recognizing a team need and stepping into it, rather than waiting for someone to assign me tasks. Fortunately these discussions led to relationships throughout the organization, people I could reach out to for answers. The connections and technical skills I gained helped drive my growth. I let my boss know that I wanted to work toward a lead developer or manager role and was soon given that opportunity. This led to learning and developing non-technical skills which I believe benefited all my future roles. After some time my boss resigned from the company and I was asked to step in as Director. Job titles can mean different things depending on the organization, but this was a big step up from where I expected to be when I started with the company a couple of years earlier. This director role provided many challenges and I felt immense pressure to protect and lead my team through difficult times at the company. It was tough, but this experience was ultimately good for my career development. I improved my leadership skills, learned about working across teams, and was able to focus on the big picture. As I worked my way through an online MBA program I realized that I would much prefer to learn new technology instead, so I quit my formal training and started learning concepts and tools around Big Data. This choice to focus on learning modern technology and skills is what pushed me from a Microsoft BI Developer/Manager track to being a full-fledged Data Engineer.
So what did I do to learn skills beyond SSIS and SQL?
- Attended trainings and user groups focused on new technology: Big Data, Hadoop, MPP databases, data science algorithms
- Designed use cases for how my team could use the modern tech
- Found free online courses and tutorials: Udacity, Coursera, YouTube, vendor sites
- Talked to colleagues who were learning the same topics
The steps I took aren’t revolutionary, I admit. Let’s break down some specific courses and what was beneficial. The Udacity Hadoop and MapReduce course introduced me to Python. This course made it easy by providing a web-based development sandbox for completing the exercises. From there I moved to a Coursera course on data science with Python. It was a challenge to keep up with the schedule, but I persevered and completed the whole course. A feat never to be accomplished again by me on Coursera, but this course was something I needed and I made it happen. It pushed me to install and set up a Python environment on my laptop, which had its challenges, and it set up more realistic projects than my prior work in Python. At this point I had learned quite a bit about Hadoop by listening to overview after overview until it finally made sense, but I knew I needed to get hands-on. A colleague pointed me to Cloudera, which had a great getting-started virtual machine plus simple instructions to get running. After following the tutorial I had a working Hadoop environment and a little knowledge of how to use it, so I started experimenting with other public data I found. I worked through questions and errors using Stack Overflow and blogs found through searches. While today’s popular technology and courses will differ from what I experienced, I do recommend a similar progression: start with a simple course with interactive exercises, then work up to building a demo project beyond what any tutorial covers.
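For a sense of what those early exercises looked like, here’s a minimal sketch of the classic Hadoop Streaming word count in Python. This is my own illustration, not the actual course material: the mapper emits (word, 1) pairs, Hadoop’s shuffle sorts them by key, and the reducer sums each group. The script simulates the sort/shuffle step locally so you can run it without a cluster.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit a (word, 1) pair for every word on every input line.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive sorted by key (Hadoop's shuffle guarantees this);
    # sum the counts for each distinct word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the Hadoop pipeline: map, sort, reduce.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```

With Hadoop Streaming the same logic would be split into separate mapper and reducer scripts reading stdin and writing stdout, with the framework handling the sort between them.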
Big shift to Data Engineer
One day a friend of mine, the former Director of my team, explained he was working on a proposal to start a data team for Pluralsight, based in San Diego. He asked if I would want to lead one of the teams, and I responded with a strong “probably”. Fast forward a few months, and it was my first day as Director of Data Warehousing, sitting in a co-working space in downtown San Diego. Now it was time for me to build an ETL process in Python to move data from SQL Server to BigQuery. At that point I had just become a data engineer, and I didn’t fully realize it.
The adjustment to data engineering work from a purely SQL and SSIS environment was tough. I had already watched a few Pluralsight courses, one on BigQuery and one on Python, since getting access a week prior. Next I needed to put the pieces together, so I pulled up the Google Cloud documentation and started to fumble through the quick start guides and sample code. I had some success but ended up battling for many hours with Python library installations, Google Cloud command line authentication, figuring out the right Python environment on a Windows laptop, and realizing how much I didn’t know about this world I had just walked into. I was out of the tutorials and class assignments and deep into code that I would have to deploy and maintain. The main thing I learned at this time was to keep making time to learn, and to not hesitate to delete and rewrite parts of your codebase as you get better.
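To give a flavor of that kind of first pipeline, here is a hedged sketch of a SQL Server to BigQuery load in Python. It assumes the pyodbc and google-cloud-bigquery libraries; the DSN, query, and table names are hypothetical, and a production version would need retries, batching, and error handling that this sketch omits.

```python
def rows_to_records(description, rows):
    # Turn DB-API rows into JSON-serializable dicts keyed by column name,
    # the shape that BigQuery's load_table_from_json expects.
    columns = [col[0] for col in description]
    return [dict(zip(columns, row)) for row in rows]

def run_pipeline(sql_server_dsn, query, bq_table):
    # Hypothetical end-to-end flow: extract from SQL Server, load to BigQuery.
    import pyodbc                      # third-party: ODBC binding for SQL Server
    from google.cloud import bigquery  # third-party: google-cloud-bigquery client

    with pyodbc.connect(sql_server_dsn) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        records = rows_to_records(cursor.description, cursor.fetchall())

    client = bigquery.Client()  # uses ambient Google Cloud credentials
    job = client.load_table_from_json(records, bq_table)
    job.result()  # block until the load job finishes
    return len(records)
```

Keeping the transform step (`rows_to_records`) as a pure function separate from the I/O made this kind of code far easier to test later on, which connects to the next point.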
It’s important to call out the things a data engineer should be doing that I wasn’t doing in the first year (or more): writing automated tests, using classes when beneficial, setting up code deployment pipelines, scripting out infrastructure/service creation, and proper application logging.
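As one small example of the last item on that list, Python’s standard logging module gets most of the way to proper application logging. This sketch (with illustrative names, and a stand-in for the actual load step) shows the general pattern:

```python
import logging

logger = logging.getLogger("pipeline")

def configure_logging(level=logging.INFO):
    # One-time setup: timestamp, level, and logger name on every line.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

def load_batch(rows):
    # Log progress and failures instead of printing or failing silently.
    logger.info("starting load of %d rows", len(rows))
    try:
        loaded = len(rows)  # stand-in for the real load step
    except Exception:
        logger.exception("load failed")  # records the full traceback
        raise
    logger.info("loaded %d rows", loaded)
    return loaded
```

The payoff comes when a nightly job fails at 3 a.m. and the log tells you which batch died and why, instead of leaving you to rerun the whole thing with print statements.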
In my time at Pluralsight we introduced more than just Python and BigQuery. We ended up introducing Hadoop, Impala, Spark, Scala, and Kafka. I was not an expert in all of these things, but we leveraged the different strengths on the team to get things built and keep them running. We made mistakes. We made decisions that may have been right at the time but I would do differently today. I learned a hell of a lot because ultimately I felt responsible for all of these technologies being successful even if I could count on others to be the experts. At this point I was learning from a combination of documentation, blogs, Pluralsight courses, local meetups, and so many Stack Overflow posts where people had made the same mistake I was troubleshooting.
The time came when leadership at the company decided to consolidate offices and asked my team to move. I stayed in San Diego and took a new role focused on using the Spark and AWS skills I had developed to build a new data system to support Intuit. I learned a lot from some smart developers and DevOps specialists. This system was complex and partially custom, so I used existing code and documentation to learn (plus some teamwork with other developers). There was no online course for the kind of thing we were doing; it was a matter of taking the components of what I had learned and implementing them in a new way.
The learning continues
And then I got a call asking, “Have you ever thought about working for yourself?” I had been thinking about that for a few years but was hesitant to make the leap. I wanted to ignore the opportunity, but I knew that now was the right time to make that change. This change has opened up so many opportunities to learn and grow. It also brought me firmly back into the Microsoft space. I pivoted to building the same types of data pipelines and platforms, but now primarily with Azure services (and sometimes even in C#). I mostly learn from the documentation and quick start guides, plus some hands-on experimentation. The best way for me to learn now is to figure out how I would explain the technology to others who have similar experience outside of Azure.
I still remember how essential the step-by-step tutorials were when I was first learning data engineering skills. I remember the frustration and wasted time of setting up my local development environments for Python, Hadoop, Spark, and Kafka. In part, I hope that the training content I provide will help others get a kickstart in their journey from SQL to Python, or SQL Server to Spark, or AWS to Azure. And so the learning continues for us all.