Data Engineer Question and Answer

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible to try and keep the responses brief. These are informal thoughts of mine, not something I have sat down to rethink and research for new ideas beyond what is already in my head.

1. What skills are required to be successful as a Data Engineer?

This is the most common question I hear. Since I heard it so often I wrote up a blog post about data engineer skills and presented at a few conferences with a focus on Azure skills. If you are interested, check out the short version or the long version.

The top 2 skills I recommend are SQL and Python. I think Scala is a great alternative to Python; it’s mostly a matter of preference. If you aren’t familiar with either language, try out a tutorial in both before choosing. If you are learning Python, a great place to start for data processing is Pandas. Another area to learn is reading data from Web APIs with the Requests library.
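
As a rough sketch of what a first practice session with these two libraries might look like, the snippet below loads JSON-style records into Pandas and totals them per city. The `fetch_records` helper and whatever URL you pass to it are hypothetical; it assumes an API that returns a JSON array of objects. The sample records are made up so the Pandas part runs without a network call.

```python
import pandas as pd


def fetch_records(url: str) -> list[dict]:
    """Fetch a JSON array from a web API (hypothetical endpoint, assumed to return a list of objects)."""
    import requests  # imported here so the Pandas part below runs even without Requests installed

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.json()


def total_by_city(records: list[dict]) -> pd.DataFrame:
    """Sum the 'amount' field per 'city' and sort the largest total first."""
    df = pd.DataFrame(records)
    return (
        df.groupby("city", as_index=False)["amount"]
        .sum()
        .sort_values("amount", ascending=False)
        .reset_index(drop=True)
    )


# Made-up sample data standing in for what fetch_records() might return.
sample = [
    {"city": "Seattle", "amount": 120.0},
    {"city": "Austin", "amount": 80.0},
    {"city": "Seattle", "amount": 30.0},
]

summary = total_by_city(sample)
print(summary)
```

Once this works on sample data, swapping in `fetch_records` against a real public API is a good next exercise.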

You may already know I use Apache Spark the most of any library or framework. You can get started in Python by installing the library PySpark. It is meant to run on a cluster, but even practicing on your laptop is a great idea. Over time I have been convinced that spinning up a Databricks Community account to get some initial Spark practice is the easiest path. If you are already working in Azure with a pay-as-you-go account, you can enable a trial Databricks workspace for 14 days to test it out (but please double check that all the VM resources included with it are completely free under the trial).

Working with one of the top clouds is also good, so try to get some experience with either Azure or AWS. I think Azure has a lot to offer for data engineering and quite a few enterprises use it. AWS is used by more companies and especially by more startups. I think either is fine to start with. There are also opportunities to use Google Cloud for data engineering but I don’t think it makes it as easy to get started. If going the Azure route you should check out https://learn.microsoft.com/en-us/training/paths/azure-for-the-data-engineer/.

2. What online resources can help me learn the basics of data engineering?

I have researched and created my recommended list of Python training.

At this point in my career, I tend to pick out technologies I see are popular in the region based on job descriptions, then go walk through the QuickStart in the documentation. Sometimes that is an easy way to get some practice, but sometimes it takes a lot more research to learn the technology. Videos on YouTube can help show you how to get started, and sites like Pluralsight have introductory courses and exercises for you to follow along with.

Microsoft Learn has a bunch of learning paths which can be helpful: https://learn.microsoft.com/en-us/training/paths/azure-for-the-data-engineer/.

For Scala, I used a book (see question 3), YouTube videos, and learned from a colleague by contributing to his work. I also have high regard for training by Rock the JVM and Lightbend.

Here are some of my favorite conferences for learning fundamentals. They have put many sessions online for free:

  1. PASS Data Community Summit (Azure and SQL Server focus)
  2. SQL Bits (Azure and SQL Server focus)
  3. Data and AI Summit (Apache Spark focus)

There are many free and paid options available online. I can’t keep up with all of them enough to be confident I know the best ones, but hopefully this guidance is helpful.

3. What books should I read to learn data engineering?

I started a long time ago, so what was helpful then may not be helpful now. I read “The Data Warehouse Toolkit” by Ralph Kimball. You can find free online training about dimensional modeling, also called star schema design, without buying the large book. However, the book also has a lot of guidance about running projects and related topics. That information isn’t as critical for a new data engineer but it helped me a lot throughout my career. It is a more traditional approach but understanding it can help you with modeling even in data lakes.
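
If star schema design is new to you, here is a tiny sketch of the idea using SQLite from Python's standard library. The table names, column names, and rows are all made up for illustration and are not taken from the book: a central fact table holds the measures, and each dimension table holds the descriptive attributes you filter and group by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

    -- The fact table holds measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        quantity INTEGER,
        sales_amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Road Bike', 'Bikes'), (2, 'Helmet', 'Accessories');
    INSERT INTO dim_date VALUES (20230115, '2023-01-15', 2023), (20240310, '2024-03-10', 2024);
    INSERT INTO fact_sales VALUES
        (1, 20230115, 1, 900.0),
        (2, 20230115, 2, 60.0),
        (1, 20240310, 2, 1800.0);
    """
)

# A typical star-schema query: join the fact to its dimensions, then aggregate.
rows = conn.execute(
    """
    SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

The same fact-and-dimension shape applies whether the tables live in a warehouse or as files in a data lake; only the storage and query engine change.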

If you go the Azure route, there is a free e-book you can get, the Azure Synapse Analytics Cookbook.

For data engineering with Databricks, there is a free book on data engineering.

For Scala, I like any of the books by Alvin Alexander. The most recent one I am using to review the fundamentals is Learn Scala 3 The Fast Way.

4. How was your journey to data engineering?

I think this is the most complete story (two articles):

  1. https://dustinvannoy.com/2020/04/07/journey-of-a-data-engineer-part-1/
  2. https://dustinvannoy.com/2020/04/26/journey-of-a-data-engineer-part-2/

The short version is that I had a formal education in programming and databases, but to shift into data engineering I had to push myself a bit further on my own. I used some online courses that had assigned coding exercises. I also watched a lot of presentations at conferences. Later on I watched free online trainings and read blog posts, which is why I try to create my own content now to help others. The most important thing about my journey is that I used what I already was doing well and added new skills and concepts to the mix. By being successful and a good teammate at one company, I was recruited to join other companies even if I still needed to learn some of the important technical requirements for the role.

5. What challenges did you face?

There is a concept of imposter syndrome that I first learned about several years ago and it describes the feeling I wrestled with through getting my degree. I felt out of place the first time I was in a programming class and some of the students already knew a lot about programming. I remember some students had laptops and were programming right there in class as we were being introduced to the basics. I built up a little confidence since the teacher explained things slowly at the start, but by the second semester I felt lost and overwhelmed. I considered giving up and changing careers because I didn’t know if I was good enough. To be honest, I probably would have changed directions but I didn’t want to go home and tell people I failed at what I set out to do.

Skipping forward quite a bit, I encountered different challenges when I decided to learn big data technologies and shift to data engineering. I had to get out of my comfort zone where I was an expert in order to learn new skills. The main challenges for me were finding time to learn, dealing with things that just didn’t work as documented, and then forcing myself to work on my own projects to get real practice. Just to be clear, there were times that I thought, “Everyone else seems to get Python, why do I keep getting errors?” But then I figured out where I had incorrectly indented the code and suddenly I enjoyed the thrill of a working application. I learned it was ok to have ups and downs as long as you still get some joy out of finding solutions or building something new.

6. Is it required to get a certification?

No, but it may help prove what you know and it is a good way to force yourself to learn. I think the best option to prove what you know is to put in work on one or two side projects (personal projects that you choose) that can be shared publicly. It can be by working with a group that has a project they need help with or doing your own thing. Maybe write a blog post about it if you can. Most importantly, it will help show your abilities in an interview if you can list a personal project and then answer questions about what you did, what you learned, and why you made certain decisions.

If you do get a certification that involves taking a test to prove your skills, put that on your LinkedIn and your resume. It says something about what you have learned and that you invested in yourself. But it isn’t required for the job and may be ignored by many hiring managers.

Final Thoughts

Programming and data engineering don’t have to be your hobby in order to be successful, but when learning new skill sets you will probably need to dedicate some of your free time to learning them. It’s ok if it feels like work and you would rather be doing something else, but there should be times where you are excited because you built something that works. If you never find any joy in it, you probably don’t want to do it 40 hours a week. The money is good though, so there is no shame in pursuing it to build up wealth. I encourage everyone to not forget where you came from and what you had to go through to be successful. Remembering your past will lead you to be generous and help others.
