Data Engineer Skills for Success

Data engineer job descriptions vary significantly because data engineers are asked to work on many different kinds of projects. The reality is that the person hired into a data engineer role never has all the desired skills when joining a team. Yet there are categories of skills that are consistently desired in a data engineer and that serve as a foundation for learning new technologies. If you missed it, check out my first post in this series on data engineer responsibilities and tasks. The next step in this series is identifying the key skills and tools of a data engineer. Here are the skills I see as most critical for success as a data engineer.

Expected Skills

SQL

Structured Query Language (SQL) is the primary language for interacting with databases. Even when working with big data systems there is often an interface to use SQL or something based on it.
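
As a small illustration of the kind of SQL a data engineer writes daily, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its rows are made up for the example; the same GROUP BY pattern carries over to most SQL interfaces, including the ones big data systems expose.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Total spend per customer, largest first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()

print(totals)  # [('alice', 50.0), ('bob', 12.5)]
```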

Python / Scala / Java

Programming is a very valuable skill for a data engineer, and the go-to languages are Python, Scala, and Java. The tasks a data engineer does can be developed in all sorts of languages, and many no-code tools are out there. However, if you are looking to be a data engineer, these are the languages you are most likely to use.

Cloud Analytics Services

Many companies are using the cloud for most or all of their analytics. A data engineer is often asked to use at least a few data services in the cloud. Sometimes it is up to the data engineer to choose the right services for the use case. Knowledge of and experience with AWS, Azure, or Google Cloud are important for getting hired onto these cloud-focused data teams.

Spark / Hadoop

As data engineers are asked to work with more and more data, distributed data processing systems like Apache Spark and Apache Hadoop are often the backbone of the data platform. Even if Spark is not used directly, knowledge of Spark is helpful to understand more about how to scale the different cloud services. I could talk about Spark all day, but let’s move on to more skills.

Database Architecture

A database is often one of the destination systems where processed data is handed off for data scientists to use. That critical handoff can go very wrong without knowledge of how databases work and of proper data modeling, indexing, and partitioning techniques.
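
To make the indexing point concrete, here is a rough sketch using SQLite from Python's standard library. The `events` table, the `idx_events_user` index, and the helper `query_plan` are all hypothetical names for illustration. It asks the database how it plans to run the same lookup before and after an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, "click") for i in range(1000)],
)


def query_plan(connection, sql):
    """Return SQLite's plan for a query as one readable string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is a human-readable description.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT COUNT(*) FROM events WHERE user_id = 42"

plan_before = query_plan(conn, lookup)  # full table scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn, lookup)   # search using the new index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, but the first plan should report a scan of the whole table and the second a search that uses the index. Other databases have their own equivalents of this kind of plan inspection.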

Software Development Practices

Is data engineering a subset of software development? Sometimes it is, and sometimes it is just close. Either way, the practices developed in software development should be understood by all senior data engineers. These practices include clean code, unit testing, test-driven development, continuous integration/deployment, and many more.
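
For a taste of what unit testing looks like on data code, here is a minimal sketch. The function `clean_amount` is a made-up transformation step, and the tests are written as plain functions with assertions, the style a runner such as pytest would discover. The sketch calls them directly so it needs nothing beyond Python itself.

```python
def clean_amount(raw):
    """Parse a currency string such as ' $1,234.50 ' into a float.

    Returns None when the input is blank.
    """
    text = raw.strip().replace("$", "").replace(",", "")
    return float(text) if text else None


def test_strips_symbols_and_separators():
    assert clean_amount(" $1,234.50 ") == 1234.5


def test_blank_input_becomes_none():
    assert clean_amount("   ") is None


# Run the tests directly; a failing assertion raises AssertionError.
for test in (test_strips_symbols_and_separators, test_blank_input_becomes_none):
    test()
print("all tests passed")
```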

Useful Skills

ETL or Orchestration Tools

Whether a data engineer writes a ton of code or is adept at using low-code ETL tools to stitch together data pipelines, simple ETL and orchestration tools are helpful. My go-to example is Apache Airflow, which has many connectors for data ingestion and can also trigger custom code with the appropriate dependencies and retry logic. Some experience in this space helps you decide when a tool is appropriate rather than jumping straight to a “code is better than an ETL tool” perspective.
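
To show what "dependencies and retry logic" mean in practice, here is a toy sketch in plain Python. It is not Airflow's API; the functions `run_with_retries` and `run_pipeline` and the three-step pipeline are hypothetical, and the ordering relies on `graphlib` from the standard library (Python 3.9+). An orchestrator handles these same two concerns, plus scheduling, logging, and much more.

```python
from graphlib import TopologicalSorter


def run_with_retries(task, retries=2):
    """Call task(); on failure try again, up to `retries` extra attempts."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of attempts, surface the error


def run_pipeline(tasks, dependencies, retries=2):
    """Run tasks in dependency order.

    `dependencies` maps a task name to the set of task names it waits on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(tasks[name], retries)
    return order


# Hypothetical pipeline: the transform step fails once, then succeeds.
calls = {"transform": 0}


def extract():
    pass


def transform():
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")


def load():
    pass


tasks = {"extract": extract, "transform": transform, "load": load}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

order = run_pipeline(tasks, dependencies)
print(order)               # ['extract', 'transform', 'load']
print(calls["transform"])  # 2
```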

Streaming Platforms

Streaming data and real-time analytics have arrived. If real-time data is not yet a reality within the organization, it is being talked about for the future. Using Apache Kafka as a distributed log that collects event data from across the organization is becoming more commonplace. Understanding how to use Kafka or similar services as the hub that data streams into and out of is very helpful when developing a modern data pipeline.
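
The "log as a hub" idea can be sketched in a few lines of Python. This is a single-process toy, not Kafka's client API, and the `EventLog` class is a hypothetical name. It shows the core concept: events are appended once, and each consumer reads at its own pace by tracking its own offset.

```python
class EventLog:
    """Append-only list of events; each consumer tracks its own offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> next position to read

    def append(self, event):
        """Add an event to the end of the log and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer, max_events=10):
        """Return the next unread events for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for event in ("signup", "login", "purchase"):
    log.append(event)

first = log.poll("analytics", max_events=2)
second = log.poll("analytics")
other = log.poll("billing")

print(first)   # ['signup', 'login']
print(second)  # ['purchase']
print(other)   # ['signup', 'login', 'purchase']
```

Because reading does not remove anything from the log, a new consumer such as `billing` can start from the beginning without affecting `analytics`.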

DevOps

Understanding the purpose and importance of DevOps is helpful for a data engineer. Sometimes a data engineer is responsible for automating deployments for both code and infrastructure. In other cases, the data engineer needs to work with a specialist who will set this up.

Data Visualization

Visualizing data and using reporting tools is a helpful skill for anyone in the data space. A data engineer may be asked to work with the tool or support others. If this is the focus of the job it is more likely to be considered a data analyst or business intelligence developer role.

Machine Learning / AI

An understanding of machine learning is important since much data engineering work exists to support this function. If someone is strong at machine learning, a job title of data scientist or machine learning engineer is more likely than data engineer.

COMMON PLATFORMS AND TOOLS

While many options exist and each cloud has its own offerings, here is a breakdown of some likely options for modern data platforms. Most environments will contain several of these (or their competitors).

  • Apache Spark for data pipelines or data lake querying
  • Python or Scala for writing ETL scripts or building a platform
  • Apache Airflow for orchestration and simple data transformation
  • Relational databases such as Postgres or MySQL
  • Presto for a distributed query engine
  • Apache Hive for SQL on Hadoop
  • Analytic databases (column-oriented, MPP systems) such as Impala

CLOUD ANALYTICS SERVICES

Since the cloud is often the place our analytics platforms are built, it’s important to understand which cloud options fit into the same grouping as the tools I mentioned above.

Data Ingestion and Transformation

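| Open Source     | Azure         | AWS        | Google Cloud |
|-----------------|---------------|------------|--------------|
| Apache Spark    | Databricks    | Databricks | Dataproc     |
| Python or Scala | Synapse Spark | EMR        | Dataflow     |
| Apache Airflow  | Data Factory  | Glue       | Composer     |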

Data Lake Query

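| Open Source  | Azure         | AWS        | Google Cloud    |
|--------------|---------------|------------|-----------------|
| Apache Spark | Databricks    | Databricks | Dataproc Spark  |
| Presto       | Synapse Spark | EMR        | Dataproc Presto |
| Apache Hive  | Synapse SQL   | Athena     | BigQuery        |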

Data Warehouse / Analytics Storage

| Open Source    | Azure       | AWS      | Google Cloud |
|----------------|-------------|----------|--------------|
| Postgres/MySQL | Azure SQL   | Aurora   | Cloud SQL    |
| Impala         | Synapse SQL | Redshift | BigQuery     |

Streaming Platform

| Open Source  | Azure      | AWS     | Google Cloud |
|--------------|------------|---------|--------------|
| Apache Kafka | Event Hubs | Kinesis | Pub/Sub      |

EASIER THAN IT SOUNDS

This is quite the list of technologies. Do not be discouraged or overwhelmed by it. If you are trying to build up your skills and improve your chances of being hired as a data engineer, you can do it. This list is not intended to be a gatekeeper but a guide to where to build up skills to be successful.

Is this list of technologies biased toward my experience? Well, maybe. I am not an expert in all of these and would not expect someone who joins my team to be either. For example, if a data engineer doesn’t automatically know what “Event Hubs” or “Kinesis” refer to, that is completely fine (though they should know the purpose of Apache Kafka). However, I expect that a successful data engineer can have a reasonable conversation about each of these and know which may be the right fit for our analytics architecture.
