
Data engineering is an exciting and rewarding field. However, many people are not sure what a data engineer actually does. Some core things are generally expected of anyone who claims the title. Not every data engineer does all of them, and roles vary by company. That being said, based on my experience in the field and many discussions with others, I present to you how I define the role of Data Engineer! Drumroll please…
DATA ENGINEER RESPONSIBILITIES
Collect Data
Organizations have many opportunities to track and access data. This leads to a need to store the data efficiently and often pull the data together into a data lake. While this should be an organization-wide effort, data engineers handle a lot of the technical work to make the collection of data possible.
Make Data Available
Data cannot merely be collected; it must also be made available. Data engineers are often tasked with setting up the tools, security, and policies that let people query the right data for their project. At a minimum, they have to think about how the data will be accessed so that they can store it in a way that suits those access patterns.
Transform / Clean / Curate Data
Data is rarely clean and perfectly structured for analytics. The hard work of cleaning, transforming, structuring, and organizing the data falls on the plate of data engineers. Sure, others need to do this type of work from time to time, but data engineers are usually expected to handle as much of it as possible to reduce the burden on other teams.
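To make this concrete, here is a minimal sketch of what a cleaning step might look like. The field names (customer_id, signup_date, country) and the sample rows are made up for illustration, and real pipelines usually lean on a library or a processing engine rather than plain Python.

```python
from datetime import date, datetime

# Hypothetical raw records, as they might arrive from an upstream system.
RAW_ROWS = [
    {"customer_id": " 42 ", "signup_date": "2023-01-15", "country": "us"},
    {"customer_id": "42", "signup_date": "2023-01-15", "country": "US"},  # duplicate
    {"customer_id": "", "signup_date": "2023-02-01", "country": "DE"},    # missing id
    {"customer_id": "7", "signup_date": "not a date", "country": "fr"},   # bad date
    {"customer_id": "8", "signup_date": "2023-03-09", "country": " gb "},
]


def clean_rows(rows):
    """Return de-duplicated, typed rows; skip rows that cannot be repaired."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        raw_id = row.get("customer_id", "").strip()
        if not raw_id.isdigit():
            continue  # no usable identifier
        try:
            signup = datetime.strptime(row["signup_date"], "%Y-%m-%d").date()
        except (KeyError, ValueError):
            continue  # missing or unparseable date
        customer_id = int(raw_id)
        if customer_id in seen_ids:
            continue  # keep only the first occurrence of each customer
        seen_ids.add(customer_id)
        cleaned.append({
            "customer_id": customer_id,
            "signup_date": signup,
            "country": row.get("country", "").strip().upper(),
        })
    return cleaned
```

Silently dropping bad rows is a choice I made to keep the sketch short. In practice you would probably count or quarantine rejected rows so that someone can investigate them.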
Secure The Data
If you have ever read the news, you know that data breaches happen often and can have terrible consequences. Hopefully, a security team is trying to protect all company systems, but data engineers play an important role in this since they work with the company’s most important data.
Automate
Data engineers are tasked with automating many things. At a minimum, a data engineer will be asked to take a manual process (report, query, model training, etc.) and automate it on a set schedule with all the monitoring and retry capabilities required. Ideally, they also follow the software development practices of automating code deployments and even infrastructure deployments.
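As a rough illustration of the "monitoring and retry" part, here is a small sketch of a retry wrapper with logging. The function name run_with_retries and its parameters are my own invention, and the schedule itself is assumed to come from somewhere else, such as cron or a workflow orchestrator, most of which ship their own retry settings.

```python
import logging
import time

logger = logging.getLogger("scheduled_job")  # hypothetical logger name


def run_with_retries(task, max_attempts=3, base_delay_seconds=1.0, sleep=time.sleep):
    """Run task(); on failure, log it, wait with exponential backoff, and retry.

    The last failure is re-raised so the scheduler can mark the run as failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            logger.info("task succeeded on attempt %d", attempt)
            return result
        except Exception:
            logger.exception("task failed on attempt %d of %d", attempt, max_attempts)
            if attempt == max_attempts:
                raise
            # Delays grow as base, 2 * base, 4 * base, ...
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

Passing sleep in as a parameter is only there to make the wrapper easy to test without actually waiting.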
Support Data Scientists / Analysts
It is common for data engineers to work closely with data scientists and analysts. Often the work includes providing reporting data sets or productionizing machine learning models. This relationship may be formal or informal. Regardless of the organization's structure, these roles will be focused on delivering together. Often the real business value is realized through the final deliverables of the data scientists and analysts.
DATA ENGINEER TASKS
Develop Data Pipelines
Data pipelines, often referred to as ETL (extract, transform, load), are processes for moving and transforming data. This could be stream processing or batch loads. For some data engineers, this is the core work they will do, while others are focused on enabling these capabilities.
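For readers who have never seen one, here is a toy batch pipeline using only the Python standard library. The CSV layout, the orders table, and the function names are assumptions for the sake of the example; a real pipeline would read from actual source systems and load into a warehouse or data lake.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; the third row is missing its amount.
SAMPLE_CSV = """order_id,amount,currency
1,19.99,usd
2,5.00,eur
3,,usd
"""


def extract(csv_text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows without an amount, convert to cents, normalize currency."""
    records = []
    for row in rows:
        if not row["amount"]:
            continue
        records.append((
            int(row["order_id"]),
            round(float(row["amount"]) * 100),
            row["currency"].upper(),
        ))
    return records


def load(conn, records):
    """Load: write records into the target table, replacing rows with the same key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL, "
        "currency TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, csv_text):
    load(conn, transform(extract(csv_text)))
```

Using INSERT OR REPLACE keyed on order_id means the batch can be rerun without creating duplicates, which matters once retries and backfills enter the picture.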
Build Data Platforms / Tools
Rather than building data pipelines, a data engineer may focus on developing a platform or specialized tools that enable others to move and query data effectively. This work requires a stronger focus on quality software development practices. This is not a task that every data engineer will do.
Define Table Schemas
Whether working with data lakes or data warehouses, the datasets to be used will have some sort of schema. Often data engineers are the ones determining the tables, columns, relationships, and indexing or partitioning for the data to be used for analytics.
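Here is a small sketch of what those decisions can look like, using SQLite only because it ships with Python. The table and column names (dim_customer, fact_order, and so on) are hypothetical. SQLite has no table partitioning, so the example shows an index on the date column instead; in a warehouse or data lake, that same column would be a typical partitioning candidate.

```python
import sqlite3

# Hypothetical analytics schema: one dimension table and one fact table.
SCHEMA = """
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    country     TEXT NOT NULL
);
CREATE TABLE fact_order (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    order_date   TEXT NOT NULL,      -- ISO 8601 date, e.g. 2023-01-15
    amount_cents INTEGER NOT NULL
);
-- Most analytical queries are assumed to filter by date.
CREATE INDEX idx_fact_order_date ON fact_order(order_date);
"""


def create_schema(conn):
    """Create the tables and index on an empty database."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
    conn.executescript(SCHEMA)
```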
Manage Data
Managing data size and monitoring that jobs are working effectively are other important tasks. The specifics vary depending on the technologies, but monitoring data size, archiving old data, updating statistics, and managing permissions may all be required.
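As one example of this kind of housekeeping, here is a sketch of archiving old rows. It assumes two hypothetical tables, events and events_archive, with identical columns and an ISO-formatted event_date text column. How archiving really works depends heavily on the storage technology.

```python
import sqlite3


def archive_old_events(conn, cutoff_date):
    """Move events dated before cutoff_date (ISO string) into events_archive.

    Returns the number of rows moved. Both statements run in one transaction,
    so a failure part-way through leaves the tables unchanged.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events_archive SELECT * FROM events WHERE event_date < ?",
            (cutoff_date,),
        )
        cursor = conn.execute(
            "DELETE FROM events WHERE event_date < ?",
            (cutoff_date,),
        )
    return cursor.rowcount
```

A job like this would normally run on a schedule, with the cutoff derived from a retention policy rather than hard-coded.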
Productionize Data Science Models
Data scientists may rely on data engineers to take the models they have developed and rewrite them to be more stable for production. This can range from writing featurization code to model deployment to collecting results for evaluating model accuracy.
WHAT DO YOU THINK?
Defining a data engineer is a broad topic, and I intentionally kept this list focused. These tasks and responsibilities are based on my experience hiring data engineers and reviewing relevant job descriptions. To expand on the topic in the future, I will write about the skills and traits of a successful data engineer. I invite others to contribute to this discussion and share their own perspectives.