As a data engineer, you should not be trying to convince your colleagues that everything can be a scheduled batch job. It’s time to learn how to build streaming data pipelines. For many data engineers, Apache Kafka is the go-to platform for enabling real-time data pipelines. Let’s quickly cover why, and how to get started.
When building a streaming architecture, we need a reliable place to collect events and allow downstream applications to consume them. We need a scalable message broker. Producing applications publish data without knowing the details of what happens on the other end. Multiple consumers can read that data from the message broker without being aware of one another. In addition, it’s helpful if the message broker persists that data so consumers can replay messages as needed.
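The decoupling described above can be sketched with a toy in-memory broker (plain Python, not real Kafka): producers append to a persistent log, and each consumer tracks its own read position without coordinating with any other consumer. The class and topic names here are made up for illustration.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory message broker: producers append, consumers
    read independently, and messages are retained for replay."""

    def __init__(self):
        self._log = defaultdict(list)       # topic -> ordered list of messages
        self._positions = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        # The producer knows nothing about who will consume this.
        self._log[topic].append(message)

    def poll(self, topic, consumer):
        # Each consumer tracks its own position, unaware of other consumers.
        pos = self._positions[(topic, consumer)]
        batch = self._log[topic][pos:]
        self._positions[(topic, consumer)] = len(self._log[topic])
        return batch

broker = MiniBroker()
broker.publish("page_views", {"user": "a", "url": "/home"})
broker.publish("page_views", {"user": "b", "url": "/cart"})

# Two consumers read the same retained data without coordinating.
analytics = broker.poll("page_views", "analytics")
warehouse = broker.poll("page_views", "warehouse")
assert analytics == warehouse and len(analytics) == 2
```

The point of the sketch is the shape of the contract, not the implementation: publish and poll never reference each other's callers, which is what lets you add a new consumer later without touching the producer.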
Apache Kafka is a popular open source platform for streaming data that meets these criteria. It can keep up with a high volume of messages from producers and scale as data needs grow. Produced data is persisted to disk and replicated across nodes for redundancy. Kafka is often described as a distributed log: events arrive and are stored in order within each partition. A consumer can replay that data into a data store as inserts, updates, and deletes, and another consumer could read the same history and populate its own data store with identical results, since order is maintained.
Apache Kafka has a ton of capabilities and configurations, but only a few concepts are important for a developer to get started.
- Topic => feed where records are published
- Each entity or event type typically gets its own topic, similar to how we group records into tables in a database. Often every record within a topic has the same schema. Examples of topics might be User, PageView, or SalesLine.
- Partition => segment of data within a topic; enables parallel consumption
- The producer specifies a partition key, which ensures that records with the same key go to the same partition. This preserves the order in which the data was produced. Common partition keys are user ID or customer ID.
- Offset => record position within a partition
- Consumers keep track of which offsets they have read so they can retrieve the next record. To replay data, you specify an earlier offset and the application reads in order from that starting point.
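How these three concepts fit together can be shown in a minimal sketch (again plain Python rather than real Kafka; Python's built-in `hash()` stands in for Kafka's murmur2 partitioner, and the names are illustrative):

```python
class MiniTopic:
    """Toy model of a topic: records with the same key land in the
    same partition, and each record gets an offset within its partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        # (Real Kafka hashes the key with murmur2; hash() is a stand-in.)
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        offset = len(self.partitions[p]) - 1  # record position within the partition
        return p, offset

    def read(self, partition, start_offset):
        # A consumer replays records in order from any starting offset.
        return self.partitions[partition][start_offset:]

topic = MiniTopic(num_partitions=4)
p1, _ = topic.produce("user-42", "login")
p2, _ = topic.produce("user-42", "add_to_cart")
assert p1 == p2                                   # same key, same partition
assert topic.read(p1, 0) == ["login", "add_to_cart"]
```

Note that ordering is only guaranteed within a partition, which is exactly why choosing a good partition key (such as user ID) matters: all events for one user stay in sequence.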
Alternatives to Kafka
Each cloud provider has its own service that works like Apache Kafka. For Azure, you could use Event Hubs, which even has the option to use the Kafka APIs in your application. AWS has Kinesis to provide this functionality, or Managed Streaming for Apache Kafka (MSK) to run the real thing. Google Cloud offers its Pub/Sub service for similar workloads. And Confluent, the company founded by several of the Kafka creators, has a full ecosystem for working with Kafka, either in their managed cloud service or on-premises.
I have worked with Apache Kafka quite a bit and have confidence in its use for production systems. I have also used Azure Event Hubs for real projects, primarily consuming data with Apache Spark and the Kafka API. I have not had to design data pipelines that use Event Hubs any differently than I would with Kafka, which I have been very happy about. It is a different platform and code base, though, so run your own proof of concept if you are considering it for a streaming project.
Where to learn more
There is a lot to know about Apache Kafka, especially if you will be installing and maintaining the cluster. I encourage smaller companies to look at managed offerings first, such as Confluent Cloud or one of the cloud provider alternatives. Following the instructions for a Confluent Docker setup is often a good way to get started. If getting Docker configured does not go smoothly, you can install Apache Kafka locally on a Mac or Linux environment.
Below are a few links for learning more and getting started.