Introduction to Apache Kafka
In a data-driven approach, business decisions are based on data, not intuitions. Hence, data-driven organizations use this strategic process to leverage insights from data and identify business opportunities to improve customer service and increase sales.
As everything revolves around data, transmitting data from one place to another is challenging, and doing this with real-time data is even more challenging. So to solve this problem, there comes a perfect distributed data streaming platform called Kafka. Next, the question arises: What is it, and how does it work? Let’s look into it.
The architecture of Apache Kafka
1. Kafka API Architecture
Apache Kafka’s architecture revolves around four core APIs: Producer, Consumer, Streams, and Connector. Here’s a breakdown of each:
- Producer API: This API allows an application to send a stream of data to one or more Kafka topics.
- Consumer API: This API enables applications to subscribe to one or more topics and manage the stream of data sent to them.
- Streams API: This API allows an application to act as a stream processor. It consumes input streams from one or more topics, processes them, and then produces output streams to output topics.
- Connector API: This API provides the capability to create and run reusable producers and consumers that connect Kafka topics with other applications or data systems. For example, it can link to relational databases to track every change made to tables.
2. Kafka Cluster Architecture
Let’s take a closer look at the key components of a Kafka cluster:
- Kafka Brokers: A broker is a server that is part of a Kafka cluster. A cluster is typically formed by multiple brokers working together to provide load balancing, redundancy, and failover. Brokers use Apache ZooKeeper for cluster management and coordination. Each broker can handle large volumes of read and write operations (up to tens of thousands per second) without compromising performance. Each broker has a unique ID and manages partitions for one or more topic logs. They also participate in leader elections through ZooKeeper, which determines the broker responsible for handling client requests for specific topic partitions.
- Kafka ZooKeeper: Kafka brokers use ZooKeeper to manage and coordinate the Kafka cluster. ZooKeeper informs all nodes when there are changes in the cluster, such as a new broker joining or an existing broker failing. It also helps in leader elections among brokers and topic partitions, determining which broker will lead for each partition and which brokers will hold replicas of the data.
- Kafka Producers: In Kafka, a producer is responsible for sending records or messages to a specified topic. Producers can also decide which partition a specific record or message should be sent to, providing additional scalability. If no specific partition is defined, topics can be balanced in a round-robin manner.
- Kafka Consumers: Since Kafka brokers are stateless, the consumer tracks how many messages it has consumed from a partition by maintaining an offset. Once a consumer acknowledges a certain offset, all previous messages are considered consumed. Consumers send asynchronous pull requests to brokers to get a buffer of bytes ready for consumption. By providing an offset value, consumers can fast-forward to any point within a partition. ZooKeeper informs consumers about the offset value.
Who are the producers of it?
A producer is a kind of client responsible for publishing records to it. These Producers are thread-safe, i.e., they have a pool of buffers that holds the to-be-sent records. These buffers are available immediately and sent as fast as brokers can handle them. The records sent by producers are serialized into an array of bytes before being sent. When a producer sends a message, the broker has three types of acknowledgements, which you can configure, and by default, ack = ‘all.’
- ACKS 0 (ALL): The producer doesn’t wait for the broker to acknowledge. The messages added to the buffer are considered to be sent.
- ACKS 1 (Leader): The producer sends the record to it and waits for acknowledgement by the leader partition but not by the follower partition.
- ACKS -1 (ALL): The producer sends the record to it and waits for the acknowledgement by the leader and follower partitions, hence providing full fault tolerance.
Who are the consumers of it?
A consumer is a client who is responsible for subscribing to a specific topic and starting to receive records from it. When the speed at which the producer produces the message exceeds the speed at which the consumer consumes it, the Consumer group is formed.
Consumer Group refers to the group of consumers who subscribed to the same topic. Like multiple producers can send data to a single topic, multiple consumers can receive data from the same topic by dividing the partitions between them.
Partitions and Replications
Its records or messages are stored in Topics, further divided into Partitions. The actual data is stored in these partitions, and the main advantage these partitions provide is parallelism. Internally, the working of partitions depends on the key: null key or hash key. The data will be sent to any partition if the key is null. But when a specific hash key is provided, the data will be sent to a specific partition.
Replication
Replication is the concept that ensures reliability and availability when a Kafka node eventually fails. As per the name, replication is having multiple copies of data. In it, the replication occurs at partition granularity, so every partition has multiple replicas. Among these replicas is one Leader Partition, and the rest are Follower Partitions.
- Leader Partition: The role of leader partition is to get the messages from the producer and send them to the consumer when requested. Each partition can only have one leader at a time.
- Follower Partition: The replicas of a partition that are not leaders are all follower partitions. They do not handle client requests; their only job is to replicate data to themselves as soon as it arrives at the leader partition.
Apache Kafka is used for Real-time Data Streaming.
Real-time streaming refers to quickly processing data so firms can respond to changing conditions in real-time. It is widely used for real-time streaming. As we have read above, it has producer APIs, and here, the producer can be a web host, web server, IoT device, and many other sources that send big data continuously. Then, the consumers or spark streaming listen to these topics and consume the data reliably. Let’s have a basic understanding of how real data streams in it.
Extraction of data into Kafka
The first step is the extraction of data to it. Our primary goal is to get data for our application. It provides its connectors run with the Kafka connect framework for copying the data from multiple sources. It connects stream data to other data storage devices. It is a data hub that can ingest databases into it, making data available at low latency for streaming. Readily available connectors, such as JDBC can be used as source connectors. Suppose a case where we have a data source as SQL lite. The JDBC connector can be used to ingest data into it. So, the JDBC will process the records individually, creating the stream.
Transformation of the data using Kafka Streams API
In the previous phase, as we have written the data from the source to its topic, multiple applications can now read data from these topics, which will not be enriched. We can apply transformations and aggregations to the data using its stream APIs for real-time processing. We can transform one message at a time or apply filters based on the conditions. We can also apply to join operations, aggregations, or windowing operations on the records. Once done, we push this enriched data to its topic so other sources can consume it.
Downstream of the data
Till now, we have fetched data from sources and applied transformations to it, and now comes the last step, i.e., downstream or sinking the data. There can be multiple target sinks; to handle them, we can use Kafka to connect with multiple connectors so that any number of systems can receive the data. For example, an Amazon S3 bucket can act as a sink source, connect the output topic with the sink connector, and run it using its connect, and we will have our data in the required form.
What are the Use Cases for Apache Kafka?
- Messaging: It is a real-time data streaming application. It can be used for messaging systems. It has better fault tolerance, throughput, built-in partitioning, and replication than other messaging systems. All these factors make it an excellent alternative to large-scale message-processing applications.
- Log Aggregation: It can be used for log aggregation. From servers, collecting physical log files and putting them to a file server or HDFS for processing is called Log Aggregation. These logs are sent to it, and it returns a cleaner abstraction of these logs or event data as a message by applying the transformations at very low latency.
- Metrics: It is often used for operational monitoring data as it aggregates statistics from distributed applications.
- Commit Logs: It can serve as an external commit log for a distributed system. The logs stored in it are replicated, providing complete fault tolerance, which helps to keep vital data records such as bank transaction records.
Optimizations for Real-time Data with Kafka
It is already optimized out of the box. Still, there are some ways by which you can improve the performance of its clusters. Two main data metrics for Kafka to consider are:
- Throughput: The total number of messages arriving in a given time.
- Latency: the amount of time it takes to process each message.
It seeks to optimize both throughput and latency, which optimizes it. However, some tunings can be made based on the type of job or application we are working on.
Tuning the Brokers
As we know, a topic has partitions, and increasing these partitions results in increased parallelism while producing and consuming the message. Hence, an improved throughput can be achieved.
Tuning the Producers
Producer runs in 2 different modes:
- Synchronous Mode
- Asynchronous Mode
In synchronous mode, a request is sent to the broker as soon as the publisher publishes a message. This means that if there are 1000 messages per second, 1000 requests will be produced, resulting in decreased throughput. So, when sending many messages, choose the asynchronous mode.
Compressing the messages
Compressing the message results in improved latency, as the message size decreases, and small packets can be sent at high speed. However, Kafka messages are not compressed by default and need custom configurations.
Tuning the Consumers
- Consumers receive the records in batches, and if we set the batch size for pulling the messages too high, it may take a lot of time to process each one, decreasing the throughput. Similarly, polling the broker for a single message every time, causing too many broker requests, decreases the throughput.
- Having more partitions and consumers indeed increases parallelism, but as the number of consumers increases, the number of offset commit requests also increases. These requests lead to an increased load on brokers, resulting in low throughput.
Considering the real-world scenarios having multiple sources and targets for ingesting data while supporting variable schema that evolves, it carries a lot of overhead. It is a multi-step process with complex transformations requiring total durability and fault tolerance. It provides the perfect architecture to do so. You can flexibly build streaming ETL pipelines and avoid messy processes. In this blog, we learned that. Its Connectors run the corresponding framework that can help to load and sink data from it to any system or database. With the help of Kafka stream API, you can easily apply transformations and aggregations, Join multiple data sources and filter data.