Kafka vs. Pulsar
In this ever-changing era, access to data opens up new opportunities and product categories for businesses and gives them an edge over their competitors. As a result, enterprises are leveraging their data for insights that provide such advantages, and new technologies play a vital role in making this possible.
In this article, we compare two similar yet different technologies, Apache Kafka and Apache Pulsar, by looking at their core architecture, performance comparisons, and use cases.
Understanding Kafka at the Architectural Level
Apache Kafka acts as a messenger, sending messages from one application to another. Messages sent by a producer (sender) are grouped into a topic, which a consumer (subscriber) subscribes to and reads as a stream of data.
Topics are further divided into smaller units called partitions. The producer appends data to a particular partition of a topic, and each message is assigned an offset, which the consumer uses to read the messages in order. A Kafka cluster consists of one or more servers called brokers. Partitions are replicated across multiple brokers for fault tolerance, and if the broker leading a partition fails, ZooKeeper is currently used to elect a replacement.
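The relationship between topics, partitions, and offsets can be sketched with a small in-memory model. This is only an illustration, not Kafka's implementation: the `Topic` class and its methods are hypothetical names, and the CRC32-based partitioner is an assumed stand-in for the partitioner a real client would use.

```python
from zlib import crc32


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append a message and return its (partition, offset)."""
        # Assumed partitioner: hash of the key modulo the partition count,
        # so messages with the same key always land in the same partition.
        partition = crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # offset = position in the log

    def read(self, partition, offset):
        """Return the messages of one partition from `offset` on, in order."""
        return self.partitions[partition][offset:]


topic = Topic("orders", num_partitions=3)
p, first = topic.send("user-1", "created")
_, second = topic.send("user-1", "paid")
print(p, first, second, topic.read(p, first))
```

Ordering is only guaranteed within a partition, which is why the sketch routes every message with the same key to the same partition.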
Understanding Pulsar at the Architectural Level
Apache Pulsar decouples its message delivery layer from its storage layer with a multi-layered architecture, so the two can scale independently. This architecture has attracted attention from a number of cloud solutions. Pulsar has a tier of stateless broker instances that connect to a separate tier of BookKeeper nodes (bookies), which handle reading, writing, storing, and replicating data. This decoupling enables each layer to scale independently and to use the auto-scaling ability of elastic environments to adapt dynamically to spikes in incoming traffic.
A Pulsar instance consists of one or more Pulsar clusters, and the clusters within an instance can replicate data among themselves.
A Pulsar cluster consists of brokers, bookies, and ZooKeeper. A Pulsar broker is a stateless component responsible for administrative tasks and topic lookups. It also includes a dispatcher, an asynchronous TCP server that transfers data over a custom binary protocol. Bookies are used for persistent storage: Apache BookKeeper is a distributed write-ahead log (WAL) that Pulsar uses to persist messages. It plays a crucial role in the Pulsar architecture by letting Pulsar use multiple independent logs, called ledgers, by providing storage for sequential data that handles entry replication, and by guaranteeing read consistency for ledgers. Bookies are designed to handle many ledgers with concurrent reads and writes. By using separate disk devices for the journal and for general storage, bookies can isolate the effects of read operations from the latency of ongoing write operations.
Each bookie keeps journal files that contain BookKeeper transaction logs. A new journal file is created when a bookie starts or when the current journal file exceeds a size threshold.
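The journal rollover rule described above can be sketched as follows. This is a simplified model under stated assumptions, not BookKeeper code: the `Journal` class, its `max_bytes` parameter, and the in-memory "files" are all hypothetical.

```python
class Journal:
    """Toy journal: a list of files, each a list of transaction entries."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # assumed size threshold per journal file
        self.files = [[]]           # a new journal file is created on start
        self._current_size = 0

    def append(self, entry: bytes) -> int:
        """Append an entry and return the index of the file it went to."""
        # Roll over once the current file has reached the threshold.
        if self._current_size >= self.max_bytes:
            self.files.append([])
            self._current_size = 0
        self.files[-1].append(entry)
        self._current_size += len(entry)
        return len(self.files) - 1


journal = Journal(max_bytes=10)
print([journal.append(b"abcdef") for _ in range(3)])
```

With a 10-byte threshold and 6-byte entries, the first two entries share a file and the third starts a new one, because the first file has passed the threshold by then.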
Comparing Apache Kafka to Pulsar
Architectural Comparison
Apache Kafka and Pulsar share a similar messaging model. The user interacts with the system through topics, which are split into smaller groups known as partitions. Partitions distribute data across nodes so that multiple consumers can read in parallel, which allows consumption to scale.
The fundamental difference between Pulsar and Kafka is their architectural approach. Kafka follows a partition-centric design with a monolithic architecture. Pulsar follows a multi-layered architecture with a segment-centric design.
In Kafka, a partition is stored directly on its leader node and then replicated to replica nodes for fault tolerance. This design has drawbacks. First, since a partition must be stored on a local disk, its maximum size is limited by the size of that disk. Second, since the data must also fit on every replica node, the partition can grow no larger than the smallest disk among them. Once that capacity is filled, incoming messages are halted until space has been freed, which can lead to data loss. At that point, one option is to delete messages, which may include messages that have not been consumed yet. The other option is to rebalance by recopying the data, which is not an optimal solution because the entire partition is offline in the meantime, and recopying is also not very fault-tolerant.
With the segment-centric approach followed by Pulsar, partitions are subdivided into segments that are rolled over based on a configured time or size and distributed evenly across bookies. This helps with redundancy and scaling. In this case, when storage is maxed out, there is no need to recopy content because the data is already segmented. You only need to add a new bookie. Until then, Pulsar keeps writing message segments to the bookies that still have space, and once the new bookie has been added, the load shifts toward it.
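The capacity difference between the two designs can be made concrete with a short sketch. It is an illustration under assumptions, not how either system is implemented: both function names are hypothetical, and the "most free space first" placement policy is a simplification chosen for the example.

```python
def max_partition_size_gb(replica_disks_gb):
    """Partition-centric: every replica holds the whole partition,
    so the smallest disk among the replicas caps its size."""
    return min(replica_disks_gb)


def place_segment(free_gb, segment_gb, replicas):
    """Segment-centric: store one segment on `replicas` bookies that
    still have room, preferring the bookies with the most free space
    (an assumed policy). Mutates `free_gb` and returns the chosen bookies."""
    candidates = sorted(
        (bookie for bookie, free in free_gb.items() if free >= segment_gb),
        key=lambda bookie: (-free_gb[bookie], bookie),
    )
    if len(candidates) < replicas:
        raise RuntimeError("not enough bookies with free space")
    chosen = candidates[:replicas]
    for bookie in chosen:
        free_gb[bookie] -= segment_gb
    return chosen


print(max_partition_size_gb([500, 200, 1000]))

free = {"bookie-1": 2, "bookie-2": 1}
print(place_segment(free, segment_gb=1, replicas=2))
free["bookie-3"] = 10  # adding a bookie; existing segments are not moved
print(place_segment(free, segment_gb=1, replicas=2))
```

In the usage lines, adding `bookie-3` makes room for new segments without copying any segment that was already written, which is the point the segment-centric design relies on.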
Kafka (KIP-500) – Kafka without Zookeeper
Currently, Kafka uses ZooKeeper to store its metadata, but KIP-500 proposes removing it from the architecture, which is expected to benefit Kafka considerably. We can think of it as Kafka on Kafka: the metadata will be stored inside Kafka itself, removing the need to manage an external system and reducing the complexity of the overall architecture. On failover, there is no need to wait for a newly elected controller to load its state, because a standby controller that already holds the state can be selected. This will also reduce the cost of topic creation and deletion from O(n) to O(1), thus increasing the write throughput.
General Comparison
Features | Kafka | Pulsar |
License | Apache V2 | Apache V2 |
Components | Kafka + ZooKeeper | Pulsar + BookKeeper + ZooKeeper + RocksDB |
Message Consumption Model | Pull | Push |
Storage Architecture | Log | Index |
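The difference between the pull and push consumption models in the table can be sketched with two toy classes. Both classes and their method names are hypothetical and only model the control flow, not either system's client API.

```python
class PullLog:
    """Pull model: the consumer asks for records starting at an offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)

    def poll(self, offset, max_records=10):
        # The consumer decides when to fetch and how much to take.
        return self.messages[offset:offset + max_records]


class PushTopic:
    """Push model: the broker dispatches each message to subscribers."""

    def __init__(self):
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def publish(self, message):
        # The broker decides when the consumer receives the message.
        for callback in self.listeners:
            callback(message)


log = PullLog()
for m in ("a", "b", "c"):
    log.append(m)
print(log.poll(offset=0, max_records=2), log.poll(offset=2))

received = []
topic = PushTopic()
topic.subscribe(received.append)
topic.publish("x")
print(received)
```

In the pull sketch the consumer tracks its own offset and controls the batch size, while in the push sketch delivery is driven by the publishing side.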
Throughput and Latency Comparison from Vendors/Contributors
Features | Kafka | Pulsar |
Peak Throughput | 605 MB/s | 305 MB/s |
Latency | 5 ms (at 200 MB/s load) | 25 ms (at 200 MB/s load) |
Level Wise Comparison
Level | Replication Durability | Local Durability | Operation |
1 | Sync | Sync | The system returns a write response to the client ONLY AFTER the data has been replicated to multiple (at least the majority of) locations AND each replica has been successfully fsync-ed to local disks. |
2 | Sync | Async | The system returns a write response to the client ONLY AFTER the data has been replicated to multiple (at least the majority of) locations. There is no guarantee that each replica has been successfully fsync-ed to local disks. |
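The two durability levels in the table can be modeled with a small simulation. This is a sketch under assumptions, not the write path of Kafka or Pulsar: the `Replica` class, the `publish` function, and the split into a page cache and a disk list are hypothetical simplifications.

```python
class Replica:
    """Toy replica with a volatile page cache and a durable disk."""

    def __init__(self, up=True):
        self.up = up
        self.page_cache = []  # written but not yet fsync-ed
        self.disk = []        # survives a crash

    def write(self, record, fsync):
        self.page_cache.append(record)
        if fsync:
            self.flush()

    def flush(self):
        self.disk.extend(self.page_cache)
        self.page_cache.clear()


def publish(record, replicas, local_sync):
    """Acknowledge (return True) only if a majority of replicas took the write.

    local_sync=True models Level 1 (each replica fsyncs before confirming);
    local_sync=False models Level 2 (replicated, but possibly only in cache).
    """
    confirmations = 0
    for replica in replicas:
        if replica.up:
            replica.write(record, fsync=local_sync)
            confirmations += 1
    return confirmations >= len(replicas) // 2 + 1


level1 = [Replica(), Replica(), Replica()]
level2 = [Replica(), Replica(), Replica()]
print(publish("m", level1, local_sync=True), [r.disk for r in level1])
print(publish("m", level2, local_sync=False), [r.disk for r in level2])
```

Both calls are acknowledged, but only the Level 1 replicas have the record on disk at that moment. At Level 2 the record would be lost if every replica crashed before flushing.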
Durability Level and Partitions Comparison
Metric | Durability Level | Partitions | Pulsar | Kafka |
Peak Publish + Tailing Reads Throughput | Level-1 | 1 | 300 MB/s | 160 MB/s |
Peak Publish + Tailing Reads Throughput | Level-1 | 100 | 300 MB/s | 420 MB/s |
Peak Publish + Tailing Reads Throughput | Level-1 | 2000 | 300 MB/s | 300 MB/s |
Peak Publish + Tailing Reads Throughput | Level-2 | 1 | 300 MB/s | 180 MB/s |
Peak Publish + Tailing Reads Throughput | Level-2 | 100 | 605 MB/s | 605 MB/s |
Peak Publish + Tailing Reads Throughput | Level-2 | 2000 | 605 MB/s | 300 MB/s |
Peak Catch-up Reads Throughput | Level-1 | 100 | 1.7 GB/s | 1 GB/s |
Peak Catch-up Reads Throughput | Level-2 | 100 | 3.5 GB/s | 1 GB/s |
Subscription, Local, and Replication Durability Comparison
End-to-end P99 latency in milliseconds (publish + tailing reads):
Partitions & Subscriptions | Local Durability | Replication Durability | Pulsar | Kafka |
100 Partitions, 1 Subscription | Sync | Ack-1 | 5.86 | 18.75 |
100 Partitions, 1 Subscription | Sync | Ack-2 | 11.64 | 64.62 |
100 Partitions, 1 Subscription | Async | Ack-1 | 5.33 | 6.94 |
100 Partitions, 1 Subscription | Async | Ack-2 | 5.55 | 10.43 |
100 Partitions, 10 Subscriptions | Sync | Ack-1 | 7.12 | 145.10 |
100 Partitions, 10 Subscriptions | Sync | Ack-2 | 14.65 | 1599.79 |
100 Partitions, 10 Subscriptions | Async | Ack-1 | 6.84 | 89.80 |
100 Partitions, 10 Subscriptions | Async | Ack-2 | 6.94 | 1295.78 |
Apache Kafka and Pulsar Pros and Cons
Pulsar – Pros
- Apache Pulsar is rich in features such as multi-tenancy and multi-datacenter (multi-DC) replication.
- It has a more flexible client API.
- Client components (Java) are thread-safe, and consumers can acknowledge messages from different threads.
- It can scale elastically.
Pulsar – Cons
- Java clients are not very well documented.
- It is still becoming established, so its community is smaller.
- The message ID concept is tied to BookKeeper, so consumers may not be able to read data sequentially.
- To read the last message on a topic, readers need to scan through the messages to the end.
- No transactions.
- With more components in its tech stack, it has higher complexity.
Kafka – Pros
- Very well documented.
- Kafka has matured over the years and so has a larger community.
- It has fewer components, so it is less complex and easier to deploy.
- Transactions: atomic reads and writes within topics.
- Offsets form a continuous sequence, so a consumer can quickly seek to the last message.
Kafka – Cons
- A consumer cannot acknowledge a message from a different thread.
- It does not provide a multi-tenancy model.
- Robust Multi-DC replication is offered in Confluent Enterprise only.
- It is not designed to scale elastically.
Apache Kafka and Pulsar Use Cases
Pulsar Use Cases
- Pulsar is ambidextrous: it supports both application messaging (traditional queuing systems) and data pipelines. Traditional queuing systems are used to enable asynchronous communication, whereas data pipelines are used to transfer high volumes of data. Pulsar's core technology makes it possible to deploy both, providing a platform with unified messaging capabilities. Hence an organization can manage one system instead of several.
- Pulsar can be used to process large volumes of data in both batch and stream processing modes. An organization that wants to run both on its dataset avoids the challenge of maintaining multiple systems, which can reduce cost.
- With Pulsar, it is easy to implement complex microservice techniques like event sourcing, where applications publish the streams of events they produce to a shared messaging system that keeps the history in a centralized log, improving data flow and app synchronization. While Kafka can store streams of events for days, event sourcing requires longer retention, and this is something Pulsar handles very well.
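The event sourcing pattern mentioned in the last item can be sketched in a few lines. The account example, the event names, and the `apply_event` and `replay` functions are hypothetical, and a plain Python list stands in for the retained log in the messaging system.

```python
def apply_event(balance, event):
    """Apply one event to the current state and return the new state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events, initial=0):
    """Rebuild the state by replaying the event history from the start."""
    balance = initial
    for event in events:
        balance = apply_event(balance, event)
    return balance


# Stand-in for the centralized, retained log of events.
event_log = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
print(replay(event_log))      # current state
print(replay(event_log[:2]))  # state as of the second event
```

Because the state is derived only from the log, any past state can be rebuilt by replaying a prefix of it, which is why the pattern depends on the log being retained for as long as the history is needed.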
Kafka Use Cases
- Kafka is one of the best open-source streaming platforms for providing near real-time solutions, and with KSQL one can also serve preprocessed data in near real time, reducing the overall latency.
- Kafka is well suited to activity tracking, where the same activity stream can be read by many consumers.
Conclusion
To conclude, both Apache Kafka and Pulsar are powerful stream processing platforms that have evolved to provide a variety of services. Each has different capabilities and an architecture suited to different cases. Since use cases differ, it is up to you to test them accordingly and choose the best configuration.