Redis streams vs. Kafka

How to implement Kafka-like semantics on top of Redis streams

24 August 2020

Kafka is known for solving large-scale data processing problems and has been widely deployed in the infrastructure of many well-known companies. Back in 2015, LinkedIn had 60 clusters with a total of 1100 brokers processing 13 million messages per second.

But it turns out that scale is not the only thing Kafka is good at. The programming paradigm it promotes — partitioned, ordered, event processing — is a good solution to many problems you are likely to face. For example, if events represent rows to be indexed into a search database, it's important that the last modification is the final one indexed, otherwise searches will return stale data indefinitely. Similarly, if events represent user actions, processing the second one ('user upgraded account') might rely on the first ('user created account').

This paradigm is different to traditional job queue systems, in which events are popped off a queue by many workers simultaneously, which is simple and scalable, but which destroys any ordering guarantees.

Let's say you want ordered processing, but perhaps you don't want to use Kafka due its reputation as a heavy-duty system that's difficult or expensive to operate. How does Redis compare, now that it has the 'stream' data structure, released in version 5.0? Does it solve the same problem?

The Kafka architecture

Let's start with the basic architecture found in Kafka. The fundamental data structure is the topic. It's an append-only sequence of records ordered by time. The benefits of using this data structure are well-described in the now-classic blog post The Log, by Jay Kreps.

Topics are partitioned to enable them to scale: each one could be hosted on separate Kafka instances. The records in each partition are assigned sequential ids, known as offsets, that uniquely identify each record within the partition. A consumer processes records sequentially, keeping track of its last-seen offset. Since records are persisted in a topic, multiple consumers can process records independently of one another.

In practice, you will likely distribute your processing across many machines. To enable this, Kafka offers a 'consumer group' abstraction, which is a set of processes that cooperate to consume data from a topic. A topic's partitions are divided among the members of the group. Then, when members join or leave the group, the partitions must be reassigned so that each member receives a fair share of the partitions. This is known as the rebalancing algorithm.

Note that a single partition is only ever processed by a single member of the consumer group. (But a single member may be responsible for multiple partitions.) This enables the strictly ordered processing guarantee.

This set of tools is very useful. You can scale your processing easily by adding more workers and Kafka takes care of the distributed coordination problems.

The Redis stream data structure

How does the Redis 'stream' data structure compare? A Redis stream is conceptually equivalent to a single partition of a Kafka topic described above, with small differences:

It is a persistent, ordered store of events (same as in Kafka)
It has a configurable maximum length (vs. a retention period in Kafka)
Events store keys and values, like a Redis Hash (vs. a single key and value in Kafka)

The major difference is that consumer groups in Redis are nothing like consumer groups in Kafka.

In Redis, a consumer group is a set of processes all reading from the same stream. Redis ensures that events will only be delivered to one consumer in the group. For example, in the diagram below, Consumer 1 will not process '9' next — it will skip over it since Consumer 2 has already seen it. Consumer 1 will be given the next event that has not been seen by any other group members.

The group acts as a way to parallelise the processing of a single stream. If you're thinking that looks a lot like a traditional job queue architecture, you're correct. It therefore loses the ordering guarantee that is central to stream processing, which is rather unfortunate.

Stream processing as a client library

So how can we build a stream processing engine on top of Redis if it only offers effectively a single partition of a topic with job-queue semantics? Well, if you want Kafka's features, you need to build them yourself. This means implementing:

1. Event partitioning. You'll need to create N streams and consider each one a partition. Then, at send time, you need to decide which partition should receive it, presumbly based on a hash of your event or one of its fields.

2. A worker-partition assignment system. To scale out and support multiple workers, you'll need to create an algorithm to distribute partitions amongst them, ensuring that each worker owns a mutually exclusive subset of them, i.e. an equivalent to Kafka's "rebalance" system.

3. In-order processing with acknowledgement. Each worker needs to iterate over each of its partitions, keeping track of its offsets. Even though Redis consumer groups have job-queue semantics, they can help here. The trick is to use a single consumer per group and then create a single group per partition. Then every partition will be processed in-order and you can take advantage of the built-in consumer group state tracking: Redis can track not only offsets, but per-event acknowledgement, which is quite powerful.

This is the absolute minimum required. If you want your solution to be robust, you'll probably also want to think about error handling: in addition to crashing your worker, maybe you'll want a mechanism to forward errors to a "dead-letter" stream and continue processing.

The good news — if you're into Python — is that I've solved these problems and more in a newly released library called Runnel. You are welcome to check it out if you want to benefit from Kafka-like semantics on top of Redis. Here's how it looks, basically identical to one of the Kafka diagrams above:

Workers coordinate their ownership of partitions via locks implemented in Redis. They communicate with each other via a special 'control' stream. For more information, including a detailed breakdown of the architecture and rebalance algorithm, see the Runnel docs.

Trade-offs

Is Redis a good choice for large-scale event processing? There's a fundamental trade-off: since everything is in-memory, you gain unparalleled processing speed, but it's not suitable for storing unbounded amounts of data. With Kafka you might be willing to persist all your events indefinitely¹, but with Redis you are certainly going to store a fixed window of recent events — only enough for your processors to have a comfortable buffer in case they slow down or crash. This means you'll probably also want to use an external long-term event store such as S3, for example to be able to replay them, which adds complexity to your architecture but reduces cost.

My main motivation for working on this problem was the ease-of-use and low cost involved in deploying and operating Redis. That's why it's attractive vs Kafka. It's also a fantastic set of tools that has stood the test of time magnificently. It turns out that with some effort, it can support the distributed stream processing paradigm too, so check out Runnel if you're interested.

Confluent has promoted this idea since they want Kafka to be a central source of truth, but in practice it's extremely expensive vs. using S3 for long-term storage. ↩