Introduction
Recently, we published a blog post that previewed Dragonfly Cluster[^1], our horizontally scalable solution designed to handle large in-memory workloads when a single 1TB Dragonfly instance is not enough for you. If you're familiar with Redis Cluster, you might notice that, at first glance, the architecture appears quite similar. Both systems share certain limitations as well: commands like `SELECT` are restricted, and multi-key operations are only permitted when all the keys involved are located in the same hash slot.
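To make the hash slot restriction concrete, here is a minimal Python sketch of the slot-mapping algorithm described in the Redis Cluster specification (CRC16 of the key, or of its `{...}` hash tag if present, modulo 16,384). The key names below are just examples:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of 16,384 slots; only the {hash tag} is hashed if present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag between the braces
            key = key[start + 1 : end]
    return crc16(key.encode()) % 16384

# Plain keys usually land in different slots, so a multi-key command
# such as MSET across them is rejected with a CROSSSLOT error:
print(key_hash_slot("user:1:name"), key_hash_slot("user:2:name"))

# Hash tags force keys into the same slot, making multi-key operations legal:
print(key_hash_slot("{user:1}:name"), key_hash_slot("{user:1}:email"))
```

Keys that share a hash tag are guaranteed to land in the same slot, which is the standard way to make multi-key operations work in both systems.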
When building Dragonfly Cluster, we recognized the value of borrowing proven ideas from existing solutions like Redis Cluster. However, we also understand that changes without purpose lead to unnecessary complexity. With Dragonfly Cluster, our focus is on overcoming the key architectural limitations of Redis Cluster to enable superior scalability. In this blog post, we'll explore the similarities between the two systems while highlighting the critical differences that allow Dragonfly Cluster to scale more efficiently.
Redis Cluster - The Gossip Protocol Approach
Redis Cluster is widely praised for its simple and efficient design, which relies on the gossip protocol to manage distributed instances (or nodes). This decentralized approach is elegant in that every Redis primary instance has the same role and performs the same tasks to maintain the overall cluster topology. The gossip protocol ensures that cluster instances continuously share information with one another about the cluster's health and structure.
So, how does gossiping actually work? In Redis Cluster, each node exchanges state information with a subset of other nodes. Over time, this mechanism makes all nodes aware of the entire cluster. This method helps avoid a single point of failure and spreads the load of maintaining the cluster evenly across all instances.
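For intuition on how quickly gossip spreads, here is a toy simulation (our own illustration, not Redis code; the node count and fan-out are arbitrary) in which every node starts out knowing only about itself and, each round, shares everything it knows with a few random peers:

```python
import random

NODES = 1000
FANOUT = 3  # peers contacted per round; arbitrary for illustration

# Each node starts out knowing only about itself.
known = [{i} for i in range(NODES)]

rounds = 0
while any(len(k) < NODES for k in known):
    rounds += 1
    for sender in range(NODES):
        for peer in random.sample(range(NODES), FANOUT):
            # The peer merges in everything the sender knows, and vice
            # versa, loosely mimicking a PING/PONG exchange.
            known[peer] |= known[sender]
            known[sender] |= known[peer]

print(f"All {NODES} nodes learned the full topology in {rounds} rounds")
```

Information spreads epidemically and typically reaches all 1000 nodes within a handful of rounds, so propagation speed is not the problem. The cost, as the next sections show, lies in the size and frequency of the messages themselves.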
However, Redis Cluster has a practical limit of around 1000 nodes, though cloud providers like AWS impose stricter limits, with a default of 90 nodes and a maximum of 500 as of August 2020. This limitation comes from the very mechanism that makes it elegant—the gossip protocol. As the cluster grows, nodes spend more and more time exchanging cluster information rather than processing actual data requests. This leads to a bottleneck where nodes become so busy communicating with each other that they struggle to perform reads and writes efficiently.
Gossip Protocol - Message Size
To illustrate this, let's first take a closer look at the size of Redis Cluster gossip protocol messages.
In Redis Cluster, nodes continuously exchange `PING` and `PONG` messages, collectively known as heartbeat messages, that carry essential cluster configuration information. Each instance sends a `PING` message composed of the `clusterMsgDataGossip` structure, which includes fields like `nodename` (40 bytes), `ip` (46 bytes), and other metadata, resulting in a total of 106 bytes as shown below:
```c
// Redis v7.4.0
typedef struct {
    char nodename[CLUSTER_NAMELEN]; // 40 bytes
    uint32_t ping_sent;             // 4 bytes
    uint32_t pong_received;         // 4 bytes
    char ip[NET_IP_STR_LEN];        // 46 bytes
    uint16_t port;                  // 2 bytes
    uint16_t cport;                 // 2 bytes
    uint16_t flags;                 // 2 bytes
    uint16_t pport;                 // 2 bytes
    uint32_t notused1;              // 4 bytes
} clusterMsgDataGossip;
```
Each `PING` message contains not only the state of the sending instance but also information about roughly 1/10th of the cluster's instances. In a 1000-instance cluster, that's about 100 other instances. Here's how we can calculate the message size:
- For each instance, we include a struct as above: `sizeof(clusterMsgDataGossip)` = 106 bytes.
- Information of 100 other instances plus the sending instance itself: 106 x 101 = 10,706 bytes ≈ 10.71KB.
- Additionally, the slot mapping information needs to be transferred. Redis Cluster has a max of 16,384 slots, which consumes 2,048 bytes (2KB) as a bitmap.
- In total, each `PING` message consumes around 10.71 + 2 ≈ 13KB, as the short script below double-checks.
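As a sanity check on the arithmetic, the packed layout of `clusterMsgDataGossip` can be reproduced with Python's `struct` module. The format string simply mirrors the field sizes listed above (ignoring any compiler-specific padding):

```python
import struct

# 40-byte nodename, two uint32s, 46-byte ip, four uint16s, one uint32,
# packed with no padding ("<" disables alignment).
GOSSIP_ENTRY = struct.calcsize("<40sII46sHHHHI")

entries = 101                 # ~100 gossiped instances + the sender itself
slot_bitmap = 16384 // 8      # 2,048 bytes for the 16,384-slot bitmap

ping_size = GOSSIP_ENTRY * entries + slot_bitmap
print(GOSSIP_ENTRY)           # 106 bytes per gossip entry
print(ping_size)              # 12,754 bytes, i.e. roughly 13KB per PING
print(ping_size * 2)          # ~26KB for a full PING/PONG exchange
```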
Once `PING`ed, an instance sends a corresponding `PONG` message of the same structure and thus the same size, bringing the total traffic for a single heartbeat exchange to 26KB.
While 26KB may not seem large, it becomes significant when compared to typical Redis data requests, which are often just a few KB.
As the cluster scales, the volume of `PING`/`PONG` traffic increases, consuming a portion of the network bandwidth and reducing the cluster's capacity to handle actual read or write requests for in-memory data.
Gossip Protocol - Heartbeat Frequency
In terms of frequency, each node periodically sends `PING` messages to random nodes (by default, every second it samples a few random nodes and pings the one it has heard from least recently), triggering `PONG` replies. In addition, each node makes sure to ping every other node from which it hasn't received recent information within half the `NODE_TIMEOUT` duration; since `NODE_TIMEOUT` defaults to 15 seconds, that means every 7.5 seconds.
```
# Redis v7.4.0
#
# Cluster node timeout is the amount of milliseconds a node must be unreachable
# for it to be considered in failure state.
# Most other internal time limits are a multiple of the node timeout.
cluster-node-timeout 15000
```
In a 1000-node cluster, each node will attempt to send `PING` messages to the 999 other nodes every 7.5 seconds. This results in approximately:

- 999 / 7.5 = 133.2 heartbeats per second per node.
- That amounts to 133.2 x 1000 = 133,200 heartbeats per second across the entire cluster.
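Combining this with the ~13KB message size from the previous section gives a rough sense of the bandwidth cost per node. This is a back-of-the-envelope estimate based on default settings, not a measured benchmark:

```python
NODES = 1000
NODE_TIMEOUT = 15.0           # seconds, the default
PING_SIZE = 12_754            # bytes per PING, from the previous section

interval = NODE_TIMEOUT / 2   # every other node must be pinged within 7.5s
pings_per_sec = (NODES - 1) / interval   # ≈ 133.2 per node
cluster_wide = pings_per_sec * NODES     # ≈ 133,200 per second

# Each heartbeat is a PING plus an equally sized PONG reply.
bytes_per_sec = pings_per_sec * PING_SIZE * 2
print(f"{pings_per_sec:.1f} heartbeats/s per node, "
      f"{cluster_wide:,.0f} cluster-wide, "
      f"~{bytes_per_sec / 1_000_000:.1f} MB/s of heartbeat traffic per node")
```

That works out to roughly 3.4MB/s of heartbeat traffic per node just for the `PING`s it sends and the `PONG`s it gets back, before counting the heartbeats that other nodes initiate toward it.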
While these heartbeat messages are distributed evenly among the nodes, the volume becomes significant in large clusters with the default `NODE_TIMEOUT` setting.
The constant inter-node communication consumes considerable network bandwidth and processing power, impacting overall cluster performance by reducing the capacity to handle client requests.
Although there are ways to lower the number of messages, such as increasing the `NODE_TIMEOUT` value (doubling it to 30 seconds, for instance, halves the mandatory heartbeat rate to around 66.6 per second per node) or optimizing the heartbeat protocol, the default settings can lead to substantial overhead in very large clusters.
This highlights a limitation of the gossip-based system in scaling efficiently.
Essentially, when running Redis Cluster on a large scale, instances (or nodes) would be busy exchanging cluster information, leaving less CPU time and network bandwidth for in-memory data manipulation requests.
Dragonfly Cluster - Orchestrated Scalability
Dragonfly Cluster shares some architectural similarities with Redis Cluster, particularly in terms of sharding and slot assignment strategies. Like Redis, each Dragonfly node contains all the necessary cluster information. However, a key distinction is that Dragonfly instances don't rely on gossiping to communicate with each other.
Redis Cluster's gossip protocol is more akin to choreography, where each instance independently manages communication with other instances. While this can work well for smaller clusters, it becomes less efficient as the system scales.
In contrast, Dragonfly Cluster uses a centralized control plane (which can be multiple nodes with high availability itself) that stores the cluster information, maintains the cluster topology, and manages inter-instance coordination. This approach is similar to orchestration, where a central authority directs and manages communication between services. The control plane ensures smooth cluster-related operations without requiring constant communication between Dragonfly instances, which is more suitable for handling complex, large-scale deployments.
Specifically, the control plane is responsible for monitoring two critical inter-instance communication scenarios: slot migration and primary-replica creation, both of which are explained in detail in the previous post. In both cases, inter-instance communication is initiated and monitored by the control plane, reducing the complexity at each primary instance level.
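To make the orchestration idea more tangible, below is a deliberately simplified toy model of a control-plane reconciliation loop. Everything here (the `ControlPlane` class, `probe`, `reconcile`, the instance names) is our own illustration of the general pattern, not Dragonfly's actual API:

```python
"""A toy model of control-plane orchestration. All names and behaviors
are illustrative only, not Dragonfly's actual control-plane interface."""
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Instance:
    name: str
    is_primary: bool
    healthy: bool = True
    replica_of: Optional[str] = None

@dataclass
class ControlPlane:
    # The control plane is the single source of truth for cluster topology.
    instances: Dict[str, Instance] = field(default_factory=dict)

    def probe(self, inst: Instance) -> bool:
        # A real control plane would health-check over the network;
        # here we simply read a flag.
        return inst.healthy

    def reconcile(self) -> None:
        """Compare observed state with the desired topology and issue
        corrective actions. Instances never gossip with each other;
        all coordination is initiated here."""
        for inst in list(self.instances.values()):
            if inst.is_primary and not self.probe(inst):
                replica = next((r for r in self.instances.values()
                                if r.replica_of == inst.name), None)
                if replica is not None:
                    print(f"Promoting {replica.name} to replace {inst.name}")
                    replica.is_primary, replica.replica_of = True, None
                    del self.instances[inst.name]

cp = ControlPlane()
cp.instances["df-1"] = Instance("df-1", is_primary=True)
cp.instances["df-2"] = Instance("df-2", is_primary=False, replica_of="df-1")
cp.instances["df-1"].healthy = False  # simulate a primary failure
cp.reconcile()                        # -> Promoting df-2 to replace df-1
```

The important property is where the coordination cost lives: adding an instance adds one more entry for the control plane to watch, rather than another participant in an N-to-N heartbeat mesh.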
It is notable that a single Dragonfly instance is already highly capable and vertically scalable, handling significant workloads without needing to expend resources on network chatter or CPU cycles to maintain cluster state. By avoiding unnecessary inter-instance communication, Dragonfly Cluster brings horizontal scalability into the mix with optimized resource usage for network, CPU, and memory, allowing instances to focus on serving data efficiently at massive scale.
While we haven't tried this ourselves yet, we envision scalability for an in-memory data store system at multi-terabyte or even petabyte levels: given that a single Dragonfly instance can handle up to 1TB of workload, scaling to hundreds or even thousands of instances within a Dragonfly Cluster should not be problematic.
Conclusion
In this post, we explored the architecture of Redis Cluster and Dragonfly Cluster, drawing parallels to communication strategies in microservices. Redis Cluster employs a gossip-based approach similar to choreography, where each instance independently manages its communication with others. While effective for smaller clusters, this method introduces overhead as the system scales. On the other hand, Dragonfly Cluster's architecture is more similar to orchestration, where a centralized but highly available control plane manages communication and coordination between instances. This approach enables better scalability and resource optimization, making it ideal for large, complex deployments.
We believe that we've chosen the best architecture for both stand-alone Dragonfly instances and Dragonfly Cluster. A single Dragonfly instance employs a multi-threaded, shared-nothing architecture that fully utilizes multi-core server resources and ensures atomicity without heavy locking. The introduction of Dragonfly Cluster allows for horizontal scalability without the limitations seen in gossip-based systems.
To learn more, we invite you to explore Dragonfly for your workloads. Whether you're optimizing small-scale applications or preparing for future multi-terabyte workloads with Dragonfly Cluster, Dragonfly offers exceptional performance and scalability. Sign up for Dragonfly Cloud to experience the powerful managed service with configurable resources and high availability, or join our community to stay updated on the progress of Dragonfly Cluster and other exciting features.
Footnotes

[^1]: Note that Dragonfly Cluster is still in preview. However, many features and cluster-related commands described in this article are already available in the Dragonfly main branch. We are actively testing and improving this amazing feature.