Introduction
Redis is widely adopted by teams and organizations because of its speed and simplicity. Its in-memory nature allows for lightning-fast data manipulation, making it an ideal fit for applications that need low-latency responses. However, Redis's single-threaded design for handling data operations often means that teams quickly encounter scaling challenges, sometimes at an early stage in the lifecycle of an application.
While Redis provides some flexibility for scaling, doing it properly isn't always straightforward. Many teams hit roadblocks, particularly as they try to manage more traffic and larger datasets. Often, decisions made in haste lead to inefficiencies or even performance degradation, which becomes increasingly problematic as applications grow.
In this blog post, we'll highlight some of the most common mistakes made when scaling Redis. We'll also discuss the consequences of these missteps, helping you understand why they happen and what you can do to avoid them. By the end, you should have a clearer view of how to approach Redis scaling in a way that supports your application's growth without falling into common pitfalls.
Common Mistakes & Pitfalls in Scaling Redis
1. Vertical Scaling
One of the most common misconceptions when scaling Redis is the belief that vertical scaling (adding more CPU, memory, or other resources) will resolve performance issues. While this might work in other systems, Redis's single-threaded handling of in-memory data operations limits the benefits of vertical scaling. Adding memory or CPU power can speed up background tasks like AOF rewriting and snapshotting, but it won't significantly enhance the throughput of the core read and write operations.
Because Redis uses a single thread for these data operations, adding CPU cores does little to help absorb growing traffic or larger datasets. The result is that teams often see diminishing returns: the cost of upgrading to larger servers doesn't translate into proportionally better performance. In the worst case, companies spend heavily on infrastructure without seeing the expected gains, leading to frustration and wasted resources.
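You can observe this plateau with a rough load-generation sketch like the one below (assuming a local Redis instance and the redis-py package; a dedicated tool such as redis-benchmark gives cleaner numbers, since Python's own overhead can dominate). Past the point where the single data thread is saturated, adding concurrent clients, or server cores, barely moves throughput.

```python
# Rough throughput probe: hammer a local Redis with SETs from N threads.
# Because Redis executes data commands on a single thread, ops/sec
# plateaus once that thread is saturated, no matter how many cores
# the server has.
import time
from concurrent.futures import ThreadPoolExecutor

import redis  # pip install redis

def worker(ops: int) -> None:
    r = redis.Redis(host="localhost", port=6379)
    for i in range(ops):
        r.set(f"bench:{i}", "x")

def measure(clients: int, ops_per_client: int = 5_000) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        for _ in range(clients):
            pool.submit(worker, ops_per_client)
    return clients * ops_per_client / (time.perf_counter() - start)

for clients in (1, 4, 16):
    print(f"{clients:>2} clients: ~{measure(clients):,.0f} ops/sec")
```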
2. Read Replicas
Many teams turn to read replicas as a way to scale Redis, assuming they can handle both read and write workloads effectively. However, this can lead to several key mistakes.
One common misconception is that replicas can directly improve both read and write performance. While adding replicas can help offload some of the read traffic from the primary node, they do little to address write-heavy workloads. Since only the primary handles writes, increasing the number of replicas won't reduce the write load, leading to potential bottlenecks when the write traffic grows beyond the capacity of the primary node. This can result in degraded write performance, slowing down overall system throughput.
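To make the asymmetry concrete, here is a minimal read/write routing sketch with redis-py (the hostnames are hypothetical): reads can fan out across any number of replicas, but every write still lands on the one primary.

```python
# Reads are spread across replicas; writes all go to the single primary.
# Adding more replicas grows read capacity only; the write ceiling is
# still whatever the primary alone can handle.
import random

import redis

primary = redis.Redis(host="redis-primary", port=6379)
replicas = [
    redis.Redis(host="redis-replica-1", port=6379),
    redis.Redis(host="redis-replica-2", port=6379),
]

def write(key: str, value: str) -> None:
    primary.set(key, value)  # every write hits the same node

def read(key: str) -> bytes | None:
    return random.choice(replicas).get(key)  # reads scale out
```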
Another pitfall is adding too many replicas. While it might seem like a good way to distribute read traffic, too many replicas can trigger what's known as a replication storm: multiple replicas request snapshots and continuous replication streams from the primary at the same time, overloading it with synchronization work. If the primary node fails and a failover occurs, the storm can repeat, overwhelming the new primary as well and causing cascading performance issues and unresponsive instances across the system.
Teams also frequently make the mistake of under-provisioning the primary node. As mentioned above, the primary not only handles all the writes but also drives replication, both the initial snapshot and the continuous command stream. If the primary node doesn't have enough resources, replication can lag, increasing the likelihood of data inconsistency or even failed synchronization during high-traffic periods.
On the other hand, over-provisioning replicas creates a different set of problems. In an attempt to keep configurations uniform, teams often equip replicas with the same specifications as the primary node, even though the replicas aren't doing the heavy lifting of handling writes. This wastes resources and drives up infrastructure costs without delivering meaningful performance gains.
3. Manual Sharding or Proxy Layer
When teams begin to saturate a single Redis instance, many turn to manual sharding or use a proxy layer. Using a consistent hashing algorithm in the backend application code may seem like all that's needed to evenly distribute data across multiple nodes. However, this oversimplifies the complexity of managing a distributed system.
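For illustration, here is a minimal consistent-hashing sketch of the kind teams typically hand-roll (node addresses are hypothetical). Notice everything it doesn't do: no failover, no automatic resharding, no replication awareness.

```python
# Hand-rolled consistent hashing over a fixed set of Redis nodes.
# Keys map to the first virtual node clockwise on the ring.
import bisect
import hashlib

import redis

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

nodes = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
ring = HashRing(nodes)
conns = {
    n: redis.Redis(host=n.split(":")[0], port=int(n.split(":")[1]))
    for n in nodes
}

def shard_set(key: str, value: str) -> None:
    conns[ring.node_for(key)].set(key, value)  # route by ring position
```

The sharding logic itself fits on a page; what doesn't is everything around it: detecting a dead node, rebalancing keys when the topology changes, and keeping every application instance's view of the ring consistent.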
One of the major pitfalls of manual sharding, or of early proxy-based solutions such as Twemproxy, is the lack of built-in failure recovery and automatic scaling mechanisms. For instance, adding or removing nodes in a Twemproxy cluster can be very challenging without restarting the entire system, which may lead to downtime at critical moments. Configuration changes require manual updates and restarts as well, further compounding the risk of errors and inconsistencies.
As a result, teams relying on these approaches often face high operational overhead. Node failures may require manual handling, which not only introduces downtime but also increases the chances of data inconsistency. These issues are quite common in early distributed systems that don't prioritize seamless scaling and failover, ultimately causing problems for operations teams and making the system more fragile than intended.
4. Redis Cluster
As organizations grow their infrastructure, many look to Redis Cluster as the ultimate scaling solution. However, there are several common misconceptions that can lead to operational challenges when moving from a stand-alone Redis instance to a cluster.
One key mistake is assuming that Redis Cluster behaves just like a stand-alone Redis instance. While the fundamental API remains the same, Redis Cluster introduces several important limitations. For instance, multi-key commands are not supported unless all the keys involved reside in the same hash slot. Additionally, logical databases (the ability to divide data into separate namespaces) are disabled in Redis Cluster, which may disrupt existing workflows. Finally, teams need a cluster-aware client SDK to interact with Redis Cluster. These differences can cause incompatibilities if teams don't fully adapt their operations and applications to the Redis Cluster architecture, leading to failed operations and frustrated developers.
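As a sketch of what adapting involves, here is how it looks with redis-py's cluster client (the hostname is hypothetical): multi-key commands succeed only when hash tags force the keys into the same slot.

```python
# Redis Cluster requires a cluster-aware client that tracks slot
# ownership and follows MOVED/ASK redirections.
from redis.cluster import RedisCluster

rc = RedisCluster(host="cluster-node-1", port=6379)

# This fails: the two keys hash to different slots, so the multi-key
# command cannot be executed on a single node.
# rc.mset({"user:1:name": "Ada", "user:2:name": "Alan"})

# This works: the hash tag {user:1} pins both keys to the same slot.
rc.mset({"{user:1}:name": "Ada", "{user:1}:email": "ada@example.com"})
```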
Another frequent misunderstanding relates to the gossip protocol Redis Cluster uses to manage communication between nodes. The gossip protocol requires constant information exchange between primary nodes to maintain the cluster topology. While this might seem efficient at first, as the cluster scales, particularly to hundreds of nodes, the resource costs (e.g., bandwidth, CPU usage) of this communication grow dramatically. The overhead can raise latency and lower throughput on individual nodes, which effectively caps overall cluster performance and sets an upper limit on the number of nodes.
How to Avoid Scaling Mistakes & Pitfalls
Founding members of the Dragonfly team have spent years tackling Redis scalability challenges and concluded that some of these pitfalls can be avoided by following best practices, while others cannot be solved by operational effort alone. This is why Dragonfly was built: to help developers avoid these common pitfalls and build more robust backend systems.
Single Instance with Dragonfly
With Dragonfly, you can scale vertically and take full advantage of powerful hardware, making effective use of up to 64 cores and hundreds of gigabytes, even up to 1TB, of memory. A single instance can handle large workloads without prematurely shifting to more complex setups.
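Because Dragonfly speaks the Redis protocol, taking advantage of that vertical headroom requires no application changes; an existing Redis client simply points at the Dragonfly instance (the hostname below is hypothetical).

```python
# The same redis-py client code works unchanged against Dragonfly,
# which is wire-compatible with Redis.
import redis

df = redis.Redis(host="dragonfly-host", port=6379)
df.set("greeting", "hello")
print(df.get("greeting"))  # b'hello'
```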
Read Replicas with Dragonfly
A single Dragonfly instance is designed to be robust and efficient, enabling a strong primary-replica topology (e.g., 1 primary and 2 replicas managed by Sentinel) without the common issues found in Redis. Notably, Dragonfly eliminates the replication storm problem described above, since its memory usage doesn't spike during snapshotting, ensuring more balanced operations.
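From the client side, such a topology looks like the following redis-py sketch (sentinel addresses and the service name are hypothetical); since Dragonfly is Redis-protocol compatible, the same code applies to both.

```python
# Sentinel-managed primary-replica topology: the client asks Sentinel
# who the current primary is, so failover is transparent to the app.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)
primary = sentinel.master_for("myservice", socket_timeout=0.5)
replica = sentinel.slave_for("myservice", socket_timeout=0.5)

primary.set("counter", 1)      # writes go to the current primary
print(replica.get("counter"))  # reads can be served by a replica
```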
Manual Sharding or Proxy Layer
We don't recommend manual sharding or proxy layers. These setups, often relics of earlier designs, are limited in functionality and create operational headaches. Dragonfly's architecture offers more seamless scalability without relying on such outdated systems.
Dragonfly Cluster
Dragonfly Cluster avoids the pitfalls of Redis Cluster by eliminating the gossip protocol in favor of a more orchestrated approach. This results in far less inter-node communication and thus a clustering system with a much higher ceiling, both in QPS and in the amount of in-memory data it can manage.
While we've improved many aspects of the distributed architecture, Dragonfly Cluster shares the same limitations as Redis Cluster when it comes to multi-key operations and logical databases. We want to be transparent about these constraints so that developers can make informed decisions when scaling with Dragonfly Cluster.
Conclusion
Scaling Redis can be a complex and challenging task, with many teams and organizations falling into common pitfalls that can impact performance, reliability, and costs. By understanding these mistakes—whether in vertical scaling, read replicas, manual sharding, or Redis Cluster—you can make more informed decisions.
With Dragonfly, many of these issues are addressed directly, offering a more efficient, robust path to scaling. Whether you run a single instance or a cluster, Dragonfly's modern, thoughtful design helps developers and platform engineers avoid these scaling traps, so you can focus on growing your application instead of managing infrastructure hurdles.
If you're ready to experience these benefits firsthand, try out the Dragonfly Community Edition or explore Dragonfly Cloud, where you can scale your in-memory workloads effortlessly with our managed solution. Both options are built on the same source-available Dragonfly core and offer the performance, simplicity, and reliability your applications need.