Dragonfly vs. Valkey Benchmark: 4.5x Higher Throughput on Google Cloud

Dragonfly outperforms Valkey in throughput, performance stability, and memory efficiency, scaling linearly with CPU cores and excelling in high-concurrency, CPU-intensive workloads.

March 4, 2025

We haven’t released benchmarks comparing Dragonfly to similar in-memory data stores in a while. This isn’t due to a lack of interest from the community—proper benchmarking simply takes significant time and effort. Conducting accurate and meaningful performance evaluations requires careful setup, rigorous testing, and thorough analysis to ensure reliable results.

Now that Dragonfly Cloud is running on all three major cloud providers, we decided to benchmark and compare it with other services in each cloud. We start by comparing Dragonfly and Valkey OSS on Google Cloud. In an upcoming post, we’ll look at how Google Memorystore for Valkey compares to Dragonfly Cloud, focusing on both services from a black-box, managed perspective.

Dragonfly Architecture

Before diving into the results, it’s worth highlighting how Dragonfly fundamentally differs from Redis-like architectures such as Valkey. Dragonfly server architecture is designed around several core principles:

  1. Sharding and Parallel Processing
    The in-memory dataset is divided into independent shards, each assigned to a dedicated thread. This eliminates the single-thread bottleneck in Redis/Valkey and allows Dragonfly to scale with additional CPU cores.
  2. Minimal Locking and Synchronization
    In a shared-nothing approach, each shard’s keys are managed by a single thread, minimizing locking overhead. This design significantly boosts efficiency under high concurrency.
  3. Asynchronous Operations and Responsiveness
    Leveraging io_uring under the hood, Dragonfly runs long-running tasks (like snapshotting) asynchronously. This keeps the data store responsive and avoids disruptions, even under heavy load.
  4. Abstract I/O and Efficient Fibers
    Dragonfly employs a custom framework that uses stackful fibers within each thread, making I/O operations more efficient and ensuring better concurrency handling.

Dragonfly Threading Model (Simplified)

In the figure above, imagine a Dragonfly server process spawning multiple threads, each of which can handle both client I/O and data shard operations. The CPU time in each thread is shared among multiple fibers that allow asynchronous execution, much like Node.js or Python’s async event loop. In this example, the Dragonfly server process spawns four threads, where threads 1 through 3 handle I/O (i.e., manage client connections) and threads 2 through 4 manage data shards. Thread 2, for example, divides its CPU time between handling incoming requests and processing data operations on the shard it owns. In general, any thread can have numerous responsibilities requiring CPU time. Data management and connection handling are just two such examples.

For the benchmarks in this blog post, we used the following command to run Dragonfly v1.26.1:

./dragonfly --logtostderr --dbfilename='' --conn_use_incoming_cpu

Running like this, Dragonfly uses all the available vCPUs on a machine.

Valkey: The Redis Fork

Valkey, the open-source fork of Redis, recently made headlines with performance enhancements merged into the project after the fork. One such improvement optimized the I/O path by offloading networking and parsing onto secondary threads, allowing the main thread to focus solely on data store operations. However, these improvements still do not let Valkey operate on the data store from multiple threads. While I/O offloading indeed helps, it does not address the core limitation of Redis/Valkey: the single main thread. Dragonfly, in contrast, scales vertically to any number of CPUs.

In the tests below, we used the following command to run Valkey v8.0.2:

./valkey-server --io-threads $IO_THREADS --save '' --protected-mode no

With the command above, Valkey uses several I/O threads in addition to the main thread, which handles the data store operations. We used IO_THREADS=10 on all servers. We did not see much improvement from further increasing the number of I/O threads, which makes sense: Valkey's biggest bottleneck is the main thread, and its work cannot be offloaded any further.

Google Cloud Platform (GCP)

As a leading cloud provider, GCP needs no introduction. Historically, we’ve run our benchmarks on AWS because of its strong networking capabilities. However, GCP recently released its fourth-generation line of servers, which show impressive networking improvements. Thus, we’ve benchmarked in-memory data stores on three sizes of GCP C4 instances: 16 vCPUs, 32 vCPUs, and 48 vCPUs. To eliminate client-side bottlenecks, the load test was run from a larger machine. All machines run the Linux kernel v6.8.

The dfly_bench Tool

For our benchmarks, we used dfly_bench, a load-testing program developed by our team for our internal benchmarking needs. It closely resembles memtier_benchmark, as we tried to keep the runtime flags consistent with it. Its relatively small footprint enables us to quickly add new features. For example, in order to perform the ZADD benchmark below, we added support for the __score__ macro. We’ve also seen it reach higher throughput rates for the same setup by being slightly more efficient. While the last point is less relevant for testing Valkey, it’s very important for Dragonfly—we do not want our load-test program to become a bottleneck due to the extremely high throughputs that Dragonfly reaches.

The sections below cover the two types of test traffic we sent to our servers.


The SET/GET Mixed Test

The mixed test suite consists of 20 minutes of SET/GET traffic using the following command:

dfly_bench --ratio 1:1 --proactor_threads $THREADS -c $CONN_PER_THREAD -d 64 --key_maximum 200000000 --test_time=1200 --qps=0

In the command above, THREADS and CONN_PER_THREAD are variables specific to the server type and size. For Dragonfly, we chose higher values; for Valkey, we had to choose lower ones in order not to overload the server. As the instance size grew, we also increased THREADS for Dragonfly in order to drive higher throughput. Notably, the parameters were chosen so that the P99 latency stayed below 0.5 ms. See the detailed benchmark parameter table at the end of this blog post.
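
As a concrete example, substituting the 48 vCPU Dragonfly values from the appendix table (THREADS=48, CONN_PER_THREAD=20) gives:

```shell
# SET/GET mixed test, 48 vCPUs, Dragonfly (values from the appendix table)
THREADS=48
CONN_PER_THREAD=20
dfly_bench --ratio 1:1 --proactor_threads $THREADS -c $CONN_PER_THREAD \
  -d 64 --key_maximum 200000000 --test_time=1200 --qps=0
```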

Dragonfly vs. Valkey - The SET/GET Mixed Test

Valkey significantly improved the throughput for use cases like SET/GET, consistent with the results from the original Valkey performance post. We observe roughly a 3-4x improvement compared to a pure single-threaded setup. This is a great result for Valkey!

These tactical optimizations, as good as they are, still cannot fully utilize the underlying hardware. As a result, we see that Dragonfly has 2.4x more throughput than Valkey on 16 vCPUs and 4.5x more throughput on 48 vCPUs.

Beyond throughput, our benchmark tests revealed critical insights into performance stability and memory efficiency for both data stores, which are equally important for real-world applications.

Performance Stability

Below, we captured the throughput time series graphs for both Valkey and Dragonfly, showing their performance under the mixed SET/GET test with 32 vCPUs. Note that we are using a stacked time series graph in Grafana to visualize the data. In this representation, the lower curve corresponds to the GET throughput, while the higher curve represents the SET throughput. Since the graph is stacked, the SET throughput is layered on top of the GET throughput, meaning the height of the SET curve also effectively reflects the total throughput.

The Valkey throughput graph during the test looked like this:

Valkey RPS Throughput (SET/GET Mixed, 32 vCPUs, Stacked Time Series)

And it was like this for Dragonfly:

Dragonfly RPS Throughput (SET/GET Mixed, 32 vCPUs, Stacked Time Series)

As you can see, the throughput graphs reveal notable differences in performance stability between the two systems. Valkey’s graph shows periodic performance degradations characterized by sharp "valleys" caused by global hashtable rehashing. This process significantly impacts server performance, causing noticeable drops in throughput. In contrast, Dragonfly demonstrates a more stable performance profile with minimal fluctuations throughout the test.

Memory Efficiency

Dragonfly also outperforms Valkey in terms of memory efficiency. During the benchmark, Dragonfly utilized only 17 GiB of memory compared to Valkey’s 24.5 GiB. Despite this lower memory usage, Dragonfly managed to store 198 million items, whereas Valkey stored 177 million. This translates to a 38% reduction in memory usage per item for Dragonfly compared to Valkey. The difference in the number of items stored results from the varying write throughput maintained by each server during the load test.
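
As a quick back-of-the-envelope check of the per-item figure, using only the totals quoted above (17 GiB over 198 million items for Dragonfly vs. 24.5 GiB over 177 million items for Valkey):

```shell
# Per-item memory cost derived from the totals reported during the benchmark
awk 'BEGIN {
  df = 17 * 2^30 / 198e6     # Dragonfly: bytes per item
  vk = 24.5 * 2^30 / 177e6   # Valkey: bytes per item
  printf "Dragonfly: %.0f B/item, Valkey: %.0f B/item, reduction: %.0f%%\n", df, vk, (1 - df / vk) * 100
}'
# Dragonfly: 92 B/item, Valkey: 149 B/item, reduction: 38%
```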

The ZADD Test

ZADD is a Redis command that adds one or more members to a sorted set, a data structure that stores unique elements ordered by their associated scores. In this test, each command adds 256 members to a sorted set entry. Sorted sets are inherently CPU-intensive because maintaining element order requires dynamic data structures (skiplists in Redis/Valkey, balanced trees in Dragonfly) that involve frequent updates and rebalancing operations. These operations become more demanding as the number of elements grows, making sorted sets one of the more computationally expensive data structures in the Redis API. We ran the test using the following command:

./dfly_bench --command "$COMMAND" --key_maximum 100000000 -n 70000 -c $CONN_PER_THREAD --proactor_threads $THREADS --qps=0 -d 8

Where the COMMAND argument is defined like this:

export COMMAND="zadd __key__ __score__ __data__ __score__ __data__ ... __score__ __data__"

With 256 __score__ __data__ pairs:

  • __key__ generates a random key.
  • __score__ generates a random double.
  • __data__ generates a random string of predefined length, which is 8 in this case. 
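
Spelled out, the COMMAND template with all 256 pairs is tedious to type by hand, so it can be generated instead; a minimal sketch:

```shell
# Build the ZADD command template with 256 "__score__ __data__" pairs
PAIRS=$(for i in $(seq 1 256); do printf ' __score__ __data__'; done)
export COMMAND="zadd __key__$PAIRS"
echo "${#COMMAND}"   # 12 chars for "zadd __key__" + 256 * 19 chars per pair = 4876
```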

Below are the results of sending such traffic to both Valkey and Dragonfly.

Dragonfly vs. Valkey - The ZADD Test

The results clearly demonstrate that Valkey is bottlenecked by its main thread for CPU-intensive operations like ZADD, while Dragonfly scales almost linearly, reaching 29 times higher throughput on the 48 vCPU server.

Dragonfly not only scales nearly linearly with the number of vCPUs but also demonstrates significantly greater memory efficiency than Valkey. For instance, in the ZADD test with 48 vCPUs, Dragonfly used only 12.6 KiB per sorted set entry, whereas Valkey required 23.1 KiB. This represents a 45% reduction in memory usage for Dragonfly, achieved through our optimized sorted set implementation based on a B+ tree. You can explore more details about this implementation in our sorted set blog post.

Dragonfly’s Performance Advantage

These benchmarks highlight the significant performance advantages of Dragonfly’s multi-threaded architecture compared to Valkey, particularly in scenarios with high concurrency and CPU-intensive operations. Dragonfly’s ability to scale linearly with the number of vCPUs, combined with its superior memory efficiency, makes it a compelling choice for demanding in-memory data storage needs. In future posts, we will explore Dragonfly’s performance on other cloud platforms and compare Dragonfly Cloud with managed in-memory data store services.


Appendix | Benchmark Parameters

Below are the dfly_bench parameters that were used to run all benchmarks.

| Test Name                  | THREADS | CONN_PER_THREAD |
|----------------------------|---------|-----------------|
| MIXED, 16 vCPUs, Dragonfly | 24      | 20              |
| MIXED, 32 vCPUs, Dragonfly | 28      | 20              |
| MIXED, 48 vCPUs, Dragonfly | 48      | 20              |
| MIXED, 16 vCPUs, Valkey    | 10      | 20              |
| MIXED, 32 vCPUs, Valkey    | 12      | 20              |
| MIXED, 48 vCPUs, Valkey    | 12      | 20              |
| ZADD, 16 vCPUs, Dragonfly  | 10      | 12              |
| ZADD, 32 vCPUs, Dragonfly  | 14      | 12              |
| ZADD, 48 vCPUs, Dragonfly  | 18      | 12              |
| ZADD, 16 vCPUs, Valkey     | 2       | 12              |
| ZADD, 32 vCPUs, Valkey     | 2       | 12              |
| ZADD, 48 vCPUs, Valkey     | 2       | 12              |
