How We Optimized Dragonfly to Get 30x Throughput with BullMQ

In this blog post, we navigate the journey of optimizing Dragonfly for BullMQ, achieving an exceptional 30x throughput increase. This endeavor highlights our deep commitment to the community and demonstrates our dedication to expanding and adapting Dragonfly to the open-source ecosystem.

November 21, 2023


Howdy! I am Shahar, a software engineer at DragonflyDB. Today, I'm thrilled to share an exciting journey we embarked on — optimizing Dragonfly for BullMQ, which resulted in a staggering 30x increase in throughput. While I won't be delving into code snippets here (you can check out all the changes on our GitHub), I'll take you through the high-level optimizations that made this achievement possible. We believe these enhancements not only mark a significant milestone for Dragonfly but also represent great performance benefits for users of BullMQ.

So, if you're interested in the behind-the-scenes of database performance tuning and what it takes to achieve such a massive throughput improvement, you're in the right place. Let's get started.

Introduction

BullMQ is a popular, robust, and fast Node.js library for creating and processing background jobs, using Redis as its backend. A Redis backend is a common choice for such frameworks, providing low latency, comprehensive data structures, replication, and other useful features. However, applications using Redis are ultimately limited by its throughput. Dragonfly is a drop-in replacement for Redis designed for much higher throughput and easier deployments.

In a previous blog post, we announced the full compatibility of Dragonfly with BullMQ and showed how to run BullMQ with Dragonfly efficiently in a few simple steps. In this post, we'll elaborate on the optimizations we made to achieve the 30x performance improvement.


Benchmark Baseline

Before we optimize anything, it's essential to establish a reliable benchmarking baseline. For BullMQ, which offers its users a robust queue API for adding and processing jobs, our focus was on optimizing the throughput for the add-job operation. Collaborating closely with the BullMQ team, we developed a benchmarking tool. This tool is tailored to measure the rate at which messages can be added to queues per second — a crucial metric for understanding the performance of BullMQ while running on different backends.

Our Benchmarking Approach

  • We focus on adding messages (i.e., add-job operations). The rationale is straightforward: additions are less influenced by the fluctuating state of the queue. Unlike reading from queues, which can be delayed if queues are temporarily empty, adding messages offers a more stable and measurable performance indicator.
  • To push Dragonfly to its limits, we employed Node.js worker threads, enabling us to leverage real OS threads running on multiple CPUs in parallel. This approach simulates a high-load environment far more effectively than a single-threaded setup (see the sketch after this list).
  • Another key aspect of our testing environment was the hardware configuration. We intentionally used a more powerful machine for the client code (running BullMQ) compared to the machine running Dragonfly. This ensures that the client side is never the bottleneck, allowing us to accurately assess Dragonfly's raw performance.
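
To make this concrete, here is a minimal sketch of such an add-job loop. It is not the actual benchmarking tool (that one is linked in the appendix); the thread and queue counts, queue names, and connection details are illustrative.

    // Minimal sketch of an add-job benchmark: real OS threads (worker_threads),
    // each adding jobs to one BullMQ queue as fast as possible for a fixed duration.
    import { Queue } from 'bullmq';
    import { Worker as ThreadWorker, isMainThread, workerData } from 'worker_threads';

    const THREADS = 16;         // client-side worker threads (illustrative)
    const QUEUES = 8;           // BullMQ queues, named queue0..queue7 (illustrative)
    const DURATION_MS = 30_000; // benchmark duration

    if (isMainThread) {
      // Spawn the OS threads; each one targets queue index (i % QUEUES).
      for (let i = 0; i < THREADS; i++) {
        new ThreadWorker(__filename, { workerData: { queueIndex: i % QUEUES } });
      }
    } else {
      const run = async () => {
        const queue = new Queue(`queue${workerData.queueIndex}`, {
          connection: { host: 'server_ip', port: 7000 },
        });
        const deadline = Date.now() + DURATION_MS;
        let added = 0;
        while (Date.now() < deadline) {
          await queue.add('bench', { seq: added }); // the measured add-job operation
          added++;
        }
        console.log(`worker thread done, added ${added} jobs`);
        await queue.close();
      };
      run();
    }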

Of course, if your workload requires higher throughput than shown in this post, you can always use more powerful machines; Dragonfly is all about scaling, both vertically and horizontally. As a performance baseline, let's see how many add-jobs/sec Redis can achieve:

| Backend | add-jobs/sec |
|---|---|
| Redis 6.2 | 71,351 |
| Redis 7.2 | 76,773 |

Benchmark with Redis Cluster

Scaling in Redis typically involves setting up a Redis Cluster. Note that minor changes are needed to get the benchmark to work against the Redis Cluster, as one has to use a cluster-aware Redis client. We conducted our tests using an 8-node Redis Cluster, all hosted on a single machine equipped with 8 CPUs.

| Backend | add-jobs/sec |
|---|---|
| Redis 6.2 | 71,351 |
| Redis 7.2 | 76,773 |
| Redis Cluster | 194,533 |

As shown above, the Redis Cluster reaches much higher throughput than a single node.


Global Locks

Running BullMQ on Dragonfly doesn't work out of the box, but with a few configuration steps, you can get it up and running smoothly. To grasp this integration challenge, let's look at how BullMQ uses Redis. BullMQ leverages Lua scripts to execute Redis commands efficiently. These scripts are advantageous because they reduce network round trips and allow multiple commands to execute atomically.

However, the way BullMQ executes these scripts presents a hurdle. Although Redis requires that Lua script invocations declare all the keys the script uses, it does not enforce this requirement, so accessing undeclared keys from Lua scripts "just works" in Redis. This flexibility contrasts sharply with Dragonfly's design.
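
To make this concrete, here is a contrived example (not an actual BullMQ script) using the ioredis client: the script declares a single key but also touches a second key it builds at runtime. Redis executes it without complaint, while Dragonfly's default mode rejects the undeclared access.

    // Contrived illustration: one declared key (KEYS[1]) plus an undeclared key
    // derived inside the script. Key names and client setup are assumptions.
    import Redis from 'ioredis';

    async function main() {
      const redis = new Redis({ host: 'localhost', port: 6379 });
      const script = `
        redis.call('LPUSH', KEYS[1], ARGV[1])
        -- KEYS[1] .. ':meta' was never declared in KEYS:
        return redis.call('HSET', KEYS[1] .. ':meta', 'last', ARGV[1])
      `;
      // numkeys = 1, so only 'bull:myqueue:wait' is declared.
      const reply = await redis.eval(script, 1, 'bull:myqueue:wait', 'job-1');
      console.log(reply);
      redis.disconnect();
    }

    main();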

Dragonfly's architecture is fundamentally different. It adopts a multi-threaded, shared-nothing approach, meaning keys are partitioned across Dragonfly threads rather than shared between them. On top of this architecture we built a state-of-the-art transactional framework that allows running operations on keys that belong to different threads.

The challenge arises with BullMQ's Lua scripts, which access undeclared keys. Dragonfly locks all declared keys prior to executing a Lua script and then strictly prohibits accessing undeclared keys during script execution, as they could be part of other transactions running in parallel.

[Figure: transactions]

Let's consider an example to illustrate the challenge with Dragonfly and undeclared keys. As shown above, imagine a transaction, Tx1, which is set to access keys key0, key1, and key2. In Dragonfly's architecture, Tx1 cannot access other keys, such as key3, because it needs to ensure they are not locked by another transaction like Tx2. Attempting to access an undeclared key like key3 could disrupt the integrity of Tx2.

Accessing undeclared keys is common, and many Redis-based frameworks rely on this practice. To accommodate this, Dragonfly can be configured to handle these scenarios. By using the --default_lua_flags=allow-undeclared-keys flag, Dragonfly treats Lua scripts as global transactions. This means the script has access to the entire datastore, locking it completely for its duration.
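
For reference, launching a Dragonfly instance in this mode looks like the following, assuming the dragonfly binary is on your PATH (the flag itself is the one described above):

    # Treat every Lua script as a global transaction so undeclared keys are allowed.
    dragonfly --default_lua_flags=allow-undeclared-keys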

However, this solution comes with its own set of challenges. While it allows scripts to access all keys (including undeclared ones), it also restricts parallelism: no other commands or scripts can run alongside a global transaction, even if they involve completely different keys. The situation is further complicated by Dragonfly's multi-threaded nature. One might think that, in this mode, Dragonfly would have performance characteristics similar to Redis. However, because keys in Dragonfly are owned by different threads, Dragonfly has to schedule the work across the threads that own the relevant keys (we call these scheduling steps "hops"), which adds significant latency.

[Figure: global-locks]

Here are the results of running Dragonfly with global locks:

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |

As demonstrated, using global locks in Dragonfly, while necessary for compatibility with certain Lua scripts, leads to a noticeable drop in throughput. However, this is just the start of our journey, so let's start optimizing!


Rollback Mechanism: A Path Not Taken

One of the initial approaches we considered was a new mode for running Lua scripts, which we called "try-lock-rollback". In this mode, Lua scripts would work normally until they attempted to access an undeclared key. At that point, Dragonfly would try to lock that key. If the lock succeeded, meaning no other command or script was using the key, the script would proceed as intended.

However, the challenge arises if the lock attempt does not succeed. If the lock could not be acquired (because another command or script was using the key), Dragonfly would roll back any changes made by the script so far. Following the rollback, it would attempt to run the script again from the beginning, this time possibly pre-locking the keys that caused the failure in the previous attempt.

This is an elegant solution that provides a generic way for running all Lua scripts (BullMQ or otherwise), but it has two major disadvantages:

  • Performance Impact on the Happy Path: Implementing this would require tracking every change made by every script, just in case a rollback is needed. This tracking would be necessary even for scripts that don't access undeclared keys or where undeclared keys are successfully locked. Essentially, this means slowing down the usual, rollback-free operations, which is known as the "common path" or "happy path".
  • Risk of Snowball Effect & Deadlocks: In scenarios where keys are frequently contended, this approach could lead to a series of rollbacks and retries, creating a bottleneck and significantly impacting performance. Under extreme conditions, this could even lead to deadlocks.

Given these considerable disadvantages, we ultimately decided not to pursue this option.


Hashtags to the Rescue

While we might not know every specific key a script will use, we do know something about their structure. In the case of BullMQ, all keys share a common string pattern. Users can define a prefix, like the default bull:, followed by a queue name. This forms a consistent prefix for all keys related to a specific queue (for instance, bull:queue:123). Our initial thought was to implement some form of prefix locking based on this pattern. However, we then considered a more refined solution: hashtags.

Hashtags are a Redis Cluster feature. They involve wrapping a part of a key in curly braces {}, which ensures that all keys with the same hashtag are located on the same cluster node. For example, keys {user1}:name and {user1}:email are guaranteed to reside on the same node, allowing them to be efficiently used together in commands or scripts.

Recognizing that BullMQ already utilizes hashtags for Redis Cluster operations, we adopted this concept for Dragonfly as well. We introduced a new server flag (--lock_on_hashtags) with which Dragonfly locks based on the hashtag rather than the entire key. This approach allows us to maintain the atomicity and isolation of script executions while avoiding the performance penalties of global locks and the complexities of the rollback mechanism.
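
As a rough sketch (the queue and job names here are illustrative, not from the benchmark), this is what the BullMQ side can look like: wrap the queue name in a hashtag so that every key BullMQ derives for the queue shares that hashtag, and run Dragonfly with --lock_on_hashtags.

    // Sketch: a BullMQ queue whose name is wrapped in a hashtag, so all of its keys
    // (bull:{paymentQueue}:wait, bull:{paymentQueue}:meta, ...) share one hashtag.
    // Pair this with a Dragonfly server started with --lock_on_hashtags.
    import { Queue } from 'bullmq';

    async function main() {
      const queue = new Queue('{paymentQueue}', {
        connection: { host: 'localhost', port: 6379 },
      });
      await queue.add('charge', { orderId: 123 });
      await queue.close();
    }

    main();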

[Figure: hashtags]

Implementing the hashtag-locking method in Dragonfly has several key advantages:

  • Ease of Integration for BullMQ: It allows BullMQ to work with Dragonfly without specifying the exact keys that will be used, only the queue name itself, which is always known. This greatly streamlines the integration process.
  • Reduced Cross-Thread Coordination: By ensuring that all keys associated with a particular queue are handled by the same thread in Dragonfly, the need for cross-thread coordination is significantly diminished. We will cover more on this in the following sections.

However, there is a trade-off to consider: the limitation on parallelization within a single queue. While individual Lua scripts run serially, two different scripts can usually run in parallel if they involve keys managed by different threads. Under hashtag locking, all keys of a specific queue are allocated to the same thread in Dragonfly, so operations within the same queue cannot execute in parallel. However, we observed that all common BullMQ operations on a queue touch the same small set of keys, so they could not have been parallelized anyway.

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |
| Dragonfly w/ Hashtag Locks | 17,403 | ~2.26x Dragonfly Baseline |

Our first optimization gave us around a 126% increase, which is a nice start. Note that hashtag locking is disabled by default; Dragonfly users can turn it on via the --lock_on_hashtags flag.


Reducing Number of Hops

Reducing Number of Hops for Commands

As mentioned a couple of times, Dragonfly is multi-threaded. Incoming connections are also handled by multiple threads, with each connection randomly assigned to a single thread. This means that when BullMQ connects to a Dragonfly instance with 8 threads, it has a 1/8 chance of "landing" on the thread that owns its queue. In the remaining 7/8 of cases, the connection's thread (internally called the coordinator thread) has to run each command on the target thread separately. A script that runs 100 commands requires a "hop" to the target thread to lock the keys, another 100 hops to run the commands, and a final hop to unlock the keys. Each hop has a latency cost (as well as some minimal coordination overhead), and it adds up.

To mitigate this, we added a check for whether all of a script's operations run on a single (remote) thread. If they do, we perform a single hop to the target thread and run the script there, turning those 100 per-command hops into a single hop for all commands.

[Figure: reduce-hops-1]

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |
| Dragonfly w/ Hashtag Locks | 17,403 | ~2.26x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Commands) | 53,011 | ~6.98x Dragonfly Baseline |

With this optimization, Dragonfly achieved another 200% increase on top of hashtag-locking and reached 6.98x the baseline.

Reducing Hops Further for Scripts

After reducing hops for each command, we looked at the rest of the hops. We had 3 hops for each script invocation, no matter how many commands it issued:

  1. Lock keys.
  2. Run the Lua script to read/modify keys.
  3. Unlock keys.

We then modified our Lua invocation flow to run all steps under a single hop (lock, run, unlock).
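
Putting these two optimizations together for the 100-command example from earlier: the original flow costs roughly 102 hops (one to lock, one per command, one to unlock), batching the commands brings it down to 3 hops (lock, run, unlock), and folding those three steps into one brings it down to a single hop per script invocation.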

[Figure: reduce-hops-2]

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |
| Dragonfly w/ Hashtag Locks | 17,403 | ~2.26x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Commands) | 53,011 | ~6.98x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Scripts) | 122,890 | ~15.97x Dragonfly Baseline |

Again, by reducing the number of hops on the script level, we achieved another 132% increase, reaching 15.97x the baseline.


Connection Migration

Previously, I highlighted that, for an 8-thread Dragonfly server, the probability of a connection landing on the target thread is 1/8, and this likelihood decreases as the number of threads increases. When a connection tries to execute a script (or even a simple command) on a remote thread, it has to ask that thread to run the code on its behalf and then wait for that thread to become free, which may take some time.

To improve this situation, we've developed a connection migration mechanism. Currently, this feature is specifically tailored to BullMQ, where each queue typically doesn't share its connection with others. However, it holds potential benefits for other frameworks as well.

Migrating a connection to another thread is a delicate process, as Dragonfly uses thread-local variables quite intensively, but it saves that last hop, getting us to a place where connections seamlessly run on their target threads. We like this feature so much that we enabled it by default. It can, however, be disabled by running Dragonfly with --migrate_connections=false.

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |
| Dragonfly w/ Hashtag Locks | 17,403 | ~2.26x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Commands) | 53,011 | ~6.98x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Scripts) | 122,890 | ~15.97x Dragonfly Baseline |
| Dragonfly w/ Connection Migration | 189,756 | ~24.65x Dragonfly Baseline |

Round Robin Key Placement

Here comes the final optimization we have made so far in this journey. In Dragonfly, the distribution of keys (or queues) across threads is determined by their hash values, akin to a random distribution. This approach typically ensures an even load distribution when there are many keys, as it balances the computational load across all threads.

However, consider a situation where an 8-thread Dragonfly server is managing just 8 queues. In an ideal scenario, each thread would handle one queue, leading to a perfectly balanced load and optimal performance. But due to the random nature of hash-based key distribution, such an even distribution is very unlikely: assuming uniform random placement, the chance that 8 queues land on 8 distinct threads is 8!/8^8, or roughly 0.24%. When queues are distributed unevenly across threads, resources are used inefficiently: some threads may sit idle while others become bottlenecks, leading to suboptimal performance.

That is exactly why we implemented a very cool feature we call "shard round-robin". By using --shard_round_robin_prefix=queue, keys whose hashtags start with queue are distributed across the threads one by one, guaranteeing a near-even distribution of workloads (a launch example follows the notes below). This feature is relatively new, and you should note that:

  • Currently, this feature is only available for keys using hashtags, so in the example above, the key bull:{queue1} will use round-robin, while queue1 will not.
  • This feature should usually be disabled. It is useful only in cases of a small number of hashtags (like BullMQ queues) which are highly contended. If you use many keys (like in most Dragonfly use cases), do not use the feature, as it will in fact hurt performance.
[Figure: round-robin]
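
As a reference for a BullMQ-style setup, here is a hedged launch example combining the flags discussed in this post; it assumes the dragonfly binary is on your PATH and that queues are named {queue1} through {queue8}, so their hashtags share the queue prefix:

    # Lock on hashtags and place hashtags starting with "queue" round-robin across threads.
    # Use the round-robin flag only for a small number of highly contended hashtags.
    dragonfly --lock_on_hashtags --shard_round_robin_prefix=queue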

By this point, we've achieved a more than 30x increase in throughput over the baseline, which is a huge improvement!

| Backend | add-jobs/sec | Notes |
|---|---|---|
| Redis 6.2 | 71,351 | |
| Redis 7.2 | 76,773 | |
| Redis Cluster | 194,533 | |
| Dragonfly w/ Global Locks | 7,697 | Dragonfly Baseline |
| Dragonfly w/ Hashtag Locks | 17,403 | ~2.26x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Commands) | 53,011 | ~6.98x Dragonfly Baseline |
| Dragonfly w/ Reduced Hops (Scripts) | 122,890 | ~15.97x Dragonfly Baseline |
| Dragonfly w/ Connection Migration | 189,756 | ~24.65x Dragonfly Baseline |
| Dragonfly w/ Shard Round Robin | 253,075 | ~32.87x Dragonfly Baseline |

To put this into perspective, Dragonfly outperforms an 8-node Redis Cluster on the same hardware. Moreover, this kind of optimization is much harder to achieve with a Redis Cluster, as key distribution is baked into both Redis and Redis Cluster client libraries. Instead, Redis users may have to move cluster slots between nodes to enforce an even distribution.

[Figure: benchmarks]

Conclusion

In this post, we've taken a glimpse into our journey of integrating BullMQ with Dragonfly through a series of optimizations. Each step on this path was guided by our commitment to exceptional performance, ensuring that Dragonfly stands ready to handle the most demanding loads from BullMQ users.

The journey has been both challenging and rewarding, leading to developments that not only benefit BullMQ but also have the potential to enhance the performance of other Redis-based frameworks. Dragonfly is committed to embracing the open-source community and broadening the ecosystem. More integrations and frameworks will be tested with Dragonfly and released in the future. As always, start trying Dragonfly in just a few steps and build amazing applications!


Appendix - Benchmark Setup & Details

Here are some technical details for those who wish to reproduce the benchmarks:

  • All benchmarks in this post were run on the same setup, one after the other. We chose an AWS EC2 c7i.2xlarge 8-CPU instance for running Dragonfly or Redis, and a monstrous c7i.16xlarge 64-CPU instance for running BullMQ.

  • Operating system: Ubuntu 23.04, Linux kernel 6.2.0

  • All client-side invocations use this tool with the following command, which uses 8 queues and 16 threads on the BullMQ side:

    tsc index.ts && node index.js -h <server_ip> -p 7000 -d 30 -r 0 -w 16 -q 8
    
