
How We Use Control Loops to Manage Dragonfly Cloud Datastores

Learn how we build control loops to simplify datastore management across multi-cloud environments.

September 19, 2024


A control loop, specifically a closed loop, is a continuous, non-terminating process that monitors the actual state of a system, compares it against the desired state, and then takes corrective actions to move the actual state towards the desired state.

A simple example of a control loop is a thermostat. It is set to a preferred temperature, monitors the current temperature of the room, and then turns the heating on or off to reach the desired temperature.
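To make the idea concrete, the thermostat's closed loop can be sketched in a few lines of Go. This is only an illustration; readTemperature and setHeating are hypothetical placeholders for the real sensor and heater controls.

    package thermostat

    import "time"

    // thermostatLoop is a minimal closed loop: observe the actual state,
    // compare it with the desired state, and take a corrective action.
    // readTemperature and setHeating are hypothetical placeholders.
    func thermostatLoop(desired float64, readTemperature func() float64, setHeating func(on bool)) {
        for {
            actual := readTemperature()  // actual state
            setHeating(actual < desired) // corrective action towards the desired state
            time.Sleep(30 * time.Second) // wait before the next observation
        }
    }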

This post describes how we use control loops to manage datastores in Dragonfly Cloud. Dragonfly Cloud is Dragonfly's managed service offering, built to handle the operational demands of running Dragonfly datastores at scale, ensuring a robust system for all our cloud customers.


Dragonfly Cloud Control Plane

The Dragonfly Cloud control plane is responsible for managing customers' datastores. These datastores can be deployed across multiple cloud providers and regions, with customers having the flexibility to create dedicated networks for their datastores as well.

The control plane exposes APIs for customers to create, update, and delete datastores, networks, connections, and backups. These APIs configure a datastore's desired state, including its memory, performance tier, cloud provider, region, number of replicas, security, and so on.
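As an illustration, the desired state of a datastore could be modeled roughly like this. The struct and field names below are a hypothetical sketch, not our actual API schema.

    // DatastoreSpec is a hypothetical sketch of a datastore's desired state,
    // as configured through the control plane APIs.
    type DatastoreSpec struct {
        MemoryGiB     int    // datastore memory
        Tier          string // performance tier
        CloudProvider string // e.g. "aws" or "gcp"
        Region        string // e.g. "us-east-1"
        Replicas      int    // number of replica nodes
        TLSEnabled    bool   // security settings
        NetworkID     string // optional dedicated network for the datastore
    }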

For each datastore, the control plane runs a control loop to ensure the system aligns with the desired state by performing tasks such as:

  • Provisioning and deprovisioning resources (instances, networks, firewalls, DNS, etc.)
  • Starting and configuring the Dragonfly process
  • Applying datastore configuration updates
  • Monitoring, failover, and recovery
  • Configuring a Dragonfly Cluster
  • Managing backups
  • TLS rotation & security updates
  • ...

Each datastore node runs an 'agent' that allows the control plane to manage the node even though it's in a different network (using a tunneling proxy), performing tasks such as starting and configuring the Dragonfly process.

Simplified Dragonfly Cloud Architecture

Control Loops

The control plane schedules a control loop for each datastore when the datastore is created, then unschedules the control loop once the datastore and its resources have been deleted.

Each control loop runs at a fixed 10-second interval (the 'tick rate'). On each iteration, the loop loads the desired and actual state of the datastore, then takes the required steps to move the actual state towards the desired configuration.
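In simplified Go, the per-datastore loop looks roughly like this. It is a sketch rather than our actual code; loadDesiredState, loadActualState, and reconcile are hypothetical helpers.

    package controlloop

    import (
        "context"
        "time"
    )

    const tickRate = 10 * time.Second

    // run drives the control loop for a single datastore until its context
    // is cancelled, i.e. the datastore and its resources have been deleted.
    func run(ctx context.Context, datastoreID string) {
        ticker := time.NewTicker(tickRate)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                desired := loadDesiredState(datastoreID) // customer and internal configuration
                actual := loadActualState(datastoreID)   // persisted state, monitors, nodes, cloud APIs
                reconcile(ctx, desired, actual)          // take the steps needed to converge
            }
        }
    }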

The alternative to using a fixed interval would be to only run the loop when there are changes to the desired or actual state. However, as described below, the actual state comes from multiple sources, which would make reliably detecting changes complex.

The desired state of a datastore includes both the datastore configuration set by the customer and the internal configuration of each node in the datastore.

A datastore's actual state comes from a number of sources, including:

  • The persisted state of the datastore and its nodes, such as whether an instance has been provisioned, and information about the instance running the node
  • Monitors determining the health of each node (primary, replica, or cluster node)
  • The Dragonfly process of each node, such as the Dragonfly configuration, replication status, and cluster configuration
  • Cloud provider APIs, such as the status of an instance or the configuration of a firewall
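Conceptually, each iteration merges these sources into a single view of the actual state, along the lines of this hypothetical sketch:

    // ActualState is a hypothetical aggregation of the sources listed above.
    type ActualState struct {
        ProvisionedNodes map[string]string // persisted state: node ID -> instance ID
        NodeHealthy      map[string]bool   // results from the node health monitors
        ReplicationRoles map[string]string // reported by each node's Dragonfly process
        InstanceStatus   map[string]string // from the cloud provider APIs
        FirewallRules    []string          // from the cloud provider APIs
    }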

A Dragonfly Cloud Datastore with Two Nodes Running on Different AZ Servers


Control Loop Requirements

When implementing control loops for Dragonfly Cloud, several key requirements must be met, as outlined below.

Non-Blocking

To ensure responsiveness, each control loop iteration must complete within the 10-second tick rate, meaning it cannot block on slow operations. For example, if the control loop were blocked, such as waiting 5 minutes for a backup to complete, it would be unable to respond to higher-priority work like recovering unhealthy nodes or applying datastore updates.

As a result, all slow operations must be handled asynchronously. An iteration may start an operation, and subsequent iterations will check whether the operation has completed or failed.
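For example, a backup might be handled across iterations roughly like this. The sketch below is illustrative; DatastoreState, startBackup, and backupFinished are hypothetical.

    // reconcileBackup starts a slow operation asynchronously and lets later
    // iterations observe its progress instead of blocking the loop on it.
    func reconcileBackup(ds *DatastoreState) {
        switch {
        case ds.BackupDue && ds.PendingBackupID == "":
            // Start the backup and record the operation; return immediately.
            ds.PendingBackupID = startBackup(ds.ID)
        case ds.PendingBackupID != "" && backupFinished(ds.PendingBackupID):
            // A later iteration observes that the backup completed (or failed).
            ds.PendingBackupID = ""
            ds.BackupDue = false
        }
    }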

Stateless

The control loop may be rescheduled to run on a different control plane server between iterations or could crash at any time. For this reason, the control loop must be stateless, meaning it cannot retain any local state between iterations. Instead, it must depend solely on the desired and actual datastore states, which are fetched from external sources.

This means the control loop doesn't know the result of the previous iteration, such as which operations completed or failed. However, by relying on the desired and actual datastore states, the control loop can infer which operations are needed. For example, if a previous attempt to provision a node failed, the next iteration will detect that the node hasn't been provisioned and will attempt to provision it again.
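In other words, each iteration derives its work purely by comparing the two states, as in this hypothetical sketch (provisionNode is a hypothetical helper, sketched under Idempotent below):

    // reconcileNodes decides what to do purely from the desired and actual
    // states, never from the result of a previous iteration.
    func reconcileNodes(desiredNodes []string, provisionedNodes map[string]string) {
        for _, node := range desiredNodes {
            if _, ok := provisionedNodes[node]; !ok {
                // Covers both "never provisioned" and "a previous attempt failed":
                // the loop can't tell the difference, and it doesn't need to.
                provisionNode(node)
            }
        }
    }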

Idempotent

Since operations in the control loop may fail or the control loop itself could crash, every operation must be idempotent to allow safe retries in subsequent iterations. Idempotency ensures that an operation can be executed multiple times without altering the final outcome.

Some operations are naturally idempotent; for example, updating a DNS record to point to a given address can be repeated multiple times and produce the same result. Others, however, require additional checks. In particular, when provisioning a datastore node, the control loop must first verify whether an instance already exists for that node.
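A sketch of that check, where findInstance and createInstance are hypothetical wrappers around the cloud provider API:

    // provisionNode is idempotent: it first checks whether an instance already
    // exists for the node (e.g. looked up by a deterministic name or tag), so
    // retrying after a crash or failure never creates a duplicate.
    func provisionNode(nodeID string) (instanceID string, err error) {
        if existing, ok := findInstance(nodeID); ok {
            return existing, nil // a previous iteration already created it
        }
        return createInstance(nodeID)
    }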

Alerting

There can be situations the control loop can't handle on its own, such as operations that consistently fail.

In these cases, we escalate to the on-call team to investigate. They can then either explicitly instruct the control loop what to do, such as forcefully failing over a node, or pause the control loop to take over manually, then resume it once they are finished.
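A simplified sketch of how that fits into each tick; the failure threshold and the reconcileOnce/pageOnCall helpers are hypothetical.

    // tick escalates when the loop cannot make progress and respects a manual
    // pause set by the on-call team.
    func tick(ds *DatastoreState) {
        if ds.Paused {
            return // an operator has taken over manually
        }
        if err := reconcileOnce(ds); err != nil {
            ds.ConsecutiveFailures++
            if ds.ConsecutiveFailures >= 5 {
                pageOnCall(ds.ID, err) // operations are consistently failing
            }
            return
        }
        ds.ConsecutiveFailures = 0
    }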


Control Loop Example: Datastore Failover

Let's walk through an example of how the control loop manages a datastore failover. When the control loop detects that a datastore's primary node is unhealthy and healthy replicas are available, it initiates a failover to the replica with the highest replication offset. If no healthy replicas are available, the primary node will be recovered without a failover.
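Choosing the failover target can be as simple as this sketch, where ReplicaInfo is a hypothetical type:

    // pickFailoverTarget returns the healthy replica with the highest
    // replication offset, i.e. the one that has lost the least data.
    func pickFailoverTarget(replicas []ReplicaInfo) (ReplicaInfo, bool) {
        var best ReplicaInfo
        found := false
        for _, r := range replicas {
            if r.Healthy && (!found || r.ReplicationOffset > best.ReplicationOffset) {
                best, found = r, true
            }
        }
        return best, found // found == false: recover the primary without a failover
    }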

To complete the failover, the replica is configured as a primary with the REPLICAOF NO ONE command. Then the firewall rules are updated to allow traffic to the promoted node and block traffic to the unhealthy node. Next, the datastore's DNS record is updated to point to the promoted node as well. And finally, the control loop persists the updated node roles, at which point the failover is complete.
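Put together, the failover sequence looks roughly like this sketch; the helper functions are hypothetical.

    // failover promotes the chosen replica and redirects traffic to it.
    // Every step is idempotent, so an interrupted failover can simply be
    // retried on the next iteration.
    func failover(ds *DatastoreState, target ReplicaInfo) error {
        if err := promoteToPrimary(target); err != nil { // sends REPLICAOF NO ONE
            return err
        }
        if err := updateFirewall(ds, target); err != nil { // allow traffic to the new primary, block the old one
            return err
        }
        if err := updateDNS(ds, target); err != nil { // point the datastore's DNS record at the new primary
            return err
        }
        return persistNodeRoles(ds, target) // record the new roles; failover complete
    }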

If any of the above operations fails, such as updating DNS, the failover can't be completed. Since the control loop is stateless, the next iteration won't know that the previous attempt failed or which operations failed, but it will again find that the primary node is unhealthy and attempt to fail over to an available replica. Since all operations are idempotent, the control loop can safely retry the failover.

The control loop also monitors that replica nodes are healthy and configured correctly. After the failover, it will therefore find that the demoted node is unhealthy and recover it as a replica. Likewise, if any other replicas are configured incorrectly, they'll be updated to replicate from the promoted node.

If our monitoring finds that the datastore has been unhealthy for longer than a set timeout, or that the control loop is unable to make progress, we notify the on-call team to investigate.


Why We Didn't Use Kubernetes

Anyone familiar with Kubernetes will likely recognize the control loop pattern.

While many managed services use Kubernetes to manage their workloads, we instead chose to build our own control plane, whose design was influenced by Kubernetes.

Our datastore deployments don't fit well into the standard Kubernetes use case of bin packing pods onto multiple interchangeable instances. Instead, to achieve optimal performance, each datastore node is deployed to a dedicated instance that customers connect to directly (without an intermediate proxy or load balancer). The instance type is determined by the datastore's memory and performance tier. Instances are also usually deployed to a dedicated network (VPC) created by the customer for their datastores.

Even though this approach is likely possible with Kubernetes by deploying each pod to a dedicated instance with the desired type and network, it would negate many of Kubernetes' advantages. Furthermore, we would still face significant complexity, such as:

  • Managing Kubernetes clusters across multiple cloud providers and dozens of regions
  • Synchronizing state between our APIs and Kubernetes

Therefore, we decided it would be simpler to build and manage our own control plane, which also gives us a lot more flexibility and is easier to extend.

Kubernetes is a very robust system for many use cases, particularly orchestrating containers at scale. However, it wasn't the right fit for Dragonfly Cloud: the demands of our multi-cloud, multi-tenant, platform-as-a-service architecture required a more tailored solution.

That said, we still leverage Kubernetes to run our control plane services, as discussed earlier in this blog post. For those in the community interested in running their own Dragonfly pods on Kubernetes, we provide a Dragonfly Kubernetes Operator as well.


Try Dragonfly Cloud

By leveraging custom control loops and a purpose-built control plane, Dragonfly Cloud can deliver a robust, flexible, and user-friendly experience for managing your Dragonfly datastores.

Ready to experience the power of Dragonfly Cloud for yourself? Sign up today and see how we simplify datastore management while ensuring top-tier performance at scale.
