Dragonfly

High Availability in Cloud-Based In-Memory Data Stores

Learn how Dragonfly Cloud ensures high availability with replica configurations, robust failover mechanisms, and performance-focused architecture, delivering continuous uptime for cloud-based in-memory data stores.

December 18, 2024


Introduction

Suppose you have a database (or a data store) for your e-commerce application deployed in the cloud. The database is providing great throughput, customers are happy, and thus, you're happy. New Year's Eve is approaching, and you're working day and night to ensure the application runs smoothly for your customers. The day arrives, and at first, you're excited to see the sales graph reaching new heights.

You're practically daydreaming about expanding the business when, suddenly, a series of alerts starts flooding in from your monitoring stack. You quickly check the dashboard, and with sheer horror, you realize the database instance you've relied on is down! After investigating the issue, you discover that the entire zone where the database resided has gone down due to a cloud failure. No matter how many times you try restarting the database, it won't come back online.

All your hard work is undone because your database went down, and you didn't have a backup plan. In today's fast-paced digital landscape, downtime is not an option. Applications must be available 24/7 to meet user expectations and business demands.

So, how do you overcome this problem? The solution is high availability (HA). HA ensures that systems remain operational despite hardware failures, network issues, or other disruptions. This is particularly critical for cloud-based applications using on-disk databases or in-memory data stores, where speed and availability are non-negotiable. In this blog post, we'll explore core HA architectures, their building blocks, and how Dragonfly Cloud ensures zero downtime.


High Availability Architectures for Data Stores

So, what exactly is high availability? The concept is simple: deploy more than one data store instance in the same or different locations. If the primary instance goes down, your application falls back to another instance, ideally maintaining the same throughput.
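As a minimal illustration of the fallback idea, the sketch below probes a list of instance endpoints in priority order and picks the first healthy one. The endpoint names and the probe function are invented for illustration; a real client would rely on its driver's connection and retry logic.

```python
from typing import Callable, Iterable, Optional

def first_healthy(endpoints: Iterable[str],
                  probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, else None."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None

# Example: the primary is unreachable, so the client falls back to the replica.
status = {"primary:6379": False, "replica:6379": True}
chosen = first_healthy(["primary:6379", "replica:6379"], status.get)
print(chosen)  # replica:6379
```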

That said, database and infrastructure engineers can adopt different strategies to achieve high availability based on their use cases. Some of these are described below.

Active-Active

An active-active architecture involves multiple nodes simultaneously handling both read and write requests. User requests can hit any of the nodes for read or write operations, which significantly increases system availability. Another advantage is that customers in different cloud regions see roughly the same latency, since each request is served by the nearest node.

[Diagram: Active-active architecture with multiple nodes serving reads and writes]

However, there are some downsides too—nodes must stay in sync with each other to ensure data consistency. While strong consistency is critical for many distributed on-disk databases, achieving it across multiple nodes typically requires using complex consensus algorithms—such as Paxos or Raft—to ensure all participating nodes agree on the system state before acknowledging writes. As a result, consensus introduces more data sharing between nodes, which increases network bandwidth usage and latency (in exchange for higher availability and overall throughput) compared to a standalone deployment. This approach also requires more expertise and results in higher operational costs.
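To make the consensus cost concrete, here is a toy majority-quorum write check. It only illustrates the commit rule; real Paxos/Raft implementations add leaders, terms, and replicated logs, all of which this sketch deliberately omits.

```python
def quorum_write(acks: list) -> bool:
    """A write commits only if a strict majority of nodes acknowledge it.
    Each additional node means another network round trip before commit."""
    needed = len(acks) // 2 + 1
    return sum(acks) >= needed

# With 3 nodes, 2 acknowledgments are enough to commit:
print(quorum_write([True, True, False]))                 # True
# With 5 nodes, 2 acknowledgments are not a majority:
print(quorum_write([True, True, False, False, False]))   # False
```

The `needed` count is why active-active setups trade latency for availability: a write cannot be acknowledged to the client until a majority of nodes have responded.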

Active-Passive (Primary-Replica)

In this architecture, a primary node handles all write operations, while one or more replica nodes passively receive updates from the primary and serve read-heavy workloads. The replicas are ready to take over if the primary fails, ensuring that at any given time, exactly one node is responsible for handling writes. This approach is simpler to configure than a strongly consistent setup, as it doesn't require complex consensus algorithms. Replicas merely replicate data from the primary, though they may temporarily lag behind in applying the latest changes.

[Diagram: Active-passive architecture with a primary node and replicas]

The downside is that a failover may temporarily increase latency as the new primary absorbs the write workload, and any writes the failed primary had not yet replicated to the promoted replica may be lost.
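A primary-replica client typically routes writes to the primary and spreads reads across replicas. The hedged sketch below shows that routing with plain dicts standing in for real nodes (the names and the round-robin policy are illustrative assumptions):

```python
import itertools

class PrimaryReplicaRouter:
    """Route writes to the primary; round-robin reads across replicas."""
    def __init__(self, primary: dict, replicas: list):
        self.primary = primary
        self._reads = itertools.cycle(replicas)

    def write(self, key, value):
        self.primary[key] = value  # replication to replicas happens asynchronously

    def read(self, key):
        return next(self._reads).get(key)

primary, r1, r2 = {}, {}, {}
router = PrimaryReplicaRouter(primary, [r1, r2])
router.write("cart:42", "3 items")
# Until replication catches up, a replica read may lag behind the primary:
print(router.read("cart:42"))  # None (replicas not yet synced)
```

The `None` result is the replication lag mentioned above: replicas serve reads cheaply, but they can trail the primary by a few moments.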


High Availability Building Blocks

Setting up high availability requires additional resources and mechanisms. While there are different architectures for implementation, all share some common building blocks.

  • Replicas: Replicas ensure redundancy by maintaining real-time copies of data across nodes. This redundancy provides resilience against hardware or network failures. When the master fails, the data store system can fall back to one of the replicas.
  • Distributing Across Zones: Nodes can be deployed in a single zone/region or across different zones/regions, depending on your use case. Single-zone deployments may be better than having no backup nodes, but they remain vulnerable to zone-wide failures. In contrast, multi-region deployments are more tolerant of such failures.
  • Monitoring: Proactive monitoring is crucial for high availability. Without it, systems would lack visibility into the health of nodes and be unable to respond to failures. This includes monitoring node health, detecting anomalies, and preemptively addressing potential issues.
  • Failover: When the monitoring stack detects failures or unhealthy nodes, it should immediately trigger a failover to the next best candidate, ensuring that the data store continues to operate without interruption.
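Putting these building blocks together, a monitoring loop might mark a node unhealthy after a run of consecutive failed health checks and promote a replica. This is a simplified sketch under invented thresholds and field names, not Dragonfly Cloud's actual implementation:

```python
FAILURE_THRESHOLD = 3  # consecutive failed checks before triggering failover

class Node:
    def __init__(self, name: str, zone: str, role: str):
        self.name, self.zone, self.role = name, zone, role
        self.failures = 0  # consecutive failed health checks

def record_check(node: Node, healthy: bool) -> None:
    node.failures = 0 if healthy else node.failures + 1

def maybe_failover(primary: Node, replicas: list) -> Node:
    """Promote the first healthy replica once the primary crosses the threshold."""
    if primary.failures < FAILURE_THRESHOLD:
        return primary
    for replica in replicas:
        if replica.failures == 0:
            replica.role = "primary"
            return replica
    return primary  # no healthy candidate; keep the old primary

primary = Node("node-a", "us-east-1a", "primary")
replica = Node("node-b", "us-east-1b", "replica")
for _ in range(3):                 # three missed health checks in a row...
    record_check(primary, False)
new_primary = maybe_failover(primary, [replica])
print(new_primary.name)  # node-b
```

Note how the zone attribute matters: placing `node-b` in a different zone is what makes this failover survive a zone-wide outage like the one in the introduction.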

High Availability in Action: How Dragonfly Cloud Delivers Continuous Uptime

Enough theory! Now, let's explore how Dragonfly Cloud takes high availability to the next level with its innovative architecture and built-in failover mechanisms. Here's how it works.

Single-Zone and Multi-Zone Replica Configuration

Dragonfly Cloud supports both single-zone and multi-zone deployments, allowing architects to tailor configurations to their application's resilience needs. Multi-zone setups are particularly valuable for mission-critical applications that require geo-redundancy. You can configure the zone setup when creating a data store.

Data Store Health Detection

Dragonfly Cloud employs sophisticated health detection mechanisms to preemptively address potential disruptions. Each node has an agent running alongside the Dragonfly instance. The agent is responsible for two things: executing commands received from the Dragonfly Cloud controller[^1] and monitoring the health of the Dragonfly process. The agent reports the node's health back to our server, where the controller analyzes the situation and triggers a failover to the next best replica candidate if the master node is down.

An agent comprises three processes that assess the node's health at regular intervals. These processes are:

  • Failure Detector: Continuously probes for ping failures to Dragonfly. If ping failures exceed a predefined threshold, the node is marked as unhealthy, prompting the controller to initiate a failover to a healthy replica. In some cases, the agent might successfully ping the Dragonfly process, but other nodes may still fail to connect to it. To address this, agents also ping random Dragonfly processes on other nodes and report the node as unhealthy if failures persist over a defined time period. The controller evaluates both local and remote failures of the master node to determine appropriate actions.
  • Health Detector: Monitors the health of the local Dragonfly instance, identifying issues such as deadlocks, role discrepancies, and replication mismatches. Any detected failures are immediately reported to the monitoring stack, alerting human operators to intervene.
  • CPU and Memory Monitor: Tracks resource usage to identify potential performance bottlenecks and ensure balanced load distribution across nodes. Although not directly tied to high availability, monitoring CPU and memory utilization provides valuable insights into a node's health and performance.
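The failure detector described above can be approximated as counting failed pings within a sliding time window. The thresholds, window size, and class structure here are illustrative assumptions, not the actual agent code:

```python
import time
from collections import deque
from typing import Optional

class FailureDetector:
    """Mark a node unhealthy when failed pings exceed a threshold in a window."""
    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self.failed = deque()  # timestamps of failed pings

    def record_ping(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if not ok:
            self.failed.append(now)
        # Drop failures that have fallen out of the window.
        while self.failed and now - self.failed[0] > self.window:
            self.failed.popleft()

    def unhealthy(self) -> bool:
        return len(self.failed) >= self.threshold

local = FailureDetector(threshold=3, window_seconds=10.0)
for t in (1.0, 2.0, 3.0):          # three failed pings within the window
    local.record_ping(ok=False, now=t)
print(local.unhealthy())  # True
```

In the real system, the controller would combine this local verdict with reports from agents on other nodes before deciding that the master is truly down.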

Failover

Dragonfly Cloud features a robust automated failover mechanism that redirects traffic to healthy replicas within seconds, minimizing downtime and preventing data loss. The Dragonfly Cloud controller spawns a reconciler for each data store. The reconciler, in addition to propagating the data store state to the nodes, is also responsible for ensuring high availability. It fetches health metrics from the node and analyzes the overall condition of the data store. If the health status is not reported for a certain period or if the agent on the master node reports it as unhealthy, the reconciler falls back to one of the healthy replicas.

We also have internal commands to trigger manual failover, useful when the automated mechanism isn't sufficient to resolve availability issues. In those rare cases, on-call engineers take over and ensure the high availability of data stores.
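In spirit, the reconciler's availability logic reduces to: if the latest health report is stale or explicitly unhealthy, fail over to a replica. A hedged sketch follows; the field names and staleness timeout are invented for illustration.

```python
STALE_AFTER = 15.0  # seconds without a health report before acting (hypothetical)

def needs_failover(report_age: float, reported_healthy: bool) -> bool:
    """Fail over when the report is stale or the agent says the master is down."""
    return report_age > STALE_AFTER or not reported_healthy

def reconcile(data_store: dict) -> dict:
    if needs_failover(data_store["report_age"], data_store["healthy"]):
        replicas = [n for n in data_store["nodes"] if n["role"] == "replica"]
        if replicas:
            new_primary = replicas[0]
            new_primary["role"] = "primary"
            data_store["primary"] = new_primary["name"]
    return data_store

store = {
    "primary": "node-a",
    "report_age": 42.0,   # no health report received for 42 seconds
    "healthy": True,
    "nodes": [{"name": "node-a", "role": "primary"},
              {"name": "node-b", "role": "replica"}],
}
print(reconcile(store)["primary"])  # node-b
```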

Note that at the time of writing, Dragonfly Cloud doesn't yet provide an active-active configuration for in-memory workloads. This is because we are currently prioritizing performance and throughput. Implementing a multi-node consensus algorithm, which is essential for active-active setups, would introduce significant overhead and impact latency. For Dragonfly, sub-millisecond latency is crucial to delivering the high-performance, real-time experience our users and community expect, and as such, we are focusing on ensuring optimal throughput and efficiency with our current offerings.


Conclusion

High availability is the cornerstone of reliable cloud-based in-memory data stores. By leveraging active-active or active-passive architectures, distributing replicas across zones, and implementing robust failover mechanisms, architects and DevOps engineers can ensure uninterrupted service. Dragonfly Cloud exemplifies how modern in-memory data stores can deliver continuous uptime, providing peace of mind for both businesses and end users.

Whether you should configure high availability for your on-disk databases or in-memory data stores depends on your specific use cases. However, in general, high availability is something you should always consider.


If you're interested in learning more about the Dragonfly Cloud Control Plane, check out our previous blog post.
