Best Practices for Managing In-Memory Data Stores

In-Memory Data Stores

In-memory data stores have evolved significantly over the past few decades, starting with Memcached, which was released in 2003 as a general-purpose distributed memory caching system to speed up web applications by relieving database load. Redis, introduced in 2009 by Salvatore Sanfilippo, brought a new dimension to in-memory storage by offering a rich set of data structures such as strings, lists, sets, and hashes, along with persistence options, making it versatile for various applications beyond caching. In 2019, KeyDB emerged as a high-performance fork of Redis, maintaining full compatibility while introducing multi-threading capabilities to leverage modern multi-core processors, thereby offering enhanced performance and scalability. More recently, Dragonfly, a modern multi-threaded high-performance drop-in replacement for Redis, has entered the scene.

RAM: High Speed with Volatility

By nature of how they work, in-memory data stores make it possible for applications to be highly performant. High RAM speeds make it possible to retrieve data much faster than from traditional disk storage. Rich data structures, Pub/Sub messaging, streams, auto-expiration, and server-side scripting are all available with minimal latency. Together, these features all but guarantee high performance for your application, right?

Well, yes, but only if the potential risks are properly mitigated. Because RAM is volatile and persistence is not strongly guaranteed, a system failure, crash, or power-down can result in data loss. The limited size of RAM can create scaling challenges, and misconfigured access control can introduce security issues. But you can get high speed and low latency without succumbing to these pitfalls. In this blog post, we will explore best practices for running in-memory data stores.

Snapshots, Backups, and Restoration

Snapshots can capture the state of the data store at a specific point in time and backups make copies of snapshots to be stored in a different location for long-term preservation. Once an in-memory data store instance is restarted, the data is lost unless it is restored from a backup, or gradually rebuilt from subsequent requests if used as a cache.
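To make "gradually rebuilt from subsequent requests" concrete, here is a minimal sketch of a cache-aside read path. It is only an illustration: a plain dict stands in for the in-memory data store, and the `CacheAside` class, its `simulate_restart` method, and the `load_from_db` callback are hypothetical names, not part of any real client library.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside reads: try the cache first, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        # Assumption: a dict stands in for the in-memory data store.
        self._cache: Dict[str, str] = {}
        self._load_from_db = load_from_db
        self.misses = 0  # counts how often the database had to be consulted

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]
        # Cache miss: read from the database and repopulate the cache.
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value

    def simulate_restart(self) -> None:
        # Without a snapshot to restore from, a restart leaves the cache empty.
        self._cache.clear()
```

After `simulate_restart()`, every key misses once before it is cached again, which is the extra database load that restoring from a snapshot avoids.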

Done manually, regular snapshots and backups are tedious and error-prone because of how often they must run and be monitored. Automating the process removes that barrier. Using both snapshots and backups mitigates the risk of data loss and allows for faster recovery upon restart (since snapshots are normally in a binary format, loading them back into memory is fast by design). Note that although the data is mainly stored and manipulated in memory, writing the data set to disk prevents data loss only to an extent, so we still would not recommend using an in-memory data store as your primary database.

So what are the best practices to keep in mind for taking snapshots? Apps with high data change rates, a shorter RPO (recovery point objective), and/or mission-critical data need more frequent snapshots, and automation should be considered to reduce tedium and the potential for human error. Consider these steps for better snapshotting and backup practices:

Establish clear retention policies to manage snapshot storage and remember to add automatic deletion of snapshots that are no longer needed.
Validate the integrity of your snapshots and confirm proper data collection.
Schedule snapshots during low-usage periods when possible and ensure that sufficient resources are allocated for the process to minimize impact on application performance.
Remember to maintain updated documentation (and use consistent naming conventions) for your snapshot process and provide training to ensure your team's readiness for recovery in the case of a failure.
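As one way to automate the retention policy from the list above, the sketch below deletes snapshot files that are both outside the newest few and older than a cutoff. The function name `prune_snapshots`, the `*.rdb` file pattern, and the two-condition policy are assumptions chosen for illustration; adapt them to your own naming conventions and retention rules.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_snapshots(
    directory: Path,
    keep_latest: int,
    max_age_seconds: float,
    now: Optional[float] = None,
    pattern: str = "*.rdb",  # assumption: snapshots share a common file suffix
) -> List[Path]:
    """Delete snapshots that are beyond the newest `keep_latest` files
    AND older than `max_age_seconds`. Returns the paths that were removed."""
    now = time.time() if now is None else now
    # Newest first, based on file modification time.
    snapshots = sorted(
        directory.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True
    )
    removed: List[Path] = []
    for path in snapshots[keep_latest:]:
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

Requiring both conditions is a conservative choice: the newest snapshots always survive, even if snapshotting has silently stopped and everything on disk is old.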

Now that you have your snapshots, you need to be able to back up and restore your data as needed. Here are some considerations to keep in mind:

Your storage options depend on your RTO (recovery time objective), RPO, and budget. There are several choices, including local storage, which would be fast and cheap, but also vulnerable to local failures. Cloud options like AWS S3 and Google Cloud Storage would be highly scalable and durable but with potentially higher costs. Different types of storage can be best suited for different operations.
Define a retention period (the duration for which a snapshot is stored before being deleted or overwritten) and automate retention management for consistency. Note that it's normally okay for an in-memory data store to retain only the latest snapshot, but multiple snapshots may be needed depending on the application's needs.
It is also important to test the process of data restoration from backup regularly. The backup system and procedures must run smoothly to protect the data in an emergency. It is a good idea to rehearse this process to verify the reliability and effectiveness of your backups.
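A small part of that testing can be automated at backup time. The sketch below copies a snapshot to a backup directory and compares SHA-256 checksums of the source and the copy. The helper names `sha256_of` and `backup_snapshot` are hypothetical, and a local directory stands in for whatever storage you actually use.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_snapshot(snapshot: Path, backup_dir: Path) -> Path:
    """Copy `snapshot` into `backup_dir` and verify that the copy is byte-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / snapshot.name
    shutil.copy2(snapshot, target)
    if sha256_of(snapshot) != sha256_of(target):
        # Do not keep a backup we already know is corrupt.
        target.unlink()
        raise OSError(f"checksum mismatch while backing up {snapshot}")
    return target
```

A matching checksum only shows that the copy is intact. It does not show that the snapshot can actually be loaded, so a periodic restore rehearsal is still needed.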

So let's take a closer look at how snapshots and backups can be approached with Memcached, Redis, Valkey, KeyDB, and Dragonfly. You'll see that several of the options include mechanisms for snapshots and backups natively, but even if you choose an in-memory data store that requires a third-party tool for this best practice, it is absolutely worth the additional effort.

	Memcached	Redis, Valkey, KeyDB	Dragonfly
Backup Options	Not out-of-box, third-party tools available	RDB Snapshot, AOF (Append Only File)	Snapshot
Automation	Depends on the third-party tool you choose. Additional development work may be needed.	RDB: Triggered by an elapsed time period and the number of write operations during that period. Multiple rules are allowed. AOF: The fsync policy can be configured as 'always', 'everysec', or 'no'.	Snapshots are periodic and can be configured using a cron spec.
Storage Location	Depends on the third-party tool you choose. Additional development work may be needed.	Locally by default.	Locally by default. Can be configured to back up to and restore from cloud storage (e.g., AWS S3).
Risks	Third-party tool reliability.	Redis uses the fork system call to create snapshots, which can increase memory usage while a snapshot is in progress. Over-provisioning busy instances is a must.	By default, Dragonfly keeps snapshots for different timestamps and doesn't remove them automatically. This is by design, but you need to manage your storage space.
Best Practices	Be well-trained in your chosen third-party tool.	Using RDB and AOF together is recommended.	Periodically remove old snapshots. Beyond that, Dragonfly snapshotting is fast and doesn't spike memory usage on busy instances during the process.
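The RDB automation rule in the table ("time period and number of operations", with multiple rules allowed) can be modeled in a few lines. This is a simplified model of how rules such as `save 900 1` combine, not Redis's actual implementation; the function name `should_snapshot` and the rule values are illustrative assumptions.

```python
from typing import Iterable, Tuple

# Each rule is (seconds, changes), mirroring a config line such as "save 900 1".
SavePoint = Tuple[int, int]


def should_snapshot(
    save_points: Iterable[SavePoint],
    seconds_since_last_save: int,
    changes_since_last_save: int,
) -> bool:
    """Return True if ANY rule is satisfied: at least `seconds` have elapsed
    and at least `changes` writes have happened since the last snapshot."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )
```

With the rules `[(900, 1), (300, 10), (60, 10000)]`, 500 writes within 120 seconds do not trigger a snapshot yet, while a single write is enough once 900 seconds have passed. Busy instances therefore snapshot more often than quiet ones.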

High Availability

High availability is essential for in-memory data stores to ensure uninterrupted access to data (minimizing downtime) and provide a reliable infrastructure, given that they rely on RAM, which is fast but volatile.

The key mechanism for maintaining high availability is implementing replication and automatic failover. Data replication across multiple nodes or data centers creates redundancy, ensuring that a copy is always available even if one node fails. Replicas are normally read-only and can serve read requests, which offloads the primary node. This also helps with availability, since a less heavily loaded primary is less likely to fail. Here's what to keep in mind as you are setting up replication:

After identifying your replication architecture and setting up the primary and replica nodes, configure the replicas for failover. This sets up an environment in which, if the primary node fails, one of the replica nodes automatically takes over as the new primary.
Automatic failover can be implemented using tools like Redis Sentinel to continuously check the health of primary and replica nodes and promote a replica as the primary automatically in case of failure.
For applications running on Kubernetes, operators can be used to manage the lifecycle of in-memory data stores and ensure high availability by automating tasks such as deployment, scaling, failover, and backup.
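The failover idea described above can be sketched as a toy model: if the primary is unhealthy, promote the healthy replica that has applied the most of the primary's writes. This is deliberately simplified and is not how any particular tool decides. Real systems such as Redis Sentinel apply additional checks (for example, agreement among several monitors) that are omitted here, and the `Node` fields and `elect_new_primary` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool
    replication_offset: int  # how much of the primary's write stream was applied


def elect_new_primary(primary: Node, replicas: List[Node]) -> Optional[Node]:
    """Keep the current primary if it is healthy. Otherwise promote the most
    up-to-date healthy replica, or return None if no replica is available."""
    if primary.healthy:
        return primary
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    # Preferring the highest offset minimizes the writes lost during failover.
    return max(candidates, key=lambda r: r.replication_offset)
```

Choosing by replication offset reflects the trade-off behind asynchronous replication: any writes the promoted replica had not yet received are lost, so the most caught-up replica is the safest candidate.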

	Memcached	Redis, Valkey, KeyDB	Dragonfly
HA Options	No de-facto choice, but third-party tools available.	Replication, Sentinel, K8s Operator	Replication, Sentinel, K8s Operator
Risks	Third-party tool reliability.	Replication relies on RDB for the initial full sync. Multiple replicas add stress to the primary instance.	Dragonfly is compatible with Sentinel. However, the failover (primary election) behavior may not be exactly the same as in Redis.
Best Practices	Be well-trained in your chosen third-party tool.	Choose the most suitable tool for your application. Do not add too many replicas; 2-4 replicas on a single primary is a reasonable range.	Able to support more replicas. However, additional replicas only offload read operations, so keep the number in a reasonable range for your read/write load.

The table above summarizes what the in-memory data stores on the market offer for maintaining high availability, along with the accompanying risks and best practices.

Scalability: Horizontal vs. Vertical

With the capacity limitations of RAM, dealing with high volumes of data and rapid growth is always a problem. Understanding scalability challenges and the pros and cons of different scaling methods is essential for designing an efficient and reliable system.

As a system scales, high data volume can exceed the memory of individual nodes. This volume comes with a larger number of read/write requests, which can overwhelm a single node, cause bottlenecks, and degrade throughput. Maintaining low latency and high availability becomes difficult as well, not to mention the overall operational complexity. Problematic? Yes, but not insurmountable with the right vertical and horizontal scaling techniques.

Vertical scaling, or scaling up, increases the capacity of a single node by adding CPU, memory, or storage. This is the simpler way to scale: with a single data store, data consistency is easier to manage, and upgrading the node alone improves performance. But one node can only scale so far. It remains a single point of failure, and the upgrades needed to keep scaling grow increasingly expensive, so the advantages diminish as the system grows. Here are some vertical scaling techniques:

Upgrade the CPU to reduce processing time.
Upgrade to a bigger machine with more memory to handle larger data sets.
Improve storage by using SSDs instead of traditional hard drives for faster backup speed.

Note that your options for increasing CPU resources can depend on which in-memory data store you are using. For example, Redis executes commands on a single thread, so without clustering it is largely limited to a single CPU core, no matter how powerful the server machine is. Dragonfly (yes, shameless plug, but true!), running as a single process, can utilize all CPU cores in a multi-core server machine.

Horizontal scaling, or scaling out, adds more nodes to the system, distributing the data across multiple machines in clusters. This increases the potential for scalability, lightens the load of each node, and requires less expensive upgrades. The complexity of horizontal scaling and of maintaining the infrastructure for it, however, means that it may be a much more significant commitment in the long run. Here are some horizontal scaling techniques:

Use consistent hashing to partition data into shards, each assigned to a different node, to allow for parallel processing.
Implement the primary-replica topology for each partition to further balance the read load among multiple nodes.
If you are using Memcached, you'll find that it lacks built-in replication, but you can still handle clustering through client-side sharding, where the client library is responsible for data distribution. If you're using Redis or Dragonfly, you'll have access to native clustering with automatic sharding, replication, and failover.
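For readers who have not seen consistent hashing before, here is a minimal sketch of a hash ring of the kind a client library might use for client-side sharding. The class name `ConsistentHashRing`, the use of SHA-256, and the default of 100 virtual nodes per server are assumptions made for illustration, not the scheme of any specific client or of a native clustering implementation.

```python
import bisect
import hashlib
from typing import Dict, List


class ConsistentHashRing:
    """Maps keys to nodes so that adding or removing a node moves only a
    fraction of the keys instead of reshuffling all of them."""

    def __init__(self, nodes: List[str], virtual_nodes: int = 100) -> None:
        self._virtual_nodes = virtual_nodes
        self._ring: Dict[int, str] = {}      # ring position -> node name
        self._sorted_hashes: List[int] = []  # ring positions in sorted order
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # Use the first 8 bytes of SHA-256 as a 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        # Virtual nodes spread each server around the ring for better balance.
        for i in range(self._virtual_nodes):
            h = self._hash(f"{node}#{i}")
            if h in self._ring:
                continue  # extremely unlikely collision: keep the existing owner
            self._ring[h] = node
            bisect.insort(self._sorted_hashes, h)

    def remove_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            h = self._hash(f"{node}#{i}")
            if self._ring.get(h) == node:
                del self._ring[h]
                self._sorted_hashes.remove(h)

    def node_for(self, key: str) -> str:
        if not self._sorted_hashes:
            raise ValueError("ring is empty")
        # Walk clockwise to the first ring position after the key's hash.
        idx = bisect.bisect_right(self._sorted_hashes, self._hash(key))
        return self._ring[self._sorted_hashes[idx % len(self._sorted_hashes)]]
```

The property that matters for scaling out is that removing a node only remaps the keys that node owned. Keys on the remaining nodes stay where they are, so most cached data remains reachable during a topology change.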

Because both approaches have advantages and disadvantages, combining vertical and horizontal scaling often yields the most benefit, depending on your app's specific needs. Start with vertical scaling to optimize single-node performance before adding horizontal scaling to distribute and balance the load among multiple nodes.

Network Security

In addition to protecting sensitive data like PII and financial records, strong network security is necessary for data integrity (avoiding tampering and maintaining accuracy), regulatory compliance, and of course, preventing data breaches. Internal threats, such as the wrong people accessing the data stores, and external threats, such as cyber attacks and attacks on intellectual property, all require mitigation through predetermined security measures.

Best practices for network security for in-memory data stores include the following:

Regulate access to manage internal threats: role-based access control (RBAC) grants verified users only the access and permissions that their roles require. A Virtual Private Cloud (VPC) and IP whitelisting further help to restrict access and isolate resources.
Encrypt data at rest and in transit to prevent unauthorized access or interception, and implement strong encryption key management practices.
Use firewalls and VPNs to secure the network and communication channels, and segment the network to limit the spread of any attack.
Implement intrusion detection and prevention systems to detect and block attacks before they can do damage.
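As a small illustration of the IP whitelisting idea from the list above, the sketch below checks a client address against a list of permitted networks using Python's standard `ipaddress` module. The function name `is_allowed` and the example networks are hypothetical. In practice this kind of rule usually lives in a firewall or security group rather than in application code.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_networks: Iterable[str]) -> bool:
    """Return True if `client_ip` falls inside any permitted CIDR block.
    Anything not explicitly listed is denied."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network) for network in allowed_networks
    )
```

For example, with `["10.0.0.0/16", "192.168.1.10/32"]` as the permitted networks, a client at `10.0.42.7` is accepted while `203.0.113.5` is rejected. Denying by default keeps a forgotten rule from silently exposing the data store.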

Don't forget that in addition to setting up the tools and technology to secure a network, actively monitoring and logging activity is necessary for prompt risk mitigation. This involves audit logs and real-time monitoring to detect and respond to suspicious activity. Regular vulnerability assessments and staying updated with the latest security patches are also key.

Security is a deep topic, and we plan to release an article that goes into more detail, including VPC peering, in the near future, so stay tuned!

Use Cases & Considerations

Let's consider the importance of these best practices in the context of a use case. Caching, for example, requires snapshots and backups, high availability, and scalability. Regular snapshots and backups prevent data loss, ensuring that the cache can be quickly restored in case of failure. High availability through replication and failover mechanisms ensures the cache remains accessible even if some nodes go down, preventing a surge in database requests that could crash the primary database, render the system unusable, and lead to financial losses. Vertical and horizontal scaling allow the cache to handle increased loads efficiently, maintaining performance as the application grows.

On the other hand, in-memory data stores are often used for session token storage, where network security is paramount. Without proper security measures, such as encryption and access controls, session tokens can be intercepted or tampered with, leading to unauthorized access and potential data breaches. Robust network security protects user sessions and maintains trust in the application.

Conclusion

In-memory data stores can enhance performance in many ways in a complex backend system. However, to fully benefit, it's crucial to manage potential pitfalls. Focus on snapshots, backups, high availability, scalability, and security to ensure smooth and reliable performance.

Do you have any other best practice tips for in-memory data stores that we missed? We'd love to hear your thoughts on Dragonfly's Discord!