Measuring Cardinality in the Millions and Billions with Dragonfly
This blog post explores using Dragonfly for measuring cardinality with precision at massive scales. Focusing on `Set` for exact counts and `HyperLogLog` for accurate estimations.
February 13, 2024
Introduction
Cardinality plays a crucial role in data management, user behavior analysis, and user engagement tracking. It refers to the unique count of elements within a dataset. In the context of using an in-memory data store, the functionality of rapid cardinality measurement can significantly offload the database during high-traffic periods. In this blog post, we will explore how to utilize the Set
and the HyperLogLog
data types that Dragonfly supports to measure cardinality at massive scales.
Exact Cardinality Measurement with Set
1. Measuring Unique Page Visits
Imagine we want to track page visits by unique users on a website. The first solution that might come to mind is using the Set
data type since it naturally de-duplicates entries, ensuring that each element in the set is unique. To measure the cardinality of unique users, we can use the SADD
command:
dragonfly$> SADD page_visits:{page_id} {user_id}
When a user visits a page, we add their user ID to the set of page visits for that page. At any time, we can measure the cardinality of the set to determine the number of unique users who have visited the page.
dragonfly$> SCARD page_visits:{page_id}
Since the Set
data type tracks the number of unique elements in it, the SCARD
operation returns the exact cardinality within O(1)
time complexity. In a previous blog post, we discussed that working with BigKeys in Dragonfly is a fearless endeavor. However, there are still some best practice guidelines to follow to avoid unnecessary BigKeys. While Set
is effective for datasets of moderate size, its usability starts to diminish as we scale into the territory of billions of unique elements. The primary challenge in this context is memory usage. Storing vast numbers of unique identifiers requires a proportional amount of memory, which can become prohibitive at large scales. Additionally, operations on large sets (i.e., SDIFF
) can become less efficient due to their O(N)
nature impacting the performance of the data store and the applications relying on it. This is where HyperLogLog
comes into play. But before that, let's look at another command that Dragonfly provides.
2. Measuring Cardinality within a Time Window
Dragonfly has a unique command, SADDEX
, which allows you to add elements to a set and automatically expire them after a certain time. Note that this command is not available in Redis. It expires individual elements within the set rather than the key holding the entire set itself. Let's assume that a user spends 5 minutes on a certain page. We can use the following command to add the user ID to the set and schedule the expiration time to 5 minutes, or 300 seconds, from now:
dragonfly$> SADDEX page_visits:page_that_user_spent_5_mins_on 300 user_001
After 5 minutes, the user ID will be automatically removed from the set. And at any time, we can measure the cardinality of the set to determine the number of unique users who have visited the page within the last 5 minutes.
dragonfly$> SCARD page_visits:page_that_user_spent_5_mins_on
This is a powerful feature for tracking user engagement and user behavior within a certain time window while keeping only the necessary items in memory. Next, we will explore the HyperLogLog
data type, which is designed to measure cardinality at extreme scales.
Accurate Cardinality Estimation with HyperLogLog
HyperLogLog
is a probabilistic data structure that provides a dataset's approximate count of unique elements. It is designed to measure cardinality at massive scales while consuming a fraction of the memory that a Set
would require. Personally, I find this video to be one of the best explanations of HyperLogLog
and its underlying mechanisms. Here's a brief overview of the key characteristics of HyperLogLog
:
- A small amount of memory (~1.5KB) can count huge numbers of unique elements (10^9).
- Despite an estimation, it is very accurate, with a typical error of ~2%.
- It may slightly undercount, but it never overcounts.
- We cannot retrieve the unique elements since hash functions are involved in counting, and the original elements are not stored.
With these features in mind, let's use the page visits by unique users example, same as above, to illustrate how to use HyperLogLog
commands. We can use the PFADD
command to add elements to a HyperLogLog
data structure. Once the elements are added, the PFCOUNT
command can be used to retrieve the result.
dragonfly$> PFADD page_visits:super_busy_page user_001 user_002 user_003
(integer) 1
dragonfly$> PFCOUNT page_visits:super_busy_page
(integer) 3
Note that the time complexity of PFADD
is O(1)
for a single element added. The time complexity of PFCOUNT
is also O(1)
, no matter how many elements are added to a HyperLogLog
entry. That's it! Leveraging a complex data structure to estimate large cardinalities accurately is just that simple.
Getting Started with Dragonfly
As an in-memory data store, Dragonfly is usually used as a high-performance cache for frequently accessed data. However, with various data types and commands, it can also be used to measure cardinality at massive scales and much more. Don't wait anymore! Let's get started with Dragonfly today and use our creativity to develop different use cases and explore many other possibilities.