Introduction
In an earlier blog post, we explored using Dragonfly for caching solutions and how to avoid common pitfalls, such as cache penetration, cache breakdown, and cache avalanche. At that time, we had not yet implemented Bloom filters in Dragonfly. However, with the release of v1.17.0, this has changed—Bloom filters are now natively supported in Dragonfly. It's time to revisit this powerful data structure and see how it can be used in real-world scenarios. Let's dive in!
Bloom Filter 101 - How It Works
The Bloom filter, as originally designed by Burton Howard Bloom in 1970, is a memory-efficient probabilistic data structure. It's primarily used to determine if an element is part of a set. A unique aspect of a Bloom filter is that:
- It never yields false negatives, assuring an item is definitely not in the set.
- It allows for false positive results, indicating an item is in the set when it's not.
Elements can be added to a Bloom filter with constant time complexity as it uses hash functions. Once an element is added, it cannot be easily removed. Meanwhile, as more elements are added, the likelihood of false positives increases. As we always say, there's no silver bullet, only trade-offs. With these characteristics in mind, Bloom filters are still invaluable for applications where speed and memory efficiency are crucial and minor inaccuracies are acceptable.
Underlying Mechanism
A Bloom filter starts as a zeroed bit array of m
bits.
It uses k
hash functions, each mapping an element to one of the m
positions in the bit array.
The value of k
is chosen based on the desired error rate,
and m
is sized to accommodate the expected number of elements while balancing error rate and space.
In the illustrations below, let's use k = 3
hash functions and m = 10
bits for simplicity.
Adding an Element
To add an element to a Bloom filter, pass it through the three hash functions to find the corresponding bit positions.
Then bits at these positions are changed to 1
.
In Dragonfly, you can add element(s) to a Bloom filter using the BF.ADD
or BF.MADD
command:
dragonfly$> BF.ADD my_bloom_filter item_01
(integer) 1 # The element was added successfully.
Querying an Element
To check if an element is in a Bloom filter, pass it through the same three hash functions to find the corresponding bit positions. Interestingly, we can observe that:
- If any bit at these positions is
0
, the element is definitely not in the Bloom filter. - If all bits are
1
, the element might be in the Bloom filter, or it could be a false positive where the positions were changed to1
by other hash-colliding elements.
In Dragonfly, you can query if element(s) exist in a Bloom filter using the BF.EXISTS
or BF.MEXISTS
command:
# 'item_01' might be in the Bloom filter,
# since all bit positions are '1' in the graphic above.
dragonfly$> BF.EXISTS my_bloom_filter item_01
(integer) 1
# 'item_02' is definitely not in the Bloom filter,
# since at least one bit position is '0' in the graphic above.
dragonfly$> BF.EXISTS my_bloom_filter item_02
(integer) 0
As shown above, Bloom filters are highly memory-efficient compared to other data structures like binary search trees or hash tables, which usually store actual data items.
Instead, Bloom filters use a bit array to indicate the presence of elements, significantly reducing space requirements.
To give a sense of scale, with optimal k
and m
values, a Bloom filter only needs about 9.6 bits per element to achieve a 1% error rate.
Additionally, Bloom filters benefit from constant-time hash operations, O(k)
, for both adding an item and checking for membership.
Using Bloom Filters to Prevent Cache Penetration
As discussed in our previous blog post, cache penetration is a specific type of attack where an application is bombarded with requests using deliberately non-existent IDs, leading to cache misses and unnecessary database queries. In such cases, even caching empty results, a strategy we can employ in Dragonfly, may not be fully efficient if the volume of deceptive requests is high, ultimately wasting valuable memory resources with useless data.
With the incorporation of Bloom filters in Dragonfly, we now have an even more robust solution to prevent cache penetration. Consider an example where we're developing a microservice for managing order executions within an eCommerce platform. We can set up a Bloom filter as an allow list for order IDs. Each time an order is placed, its ID is added to the Bloom filter as well. For every incoming order query, the system first checks the Bloom filter. If the order ID isn't captured in the Bloom filter, we can confidently reject the request immediately, knowing the order doesn't exist, thus avoiding unnecessary cache and database lookups.
If the Bloom filter indicates the presence of the ID, the request proceeds to the normal flow: first checking the cache, then the database if necessary. It's important to note that because Bloom filters may yield false positives, a small number of non-existing order IDs might still pass this initial check. However, these cases are minimal, and we can still cache empty results for such IDs without significantly wasting memory. By effectively filtering out the majority of fraudulent queries with the Bloom filter, the system becomes far more efficient, reducing the overhead on both the cache and the database. And of course, we can use a single Dragonfly server as both the Bloom filter allow list and the caching layer, simplifying the overall architecture.
Using Bloom Filters to Build Email Spam Lists
In the cache penetration example above, we used a Bloom filter as an allow list to prevent unnecessary database queries. Now, let's explore another use case where Bloom filters are employed as a spam list or block list for filtering out unwanted emails.
When an incoming email arrives, it is first checked against the Bloom filter. If the sender's email address is in the Bloom filter, we can assume that the email is spam and flag it accordingly. This system operates under the assumption that the occasional wrong categorization is acceptable, emphasizing efficiency and speed.
To provide a better user experience, we should also consider the following cases:
- A legitimate email is incorrectly marked as spam, and a user wants to allow it.
- An email is not flagged by the fraud detection system, but a user wants to block it.
To address these cases, a secondary Hash
data type can be used to store emails manually categorized by a user.
This ensures that these emails bypass the spam filter check in future interactions, reassuring more accurate results.
Meanwhile, regular updates and periodic recreations of the Bloom filter are crucial for ensuring its effectiveness and accuracy,
keeping it current with new spam sources, and removing outdated entries.
This approach effectively manages email spam with minimal database queries, maintaining high system responsiveness.
Choosing the Capacity and Error Rate
Other than using the BF.ADD
or BF.MADD
command to create and add elements to a Bloom filter using default settings,
Dragonfly also provides the BF.RESERVE
command to specify the capacity
and false_positive_rate
configurations while creating a Bloom filter.
# Create a Bloom filter with:
# - A false positive rate of 0.0001.
# - A capacity of 500,000 elements.
dragonfly$> BF.RESERVE my_bloom_filter_configured 0.0001 500000
OK
Selecting the appropriate capacity ensures that the Bloom filter remains effective as the number of elements approaches this limit. If too low a capacity is chosen, the false positive rate may increase significantly. Conversely, setting the false positive rate involves balancing the tolerance for errors against performance needs. A lower false positive rate requires more bits per element and more hash functions, using more memory and processing power. Adjusting these parameters according to your specific needs and constraints helps optimize performance and accuracy.
Conclusion
Dragonfly's native support for Bloom filters empowers developers to handle advanced filtering use cases with memory efficiency. By leveraging Bloom filters, you can significantly reduce the load on your cache and database servers when facing cache penetration attacks or building spam filters. The characteristics of Bloom filters make them a valuable tool for applications where speed and memory are critical and minor inaccuracies are acceptable.
Stay updated on further developments by subscribing to our newsletter below and joining our Discord server. We look forward to seeing the innovative ways you'll use Bloom filters and other features of Dragonfly in your applications!