Cuckoo Filter Implementation Details #2539
Closed
jonathanc-n
started this conversation in
General
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
references #2534
Overview
The cuckoo filter is similar to the bloom filter however it is able to support deletions and is supported by Redis. Good for practical workloads. It uses cuckoo hashing as the hashing scheme.
Cuckoo Hashing
The Cuckoo Filter uses a technique called Cuckoo Hashing to decide where to place items. When an item is added, it gets hashed into one of two (or more) possible spots in a table. If both spots are already taken, it "kicks out" one of the existing items and moves it to its alternate location. This process keeps going until either an empty spot is found, or the table gets too full, which might mean it's time to resize.
Fingerprints:
Instead of storing the whole item, the Cuckoo Filter just keeps a short "fingerprint" of each one. This fingerprint is generated from the item using a hash function, which saves memory. It’s what the filter uses to check if an item is already in the set.
Advantages
Space Efficiency: The Cuckoo Filter is more space-efficient than a Bloom filter, especially for smaller sets of items, as it stores only the fingerprint rather than the full element. (however for larger sets the memory overhead for growing fingerprint sizes increases)
Support for Deletion: Unlike Bloom filters, which do not support deletion, the Cuckoo Filter allows for easy removal of elements from the filter. It is the only data structure for membership testing that uses less or equal space compared to bloom filters which supports deletions.
Lower False Positive Rate: For the same amount of memory, Cuckoo Filters tend to have lower false positive rates than Bloom filters.
Supported Commands
CF.ADD - Adds an item to a Cuckoo Filter.
CF.ADDNX - Adds an item to a Cuckoo Filter if the item did not exist previously.
CF.COUNT - Returns the number of times an item might be in a Cuckoo Filter.
CF.DEL - Deletes an item from a Cuckoo Filter.
CF.EXISTS - Checks whether one or more items exist in a Cuckoo Filter.
CF.INFO - Returns information about a Cuckoo Filter.
CF.INSERT - Adds one or more items to a Cuckoo Filter. A filter will be created if it does not exist.
CF.INSERTNX - Adds one or more items to a Cuckoo Filter if the items did not exist previously. A filter will be created if it does not exist.
CF.MEXISTS - Checks whether one or more items exist in a Cuckoo Filter.
CF.RESERVE - Creates a new Cuckoo Filter.
CF.LOADCHUNK - Restores a filter previously saved using SCANDUMP.
CF.SCANDUMP - Begins an incremental save of the Cuckoo Filter.
Implementation
Sub-Cuckoo Filters (SubCF)
Purpose: To allow the Cuckoo Filter to grow dynamically by adding multiple SubCF instances, each representing a segment of the filter with its own buckets.
Attributes:
bucket_size: Number of slots per bucket.
num_buckets: Total number of buckets in the sub-filter.
data: A vector storing the fingerprints of inserted items. Initialized with CUCKOO_NULLFP to indicate empty slots.
Main Cuckoo Filter Attributes
Capacity Management:
capacity_: Maximum number of items the filter can hold.
num_buckets_: Current number of buckets.
num_filters_: Number of SubCF instances.
Operational Parameters:
bucket_size_: Number of slots per bucket.
max_iterations_: Maximum number of displacements during insertion to prevent infinite loops.
expansion_: Factor by which the filter grows when capacity is exceeded.
Statistics:
num_items_: Current number of items.
num_deletes_: Number of deletions performed.
Serialization
Capacity will be 64 bits, bucket_size 16 bits, max iterations 16 bits, expansion, 16 bits, number of buckets 64 bits, hash function identifier 8 bits, num of elements 64 bits
the filters has two options, i will just serialize it with metadata, bucketsize 16 bits, number of buckets 64 bits, data 8 bits.
Cuckoo Index
will implement in the future
Source:
https://redis.io/docs/latest/develop/data-types/probabilistic/cuckoo-filter/
https://redis.io/docs/latest/commands/?group=cf
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf (the indexing schema that I was thinking to implement)
Beta Was this translation helpful? Give feedback.
All reactions