Introduce volatile_set implementation #3
Conversation
Yes, I believe this design can work well.
It just needs some polishing regarding naming and snake case / camel case, etc.
I quite like the prefix "vset" for volatile set. I think you should rename the struct to vset, the function prefix of all of these to vset and the file names to vset.{c,h}.
If we want to use this also for the main keyspace later (where the elements are not odd-aligned) we can for example skip single-element encoding if we try to add an element that is not odd-aligned. In that case, we can go directly to vector encoding.
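A minimal sketch of that check (the function name is illustrative, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: a pointer whose LSB is set (e.g., an sds string) can be
 * stored directly as a SINGLE bucket without being mistaken for a tagged
 * bucket pointer; anything else would go straight to vector encoding. */
static bool canUseSingleEncoding(const void *elem) {
    return ((uintptr_t)elem & 1) != 0;
}
```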
+1 I agree, I will tidy it up.
Well, it is possible.
So for objects we can still do it by passing the key sds pointer. Since we know that the object will have the expiry, we can go to the allocation start. I think the main challenge for generic keys is maintaining the updates (when the object is relocated), but I do not think that happens outside of defrag, since once the object has an expiry it will probably not be relocated. Another challenge is making the vset per slot; this would require an implementation similar to kvstore (or maybe adding it to the kvstore somehow).
@zuiderkwast Thank you sooooooo much for taking the time to review this. I will make sure to work on your comments first thing on Sunday.
2. Change pointer_vector to pVector 3. Multiple PR comment fixes Signed-off-by: Ran Shidlansik <[email protected]>
… entry expiry. This is the first change in order to reduce the vset default memory consumption. Although this complicates the API, it reduces the memory footprint of each hash object using the set. The next potential step is to make the vset a pure bucket pointer so that it will not use any extra memory. I intentionally separated these changes so that we can decide whether sacrificing API friendliness is better than consuming more memory. Signed-off-by: Ran Shidlansik <[email protected]>
zuiderkwast
left a comment
Another pass on this. I tend to get stuck on details like naming rather than reviewing the actual logic. :D
I understand. Maybe in this specific review the naming, style and the external API are the more important things. Regarding the algorithm, I think I can help focus you on several questions:
So it will internally iterate over buckets and remove entries up to max_count, or until the expiryFunc tells it to stop.
1. Fix unit test 2. Change create and delete vset functions to init and clear Signed-off-by: Ran Shidlansik <[email protected]>
zuiderkwast
left a comment
Yeah I'm starting to grasp this design.
The design attempts to keep memory access local; this means we would prefer not to access the entries' expiration times too often. Still, I chose to keep vectors sorted, since it simplified the implementation. An alternative would be to sort the vector only before attempting to split it; for the other API cases we could just search the entire vector when needed (more expensive than just accessing the minimum).
Just an idea: If you want to use unsorted vectors, use a max size of 8. This makes sure the unsorted vector fits in a cache line, so linear search is still really fast. (I guess it can even be made O(1) using AVX512.)
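To illustrate (a hedged sketch, not from the patch): with at most 8 pointers the bucket's pointer array is 64 bytes, i.e., one cache line on most CPUs, so finding the minimum is a trivial scan (the entries the pointers refer to still live elsewhere):

```c
#include <stddef.h>

typedef long long (*vsetGetExpiryFunc)(const void *entry);

#define MINI_VECTOR_MAX 8 /* 8 pointers * 8 bytes = one 64-byte cache line */

/* Linear scan for the entry with the smallest expiry; assumes count >= 1
 * and count <= MINI_VECTOR_MAX so the pointer array stays cache-resident. */
static void *findEarliest(void **entries, size_t count, vsetGetExpiryFunc getExpiry) {
    void *best = entries[0];
    long long best_exp = getExpiry(best);
    for (size_t i = 1; i < count; i++) {
        long long e = getExpiry(entries[i]);
        if (e < best_exp) {
            best_exp = e;
            best = entries[i];
        }
    }
    return best;
}
```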
```
 * Expiry Buckets and Pointer Tagging
 *-----------------------------------------------------------------------------
 *
 * Internally, the `vset` maintains a single `vsetBucket*` pointer,
```
In hashtable.{c,h} I put the API overview in the .h file and the internals overview in the .c file. Not sure if it matters...
Let me try to polish it near the end.
The PR top-comment will maybe not be kept, but I'm just thinking of the Evolution of the Bucket diagram. It makes no distinction between SINGLE and VECTOR on the top level vs. within a rax. These are quite different, and I got confused by the diagram at first. Idea below the line:

Volatile set top-level structure: NONE → SINGLE → VECTOR → RAX. If the top-level element is a RAX, it has child buckets of type SINGLE → VECTOR → HASHTABLE. Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets, which are not allowed on the top level).
Yeah, it does hint at it, since the HT is only shown as a rax bucket. But I like your style better; I will change that.
I see there are new memory leaks. I will search for them.
zuiderkwast
left a comment
Will you describe somewhere when and where the vector is sorted vs unsorted?
```c
vsetGetExpiryFunc getExpiry = vsetGetExpiryGetter();
long long ea = getExpiry(*(void **)a);
long long eb = getExpiry(*(void **)b);
return (ea > eb) - (ea < eb);
```
Nice one-line comparison.
Would `return a - b;` work too?
> Nice one-line comparison.
> Would `return a - b;` work too?
Well, not if we want to stay compatible with returning [-1, 0, 1].
It's not required IIUC. From man qsort:
> The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second.
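One caveat worth noting here: qsort comparators return int, while the expiries are long long, so a plain subtraction can truncate the 64-bit difference and misorder entries (on typical two's-complement platforms); the branchless form avoids that. A quick demonstration of the hazard:

```c
#include <stdio.h>

int main(void) {
    long long ea = 0, eb = 4294967296LL; /* eb - ea == 2^32 */
    int sub = (int)(ea - eb);            /* -2^32 truncated to int is 0: "equal" */
    int cmp = (ea > eb) - (ea < eb);     /* -1: correctly "less than" */
    printf("%d %d\n", sub, cmp);         /* prints: 0 -1 */
    return 0;
}
```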
I guess it is possible
Sure.
…y-io#2257)

**Current state**

During `hashtableScanDefrag`, rehashing is paused to prevent entries from moving, but the scan callback can still delete entries, which triggers `hashtableShrinkIfNeeded`. For example, the `expireScanCallback` can delete expired entries.

**Issue**

This can cause the table to be resized and the old memory to be freed while the scan is still accessing it, resulting in the following memory access violation:

```
[err]: Sanitizer error:
=================================================================
==46774==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000003100 at pc 0x0000004704d3 bp 0x7fffcb062000 sp 0x7fffcb061ff0
READ of size 1 at 0x611000003100 thread T0
    #0 0x4704d2 in isPositionFilled /home/gusakovy/Projects/valkey/src/hashtable.c:422
    #1 0x478b45 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1768
    #2 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #3 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #4 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #5 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #6 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #7 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #8 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #9 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #10 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #11 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)
    #12 0x452e39 in _start (/local/home/gusakovy/Projects/valkey/src/valkey-server+0x452e39)

0x611000003100 is located 0 bytes inside of 256-byte region [0x611000003100,0x611000003200)
freed by thread T0 here:
    #0 0x7f471a34a1e5 in __interceptor_free (/lib64/libasan.so.4+0xd81e5)
    #1 0x4aefbc in zfree_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:400
    #2 0x4aeff5 in valkey_free /home/gusakovy/Projects/valkey/src/zmalloc.c:415
    #3 0x4707d2 in rehashingCompleted /home/gusakovy/Projects/valkey/src/hashtable.c:456
    #4 0x471b5b in resize /home/gusakovy/Projects/valkey/src/hashtable.c:656
    #5 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #6 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #7 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #8 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #9 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #10 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #11 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #12 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #13 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #14 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #15 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #16 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #17 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #18 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #19 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #20 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #21 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #22 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #23 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #24 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

previously allocated by thread T0 here:
    #0 0x7f471a34a753 in __interceptor_calloc (/lib64/libasan.so.4+0xd8753)
    #1 0x4ae48c in ztrycalloc_usable_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:214
    #2 0x4ae757 in valkey_calloc /home/gusakovy/Projects/valkey/src/zmalloc.c:257
    #3 0x4718fc in resize /home/gusakovy/Projects/valkey/src/hashtable.c:645
    #4 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #5 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #6 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #7 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #8 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #9 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #10 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #11 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #12 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #13 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #14 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #15 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #16 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #17 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #18 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #19 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #20 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #21 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #22 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #23 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

SUMMARY: AddressSanitizer: heap-use-after-free /home/gusakovy/Projects/valkey/src/hashtable.c:422 in isPositionFilled
Shadow bytes around the buggy address:
  0x0c227fff85d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85f0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8610: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c227fff8620:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8630: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8640: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8650: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8660: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8670: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==46774==ABORTING
```

**Solution**

The suggested solution is to also pause auto shrinking during `hashtableScanDefrag`. I noticed that there was already a `hashtablePauseAutoShrink` method and a `pause_auto_shrink` counter, but it wasn't actually used in `hashtableShrinkIfNeeded`, so I fixed that.

**Testing**

I created a simple tcl test that (most of the time) triggers this error, but it's a little clunky so I didn't add it as part of the PR:

```
start_server {tags {"expire hashtable defrag"}} {
    test {hashtable scan defrag on expiry} {
        r config set hz 100
        set num_keys 20
        for {set i 0} {$i < $num_keys} {incr i} {
            r set "key_$i" "value_$i"
        }
        for {set j 0} {$j < 50} {incr j} {
            set expire_keys 100
            for {set i 0} {$i < $expire_keys} {incr i} {
                # Short expiry time to ensure they expire quickly
                r psetex "expire_key_${i}_${j}" 100 "expire_value_${i}_${j}"
            }
            # Verify keys are set
            set initial_size [r dbsize]
            assert_equal $initial_size [expr $num_keys + $expire_keys]
            after 150
            for {set i 0} {$i < 10} {incr i} {
                r get "expire_key_${i}_${j}"
                after 10
            }
        }
        set remaining_keys [r dbsize]
        assert_equal $remaining_keys $num_keys
        # Verify server is still responsive
        assert_equal [r ping] {PONG}
    } {}
}
```

Compiling with ASAN using `make noopt SANITIZER=address valkey-server` and running the test causes the error above. Applying the fix resolves the issue.

Signed-off-by: Yakov Gusakov <[email protected]>
…-volatile-set Signed-off-by: Ran Shidlansik <[email protected]>
After the changes to vset, we need to check if the set is empty and not check its pointer. Signed-off-by: Ran Shidlansik <[email protected]>
Nasty bug. Should be fixed now.
Note for the reviewers: I think in the current implementation vsetUpdateEntry is very sub-optimal. It always removes and re-inserts the entry. I think this performs badly in the case of defrag and also in cases where the entry is not switching buckets. I will update the code with a change for this.
Done.
So this PR was accidentally merged to the origin branch. I reverted these changes, but I cannot reopen the PR, and most of the comments here are cleared. Let's move the work to #5.
Overview:
---------
This PR introduces a complete redesign of the 'vset' (short for volatile set) data structure,
creating an adaptive container for expiring entries. The new design is
memory-efficient and scalable, and it dynamically promotes/demotes its internal
representation depending on runtime behavior and volume.
The core concept uses a single tagged pointer (`expiry_buckets`) that encodes
one of several internal structures:
- NONE (-1): Empty set
- SINGLE (0x1): One entry
- VECTOR (0x2): Sorted vector of entry pointers
- HT (0x4): Hash table for larger buckets with many entries
- RAX (0x6): Radix tree (keyed by aligned expiry timestamps)
This allows the set to grow and shrink seamlessly while optimizing for both
space and performance.
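As a sketch of the mechanism (tag values per the list above; macro names other than VSET_PTR_MASK are assumed, not taken from the patch):

```c
#include <stdint.h>

/* The lower 3 bits of the expiry_buckets word select the bucket type. */
#define VSET_TAG_MASK ((uintptr_t)0x7)
#define VSET_PTR_MASK (~VSET_TAG_MASK)

#define VSET_TAG_SINGLE ((uintptr_t)0x1)
#define VSET_TAG_VECTOR ((uintptr_t)0x2)
#define VSET_TAG_HT     ((uintptr_t)0x4)
#define VSET_TAG_RAX    ((uintptr_t)0x6)

/* Extract the type tag from the tagged pointer. */
static inline uintptr_t bucketTag(uintptr_t expiry_buckets) {
    return expiry_buckets & VSET_TAG_MASK;
}

/* Strip the tag to recover the underlying bucket pointer. */
static inline void *bucketPtr(uintptr_t expiry_buckets) {
    return (void *)(expiry_buckets & VSET_PTR_MASK);
}
```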
Motivation:
-----------
The previous design lacked flexibility in high-churn environments or
workloads with skewed expiry distributions. This redesign enables dynamic
layout adjustment based on the time distribution and volume of the inserted
entries, while maintaining fast expiry checks and minimal memory overhead.
Key Concepts:
-------------
- All pointers stored in the structure must be odd-aligned to preserve
3 bits for tagging. This is safe with SDS strings (which set the LSB).
- Buckets evolve automatically:
- Start as NONE.
- On first insert → become SINGLE.
- If another entry with similar expiry → promote to VECTOR.
- If VECTOR exceeds 127 entries → convert to RAX.
- If a RAX bucket's vector fills and cannot split → promote to HT.
- Each vector bucket is kept sorted by `entry->getExpiry()`.
- Binary search is used for efficient insertion and splitting (see the sketch below).
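For instance, the sorted-insert position could be found like this (a sketch; the real vector bucket layout is not shown here):

```c
#include <stddef.h>
#include <string.h>

typedef long long (*vsetGetExpiryFunc)(const void *entry);

/* Lower-bound binary search over a vector of entry pointers kept sorted
 * by expiry; returns the index where a new entry should be inserted. */
static size_t vectorInsertPos(void **entries, size_t count,
                              vsetGetExpiryFunc getExpiry, long long expiry) {
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (getExpiry(entries[mid]) < expiry)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Shift the tail and place the new entry, keeping the vector sorted.
 * Assumes the caller has ensured capacity for one more pointer. */
static void vectorInsert(void **entries, size_t *count,
                         vsetGetExpiryFunc getExpiry, void *entry) {
    size_t pos = vectorInsertPos(entries, *count, getExpiry, getExpiry(entry));
    memmove(&entries[pos + 1], &entries[pos], (*count - pos) * sizeof(void *));
    entries[pos] = entry;
    (*count)++;
}
```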
# Coarse Buckets Expiration System for Hash Fields
This PR introduces **coarse-grained expiration buckets** to support per-field
expirations in hash types — a feature known as *volatile fields*.
It enables scalable expiration tracking by grouping fields into time-aligned
buckets instead of individually tracking exact timestamps.
## Motivation
Valkey traditionally supports key-level expiration. However, in many applications,
there's a strong need to expire individual fields within a hash (e.g., session keys,
token caches, etc.).
Tracking these at fine granularity is expensive and potentially unscalable, so
this implementation introduces *bucketed expirations* to batch expirations together.
## Bucket Granularity and Timestamp Handling
- Each expiration bucket represents a time slice of fixed width (e.g., 8192 ms).
- Expiring fields are mapped to the **end** of a time slice (not the floor).
- This design facilitates:
- Efficient *splitting* of large buckets when needed
- *Downgrading* buckets when fields permit tighter packing
- Coalescing during lazy cleanup or memory pressure
### Example Calculation
Suppose a field has an expiration time of `1690000123456` ms and the max bucket
interval is 8192 ms:
```
BUCKET_INTERVAL_MAX = 8192;
expiry = 1690000123456;
bucket_ts = (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
= (1690000123456 & ~8191) + 8192
          = 1690000121856 + 8192
          = 1690000130048
```
The field is stored in a bucket that **ends at** `1690000130048` ms.
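The same mapping as a small C helper (a sketch; `get_bucket_ts` is the name the splitting section below uses for this operation):

```c
#include <assert.h>

#define BUCKET_INTERVAL_MAX 8192LL /* ms; must be a power of two */

/* Round an expiry up to the END of its time slice. */
static long long get_bucket_ts(long long expiry) {
    return (expiry & ~(BUCKET_INTERVAL_MAX - 1)) + BUCKET_INTERVAL_MAX;
}

int main(void) {
    assert(get_bucket_ts(1690000123456LL) == 1690000130048LL);
    return 0;
}
```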
### Bucket Alignment Diagram
```
Time (ms) →
|----------------|----------------|----------------|
8192 ms buckets →  1690000121856    1690000130048
                   ^                ^
                   |                |
              expiry floor   assigned bucket end
```
## Bucket Placement Logic
- If a suitable bucket **already exists** (i.e., its `end_ts > expiry`), the field is added.
- If no bucket covers the `expiry`, a **new bucket** is created at the computed `end_ts`.
## Bucket Downgrade Conditions
Buckets are downgraded to smaller intervals when overpopulated (>127 fields).
This happens when **all fields fit into a tighter bucket**.
Downgrade rule:
```
(max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN < current_bucket_ts
```
If the above holds, all fields can be moved to a tighter bucket interval.
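In code, the check might look like this (a sketch; the 16 ms minimum interval is an assumption for illustration):

```c
#define BUCKET_INTERVAL_MIN 16LL /* ms; assumed minimum slice width */

/* max_expiry is the latest expiry among the bucket's fields. If even it
 * maps into a tighter slice that ends before the current bucket does,
 * every field fits the tighter bucket and we can downgrade. */
static int canDowngrade(long long max_expiry, long long current_bucket_ts) {
    long long tight_ts = (max_expiry & ~(BUCKET_INTERVAL_MIN - 1)) + BUCKET_INTERVAL_MIN;
    return tight_ts < current_bucket_ts;
}
```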
### Downgrade Bucket — Diagram
```
Before downgrade:
Current Bucket (8192 ms)
|----------------------------------------|
| Field A | Field B | Field C | Field D |
| exp=+30 | +200 | +500 | +1500 |
|----------------------------------------|
↑
All expiries fall before tighter boundary
After downgrade to 1024 ms:
New Bucket (1024 ms)
|------------------|
| A | B | C | D |
|------------------|
```
### Bucket Split Strategy
If downgrade is not possible, the bucket is **split**:
- Fields are sorted by expiration time.
- A subset that fits in an earlier bucket is moved out.
- Remaining fields stay in the original bucket.
### Split Bucket — Diagram
```
Before split:
Large Bucket (8192 ms)
|--------------------------------------------------|
| A | B | C | D | E | F | G | H | I | J | ... | Z |
|---------------- Sorted by expiry ---------------|
↑
Fields A–L can be moved to an earlier bucket
After split:
Bucket 1 (end=1690000129024)   Bucket 2 (end=1690000130048)
|------------------------| |------------------------|
| A | B | C | ... | L | | M | N | O | ... | Z |
|------------------------| |------------------------|
```
## Summary of Bucket Behavior
| Scenario | Action Taken |
|--------------------------------|------------------------------|
| No bucket covers expiry | New bucket is created |
| Existing bucket fits | Field is added |
| Bucket overflows (>127 fields) | Downgrade or split attempted |
API Changes:
------------
Create/Free:
void vsetInit(vset *set);
void vsetClear(vset *set);
Mutation:
bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
void *new_entry, long long old_expiry,
long long new_expiry);
Expiry Retrieval:
long long vsetEstimatedEarliestExpiry(vset *set, vsetGetExpiryFunc getExpiry);
size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc, mstime_t now, size_t max_count, void *ctx);
Utilities:
bool vsetIsEmpty(vset *set);
size_t vsetMemUsage(vset *set);
Iteration:
void vsetStart(vset *set, vsetIterator *it);
bool vsetNext(vsetIterator *it, void **entryptr);
void vsetStop(vsetIterator *it);
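A hedged end-to-end usage sketch: the entry layout, the exact vsetExpiryFunc signature, and the header name are assumptions here, not taken from the patch; only the vset function names above come from the PR. Note that real entries must be odd-aligned (e.g., sds strings), which this toy struct does not guarantee.

```c
#include <stdbool.h>
#include <stdlib.h>
#include "vset.h" /* hypothetical header, per the naming suggested in review */

typedef struct myEntry { long long expiry; } myEntry; /* illustrative only */

static long long myGetExpiry(const void *entry) {
    return ((const myEntry *)entry)->expiry;
}

/* Assumed shape of vsetExpiryFunc: invoked per expired entry; returning
 * false would ask vsetPopExpired to stop early. */
static bool myExpire(void *entry, void *ctx) {
    (void)ctx;
    free(entry);
    return true;
}

void example(long long now_ms) {
    vset set;
    vsetInit(&set);

    myEntry *e = malloc(sizeof(*e)); /* NOTE: real entries must be odd-aligned */
    e->expiry = now_ms + 1000;
    vsetAddEntry(&set, myGetExpiry, e);

    /* Remove up to 64 entries whose expiry is <= now_ms. */
    vsetPopExpired(&set, myGetExpiry, myExpire, now_ms, 64, NULL);

    vsetClear(&set);
}
```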
Entry Requirements:
-------------------
All entries must conform to the following interface via `volatileEntryType`:
sds entryGetKey(const void *entry); // for deduplication
long long getExpiry(const void *entry); // used for bucketing
int expire(void *db, void *o, void *entry); // used for expiration callbacks
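For illustration, a conforming type might be wired up as below; the struct layout is an assumption, and only the three callback roles come from the interface above:

```c
typedef char *sds; /* stand-in for the real sds typedef */

typedef struct volatileEntryType {
    sds (*entryGetKey)(const void *entry);         /* for deduplication */
    long long (*getExpiry)(const void *entry);     /* used for bucketing */
    int (*expire)(void *db, void *o, void *entry); /* expiration callback */
} volatileEntryType;
```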
Diagrams:
---------
1. Tagged Pointer Representation
-----------------------------
Lower 3 bits of `expiry_buckets` encode bucket type:
+------------------------------+
| pointer | TAG (3b) |
+------------------------------+
↑
masked via VSET_PTR_MASK
TAG values:
0x1 → SINGLE
0x2 → VECTOR
0x4 → HT
0x6 → RAX
2. Evolution of the Bucket
------------------------
*Volatile set top-level structure:*
```
+--------+ +--------+ +--------+ +--------+
| NONE | --> | SINGLE | --> | VECTOR | --> | RAX |
+--------+ +--------+ +--------+ +--------+
```
*If the top-level element is a RAX, it has child buckets of type:*
```
+--------+ +--------+ +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+ +--------+ +-----------+
```
*Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).*
3. RAX Structure with Expiry-Aligned Keys
--------------------------------------
Buckets in RAX are indexed by aligned expiry timestamps:
+------------------------------+
| RAX key (bucket_ts) → Bucket|
+------------------------------+
| 0x00000020 → VECTOR |
| 0x00000040 → VECTOR |
| 0x00000060 → HT |
+------------------------------+
4. Bucket Splitting (Inside RAX)
-----------------------------
If a vector bucket in a RAX fills:
- Binary search for best split point.
- Use `getExpiry(entry)` + `get_bucket_ts()` to find transition.
- Create 2 new buckets and update RAX.
Original:
[entry1, entry2, ..., entryN] ← bucket_ts = 64ms
After split:
[entry1, ..., entryK] → bucket_ts = 32ms
[entryK+1, ..., entryN] → bucket_ts = 64ms
If all entries share same bucket_ts → promote to HT.
5. Shrinking Behavior
------------------
On deletion:
- HT may shrink to VECTOR.
- VECTOR with 1 item → becomes SINGLE.
- If RAX has only one key left, it’s promoted up.
Summary:
--------
This redesign provides:
✓ Fine-grained memory control
✓ High scalability for bursty TTL data
✓ Fast expiry checks via windowed organization
✓ Minimal overhead for sparse sets
✓ Flexible binary-search-based sorting and bucketing
It also lays the groundwork for future enhancements, including metrics,
prioritized expiry policies, or segmented cleaning.
Signed-off-by: Ran Shidlansik <[email protected]>
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL subcommands. The intended behavior was to pick the last value of each filter. However, we introduced a memory leak for all the preceding filter values.

Before this change:

```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```

Leak:

```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform filtering, whereas for the remaining filters we only pick the last filter option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>