
Conversation

@ranshid
Owner

@ranshid ranshid commented Jun 26, 2025

Overview:

This PR introduces a complete redesign of the volatile_set data structure,
creating an adaptive container for expiring entries. The new design is
memory-efficient, scalable, and dynamically promotes/demotes its internal
representation depending on runtime behavior and volume.

The core concept uses a single tagged pointer (expiry_buckets) that encodes
one of several internal structures:
- NONE (-1): Empty set
- SINGLE (0x1): One entry
- VECTOR (0x2): Sorted vector of entry pointers
- HT (0x4): Hash table for larger buckets with many entries
- RAX (0x6): Radix tree (keyed by aligned expiry timestamps)

This allows the set to grow and shrink seamlessly while optimizing for both
space and performance.
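
For illustration, decoding the tagged pointer could look like the sketch
below. This is not the PR's code: the helper names are invented, and it
assumes container pointers (VECTOR/HT/RAX) are at least 8-byte aligned
while SINGLE stores the odd entry pointer directly (see the alignment
note under Key Concepts):

```c
#include <stdint.h>

#define VSET_TAG_BITS ((uintptr_t)0x7) /* low 3 bits hold the tag */
#define VSET_PTR_MASK (~VSET_TAG_BITS) /* clears the tag bits */

typedef enum { VSET_NONE, VSET_SINGLE, VSET_VECTOR, VSET_HT, VSET_RAX } vsetType;

static vsetType vset_type_of(uintptr_t expiry_buckets) {
    if (expiry_buckets == (uintptr_t)-1) return VSET_NONE; /* NONE is -1 */
    if (expiry_buckets & 0x1) return VSET_SINGLE;          /* odd entry pointer */
    switch (expiry_buckets & VSET_TAG_BITS) {
    case 0x2: return VSET_VECTOR;
    case 0x4: return VSET_HT;
    default: return VSET_RAX; /* 0x6; other bit patterns unused */
    }
}

static void *vset_ptr_of(uintptr_t expiry_buckets) {
    if (expiry_buckets & 0x1) return (void *)expiry_buckets; /* SINGLE: stored as-is */
    return (void *)(expiry_buckets & VSET_PTR_MASK);
}
```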

Motivation:

The previous design lacked flexibility in high-churn environments or
workloads with skewed expiry distributions. This redesign enables dynamic
layout adjustment based on the time distribution and volume of the inserted
entries, while maintaining fast expiry checks and minimal memory overhead.

Key Concepts:

  • All pointers stored in the structure must be odd-aligned to preserve
    3 bits for tagging. This is safe with SDS strings (which set the LSB).

  • Buckets evolve automatically:

    • Start as NONE.
    • On first insert → become SINGLE.
    • If another entry with similar expiry → promote to VECTOR.
    • If VECTOR exceeds 127 entries → convert to RAX.
    • If a RAX bucket's vector fills and cannot split → promote to HT.
  • Each vector bucket is kept sorted by entry->getExpiry().

  • Binary search is used for efficient insertion and splitting.

  • Expiry window alignment is managed by:
    VOLATILESET_BUCKET_INTERVAL_MIN = 16ms
    VOLATILESET_BUCKET_INTERVAL_MAX = 8192ms
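
To make the window alignment concrete, here is a small sketch of mapping
an expiry to its window (the formula matches the bucket-end calculation
shown later in this thread; the exact shape of get_bucket_ts(), named in
the splitting notes below, is my assumption):

```c
#include <stdint.h>

typedef int64_t mstime_t;

#define VOLATILESET_BUCKET_INTERVAL_MIN 16LL   /* ms */
#define VOLATILESET_BUCKET_INTERVAL_MAX 8192LL /* ms */

/* Aligns an expiry to the *end* of its enclosing window. Intervals are
 * powers of two, so the floor is a simple mask. */
static mstime_t get_bucket_ts(mstime_t expiry, mstime_t interval) {
    return (expiry & ~(interval - 1)) + interval;
}
```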

API Changes:

Create/Free:
void vsetInit(vset *set);
void vsetClear(vset *set);

Mutation:
bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
                     void *new_entry, long long old_expiry,
                     long long new_expiry);

Expiry Retrieval:
void *vsetFirstExpired(vset *set, vsetGetExpiryFunc getExpiry, mstime_t now);
void *vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, mstime_t now);

Utilities:
bool vsetIsEmpty(vset *set);

Iteration:
void vsetStart(vset *set, vsetIterator *it);
bool vsetNext(vsetIterator *it, void **entryptr);
void vsetStop(vsetIterator *it);

Entry Requirements:

All entries must conform to the following interface via volatileEntryType:

   sds entryGetKey(const void *entry);          // for deduplication
   long long getExpiry(const void *entry);      // used for bucketing
   int expire(void *db, void *o, void *entry);  // used for expiration callbacks
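
A plausible shape for this interface as a function table (a sketch: only
the three signatures above are from the PR; the struct layout and the sds
typedef are assumptions):

```c
typedef char *sds; /* as in sds.h */

typedef struct volatileEntryType {
    sds (*entryGetKey)(const void *entry);         /* for deduplication */
    long long (*getExpiry)(const void *entry);     /* used for bucketing */
    int (*expire)(void *db, void *o, void *entry); /* expiration callback */
} volatileEntryType;
```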

Diagrams:

  1. Tagged Pointer Representation

    Lower 3 bits of expiry_buckets encode bucket type:

    +------------------------------+
    |     pointer       | TAG (3b) |
    +------------------------------+

    masked via VSET_PTR_MASK

    TAG values:
    0x1 → SINGLE
    0x2 → VECTOR
    0x4 → HT
    0x6 → RAX

  2. Evolution of the Bucket

Volatile set top-level structure:

+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+

If the top-level element is a RAX, it has child buckets of type:

+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+

Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).

  3. RAX Structure with Expiry-Aligned Keys

    Buckets in RAX are indexed by aligned expiry timestamps:

    +------------------------------+
    | RAX key (bucket_ts) → Bucket |
    +------------------------------+
    |     0x00000020  → VECTOR     |
    |     0x00000040  → VECTOR     |
    |     0x00000060  → HT         |
    +------------------------------+

  4. Bucket Splitting (Inside RAX)

    If a vector bucket in a RAX fills:

    • Binary search for best split point (see the sketch after this list).
    • Use getExpiry(entry) + get_bucket_ts() to find transition.
    • Create 2 new buckets and update RAX.

    Original:
    [entry1, entry2, ..., entryN] ← bucket_ts = 64ms

    After split:
    [entry1, ..., entryK] → bucket_ts = 32ms
    [entryK+1, ..., entryN] → bucket_ts = 64ms

    If all entries share same bucket_ts → promote to HT.

  5. Shrinking Behavior

    On deletion:

    • HT may shrink to VECTOR.
    • VECTOR with 1 item → becomes SINGLE.
    • If RAX has only one key left, it’s promoted up.
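
As referenced in the splitting notes above, the binary search for a split
point might look like this sketch (the vsetGetExpiryFunc shape and the
exact split rule are my assumptions; get_bucket_ts() is the alignment
sketch from earlier):

```c
#include <stddef.h>

typedef long long (*vsetGetExpiryFunc)(const void *entry); /* assumed shape */

/* Entries are sorted by expiry. Returns how many leading entries map to a
 * bucket key below `bucket_ts` at the tighter interval; those can move to
 * an earlier bucket. The predicate is monotone, so binary search applies. */
static size_t find_split_point(void **entries, size_t n,
                               vsetGetExpiryFunc getExpiry,
                               mstime_t tighter_interval, mstime_t bucket_ts) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (get_bucket_ts(getExpiry(entries[mid]), tighter_interval) < bucket_ts)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo; /* 0 or n means no useful split at this interval */
}
```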

Summary:

This redesign provides:
✓ Fine-grained memory control
✓ High scalability for bursty TTL data
✓ Fast expiry checks via windowed organization
✓ Minimal overhead for sparse sets
✓ Flexible binary-search-based sorting and bucketing

It also lays the groundwork for future enhancements, including metrics,
prioritized expiry policies, or segmented cleaning.

ranshid added 4 commits June 26, 2025 19:27
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>

@zuiderkwast zuiderkwast left a comment


Yes, I believe this design can work well.

It just needs some polishing regarding naming and snake case / camel case, etc.

I quite like the prefix "vset" for volatile set. I think you should rename the struct to vset, the function prefix of all of these to vset and the file names to vset.{c,h}.

If we want to use this also for the main keyspace later (where the elements are not odd-aligned) we can for example skip single-element encoding if we try to add an element that is not odd-aligned. In that case, we can go directly to vector encoding.

Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid
Owner Author

ranshid commented Jun 27, 2025

Yes, I believe this design can work well.

It just needs some polishing regarding naming and snake case / camel case, etc.

+1 I agree, I will put it in order.

I quite like the prefix "vset" for volatile set. I think you should rename the struct to vset, the function prefix of all of these to vset and the file names to vset.{c,h}.

Well, it is possible.

If we want to use this also for the main keyspace later (where the elements are not odd-aligned) we can for example skip single-element encoding if we try to add an element that is not odd-aligned. In that case, we can go directly to vector encoding.

So for objects we can still do it by passing the key sds pointer. Since we know that the object will have the expiry, we can go to the allocation start. I think the main challenge for generic keys is maintaining the updates when the object is relocated, but I do not think that happens outside of defrag, since once the object has an expiry it will probably not be relocated. Another challenge is making the vset per slot; this will require an implementation similar to kvstore (or maybe adding it to the kvstore somehow).
Regarding your suggestion to go directly to vector: it might work, but for now let's keep this restriction.

@ranshid
Owner Author

ranshid commented Jun 27, 2025

@zuiderkwast Thank you sooooooo much for taking the time to review this. I will make sure to work on your comment first thing on Sunday

ranshid added 8 commits June 27, 2025 11:25
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
2. change pointer_vector to pVector
3. multiple PR comment fixes

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
… entry expiry.

This is the first change aimed at reducing the vset's default memory consumption.
Although this complicates the API, it allows reducing the memory footprint of
each hash object using the set.

The next potential step is to make the vset a pure bucket pointer so that it will not use any extra memory.

I intentionally separated these changes so that we can decide whether sacrificing
API friendliness is better than consuming more memory.

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
@xbasel xbasel self-requested a review June 29, 2025 17:56
Signed-off-by: Ran Shidlansik <[email protected]>

@zuiderkwast zuiderkwast left a comment


Another pass on this. I tend to get stuck on details like naming rather than reviewing the actual logic. :D

@ranshid
Owner Author

ranshid commented Jun 30, 2025

Another pass on this. I tend to get stuck on details like naming rather than reviewing the actual logic. :D

I understand. Maybe in this specific review the naming, style, and external API are the more important things.

Regarding the algorithm, I think I can help focus you on several questions:

  1. The design attempts to keep memory access local; this means we would prefer not to access the entries' expiration times too often. Still, I chose to keep vectors sorted, since it simplified the implementation. An alternative would be to sort the vector only before attempting to split it; for the other API cases we could just search the entire vector when needed (more expensive than just accessing the minimum).
  2. The entire pop/first API was created because the rax iterator is potentially invalidated by mutative actions (like remove and add). Still, I think the pop should be improved: the API should be able to accept a pop function (i.e. hashTypeExpireEntry in our case) and a max_count argument like:
size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc, mstime_t now, size_t count);

so it will internally iterate over buckets and remove entries up to max_count OR until the expiryFunc tells it to stop.
Why? Because when expiring entries we would also like to potentially remove the containing HASH object (so the pointer to the set is lost) and perform other mutative actions. This is, however, more about improving activeExpiration performance, so I decided to hold off on this change until this PR is up in the air.
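
For illustration, usage of such a callback-driven pop could look like this
sketch (the vsetExpiryFunc shape is my assumption based on the proposal
above; a later commit message in this thread also adds a void *ctx
parameter):

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed callback shape: return true to keep expiring, false to stop early. */
typedef bool (*vsetExpiryFunc)(void *entry);

static bool expireHashField(void *entry) {
    /* e.g. call hashTypeExpireEntry() here; it may delete the containing
     * hash object, which is why the set must drive the iteration itself. */
    return true;
}

static void activeExpireSome(vset *set, vsetGetExpiryFunc getExpiry, mstime_t now) {
    /* Pop at most 64 entries that are already expired at `now`. */
    size_t expired = vsetPopExpired(set, getExpiry, expireHashField, now, 64);
    (void)expired;
}
```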

ranshid added 4 commits June 30, 2025 11:11
Signed-off-by: Ran Shidlansik <[email protected]>
1. Fix unittest
2. change create and delete vset functions to init and clear

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>

@zuiderkwast zuiderkwast left a comment


Yeah I'm starting to grasp this design.

The design attempts to keep memory access local; this means we would prefer not to access the entries' expiration times too often. Still, I chose to keep vectors sorted, since it simplified the implementation. An alternative would be to sort the vector only before attempting to split it; for the other API cases we could just search the entire vector when needed (more expensive than just accessing the minimum).

Just an idea: If you want to use unsorted vectors, use a max size of 8. This makes sure the unsorted vector fits in a cache line, so linear search is still really fast. (I guess it can even be made O(1) using AVX512.)
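
To sketch that idea (sizes and names here are mine, not from the PR): eight
8-byte entry pointers occupy exactly one 64-byte cache line, so a linear
scan for the minimum expiry touches a single line:

```c
#include <stdint.h>

#define SMALL_VEC_MAX 8 /* 8 x 8-byte pointers = one 64-byte cache line */

typedef long long (*vsetGetExpiryFunc)(const void *entry); /* assumed shape */

typedef struct {
    uint8_t count;
    void *entries[SMALL_VEC_MAX];
} smallVec;

static void *smallVecMin(const smallVec *v, vsetGetExpiryFunc getExpiry) {
    void *best = NULL;
    for (uint8_t i = 0; i < v->count; i++)
        if (best == NULL || getExpiry(v->entries[i]) < getExpiry(best))
            best = v->entries[i];
    return best;
}
```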

* Expiry Buckets and Pointer Tagging
*-----------------------------------------------------------------------------
*
* Internally, the `vset` maintains a single `vsetBucket*` pointer,


In hashtable.{c,h} I put the API overview in the .h file and the internals overview in the .c file. Not sure if it matters...

Owner Author


Let me try to polish it towards the end.

@zuiderkwast

The PR top-comment will maybe not be kept, but I'm just thinking of the Evolution of the Bucket diagram. It makes no distinction between SINGLE and VECTOR at the top level vs. within a rax. These are quite different, and I got confused by the diagram at first.

Idea below the line:


Volatile set top-level structure:

+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+

If the top-level element is a RAX, it has child buckets of type:

+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+

Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).

@ranshid
Owner Author

ranshid commented Jun 30, 2025

The PR top-comment will maybe not be kept, but I'm just thinking of the Evolution of the Bucket diagram. It makes no distinction between SINGLE and VECTOR at the top level vs. within a rax. These are quite different, and I got confused by the diagram at first.

Idea below the line:

Volatile set top-level structure:

+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+

If the top-level element is a RAX, it has child buckets of type:

+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+

Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).

Yeah, it does hint at it, since the HT is only shown as a rax bucket. But I like your style better; I will change that.

@ranshid
Owner Author

ranshid commented Jun 30, 2025

I see there are new memory leaks. Will search for them.

ranshid added 2 commits June 30, 2025 19:04
Signed-off-by: Ran Shidlansik <[email protected]>

@zuiderkwast zuiderkwast left a comment


Will you describe somewhere when and where the vector is sorted vs unsorted?

vsetGetExpiryFunc getExpiry = vsetGetExpiryGetter();
long long ea = getExpiry(*(void **)a);
long long eb = getExpiry(*(void **)b);
return (ea > eb) - (ea < eb);


Nice one-line comparison.

Would return a - b; work too?

Owner Author


Nice one-line comparison.

Would return a - b; work too?

Well, not if we want to be compatible with returning [-1, 0, 1].


It's not required IIUC. From man qsort:

The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second.

Owner Author


I guess it is possible
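
One caveat worth recording here: the comparator returns int, so a plain
long long subtraction gets truncated and can report the wrong order, while
the branchless form cannot. For example:

```c
long long ea = 0, eb = 1LL << 32;  /* clearly ea < eb */
int bad = (int)(ea - eb);          /* truncated to 0: reports "equal" (wrong) */
int good = (ea > eb) - (ea < eb);  /* -1, the correct ordering */
```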

@ranshid
Owner Author

ranshid commented Jul 1, 2025

Will you describe somewhere when and where the vector is sorted vs unsorted?

Sure.

ranshid pushed a commit that referenced this pull request Jul 1, 2025
…y-io#2257)

**Current state**
During `hashtableScanDefrag`, rehashing is paused to prevent entries
from moving, but the scan callback can still delete entries which
triggers `hashtableShrinkIfNeeded`. For example, the
`expireScanCallback` can delete expired entries.

**Issue**
This can cause the table to be resized and the old memory to be freed
while the scan is still accessing it, resulting in the following memory
access violation:

```
[err]: Sanitizer error: =================================================================
==46774==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000003100 at pc 0x0000004704d3 bp 0x7fffcb062000 sp 0x7fffcb061ff0
READ of size 1 at 0x611000003100 thread T0
    #0 0x4704d2 in isPositionFilled /home/gusakovy/Projects/valkey/src/hashtable.c:422
    #1 0x478b45 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1768
    #2 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #3 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #4 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #5 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #6 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #7 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #8 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #9 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #10 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #11 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)
    #12 0x452e39 in _start (/local/home/gusakovy/Projects/valkey/src/valkey-server+0x452e39)

0x611000003100 is located 0 bytes inside of 256-byte region [0x611000003100,0x611000003200)
freed by thread T0 here:
    #0 0x7f471a34a1e5 in __interceptor_free (/lib64/libasan.so.4+0xd81e5)
    #1 0x4aefbc in zfree_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:400
    #2 0x4aeff5 in valkey_free /home/gusakovy/Projects/valkey/src/zmalloc.c:415
    #3 0x4707d2 in rehashingCompleted /home/gusakovy/Projects/valkey/src/hashtable.c:456
    #4 0x471b5b in resize /home/gusakovy/Projects/valkey/src/hashtable.c:656
    #5 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #6 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #7 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #8 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #9 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #10 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #11 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #12 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #13 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #14 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #15 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #16 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #17 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #18 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #19 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #20 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #21 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #22 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #23 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #24 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

previously allocated by thread T0 here:
    #0 0x7f471a34a753 in __interceptor_calloc (/lib64/libasan.so.4+0xd8753)
    #1 0x4ae48c in ztrycalloc_usable_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:214
    #2 0x4ae757 in valkey_calloc /home/gusakovy/Projects/valkey/src/zmalloc.c:257
    #3 0x4718fc in resize /home/gusakovy/Projects/valkey/src/hashtable.c:645
    #4 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #5 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #6 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #7 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #8 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #9 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #10 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #11 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #12 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #13 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #14 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #15 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #16 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #17 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #18 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #19 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #20 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #21 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #22 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #23 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

SUMMARY: AddressSanitizer: heap-use-after-free /home/gusakovy/Projects/valkey/src/hashtable.c:422 in isPositionFilled
Shadow bytes around the buggy address:
  0x0c227fff85d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85f0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8610: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c227fff8620:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8630: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8640: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8650: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8660: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8670: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==46774==ABORTING
```


**Solution**
Suggested solution is to also pause auto shrinking during
`hashtableScanDefrag`. I noticed that there was already a
`hashtablePauseAutoShrink` method and `pause_auto_shrink` counter, but
it wasn't actually used in `hashtableShrinkIfNeeded` so I fixed that.

**Testing**
I created a simple tcl test that (most of the time) triggers this
error, but it's a little clunky, so I didn't add it as part of the PR:

```
start_server {tags {"expire hashtable defrag"}} {
    test {hashtable scan defrag on expiry} {

        r config set hz 100

        set num_keys 20
        for {set i 0} {$i < $num_keys} {incr i} {
            r set "key_$i" "value_$i"
        }

        for {set j 0} {$j < 50} {incr j} {
            set expire_keys 100
            for {set i 0} {$i < $expire_keys} {incr i} {
                # Short expiry time to ensure they expire quickly
                r psetex "expire_key_${i}_${j}" 100 "expire_value_${i}_${j}"
            }

            # Verify keys are set
            set initial_size [r dbsize]
            assert_equal $initial_size [expr $num_keys + $expire_keys]
            
            after 150
            for {set i 0} {$i < 10} {incr i} {
                r get "expire_key_${i}_${j}"
                after 10
            }
        }

        set remaining_keys [r dbsize]
        assert_equal $remaining_keys $num_keys

        # Verify server is still responsive
        assert_equal [r ping] {PONG}
    } {}
}
```
Compiling with ASAN using `make noopt SANITIZER=address valkey-server`
and running the test causes error above. Applying the fix resolves the
issue.

Signed-off-by: Yakov Gusakov <[email protected]>
ranshid added 2 commits July 1, 2025 17:41
After the changes to vset, we need to check if the set is empty and not check its pointer

Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid
Owner Author

ranshid commented Jul 1, 2025

I see there are new memory leaks. Will search for them.

Nasty bug. Should be fixed now.

@ranshid
Owner Author

ranshid commented Jul 3, 2025

Note for the reviewers:

I think the current implementation of vsetUpdateEntry is very sub-optimal: it always removes and re-inserts the entry. I think this performs badly in the defrag case and also in cases where the entry is not switching buckets. I will update the code with a change for this.
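
A sketch of that fast path (names assumed; the mask mirrors the window
alignment formula used elsewhere in this thread): an update only needs a
remove-and-reinsert when the new expiry lands in a different bucket.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* True when the update must move the entry between buckets; otherwise
 * (same window) the entry pointer can be swapped in place, which covers
 * the defrag case. */
static bool update_needs_bucket_move(mstime_t old_expiry, mstime_t new_expiry,
                                     mstime_t interval) {
    return ((old_expiry & ~(interval - 1)) + interval) !=
           ((new_expiry & ~(interval - 1)) + interval);
}
```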

@ranshid
Owner Author

ranshid commented Jul 3, 2025

Note for the reviewers:

I think the current implementation of vsetUpdateEntry is very sub-optimal: it always removes and re-inserts the entry. I think this performs badly in the defrag case and also in cases where the entry is not switching buckets. I will update the code with a change for this.

done

@ranshid ranshid merged commit 413f3a1 into ttl-poc-new Jul 4, 2025
28 of 34 checks passed
@ranshid ranshid mentioned this pull request Jul 4, 2025
@ranshid
Owner Author

ranshid commented Jul 4, 2025

So this PR was accidentally merged to the origin branch. I reverted these changes, but I cannot reopen the PR and most of the comments here are cleared. Let's move the work to #5.

ranshid added a commit that referenced this pull request Aug 5, 2025
-------------

   Overview:
   ---------
   This PR introduces a complete redesign of the 'vset' (stands for volatile set) data structure,
   creating an adaptive container for expiring entries. The new design is
   memory-efficient, scalable, and dynamically promotes/demotes its internal
   representation depending on runtime behavior and volume.

   The core concept uses a single tagged pointer (`expiry_buckets`) that encodes
   one of several internal structures:
       - NONE   (-1): Empty set
       - SINGLE (0x1): One entry
       - VECTOR (0x2): Sorted vector of entry pointers
       - HT     (0x4): Hash table for larger buckets with many entries
       - RAX    (0x6): Radix tree (keyed by aligned expiry timestamps)

   This allows the set to grow and shrink seamlessly while optimizing for both
   space and performance.

   Motivation:
   -----------
   The previous design lacked flexibility in high-churn environments or
   workloads with skewed expiry distributions. This redesign enables dynamic
   layout adjustment based on the time distribution and volume of the inserted
   entries, while maintaining fast expiry checks and minimal memory overhead.

   Key Concepts:
   -------------
   - All pointers stored in the structure must be **odd-aligned** to preserve
     3 bits for tagging. This is safe with SDS strings (which set the LSB).

   - Buckets evolve automatically:
       - Start as NONE.
       - On first insert → become SINGLE.
       - If another entry with similar expiry → promote to VECTOR.
       - If VECTOR exceeds 127 entries → convert to RAX.
       - If a RAX bucket's vector fills and cannot split → promote to HT.

   - Each vector bucket is kept sorted by `entry->getExpiry()`.
   - Binary search is used for efficient insertion and splitting.

 # Coarse Buckets Expiration System for Hash Fields

 This PR introduces **coarse-grained expiration buckets** to support per-field
 expirations in hash types — a feature known as *volatile fields*.

 It enables scalable expiration tracking by grouping fields into time-aligned
 buckets instead of individually tracking exact timestamps.

 ## Motivation
 Valkey traditionally supports key-level expiration. However, in many applications,
 there's a strong need to expire individual fields within a hash (e.g., session keys,
 token caches, etc.).

 Tracking these at fine granularity is expensive and potentially unscalable, so
 this implementation introduces *bucketed expirations* to batch expirations together.

 ## Bucket Granularity and Timestamp Handling

 - Each expiration bucket represents a time slice of fixed width (e.g., 8192 ms).
 - Expiring fields are mapped to the **end** of a time slice (not the floor).
 - This design facilitates:
   - Efficient *splitting* of large buckets when needed
   - *Downgrading* buckets when fields permit tighter packing
   - Coalescing during lazy cleanup or memory pressure

 ### Example Calculation

 Suppose a field has an expiration time of `1690000123456` ms and the max bucket
 interval is 8192 ms:

 ```
 BUCKET_INTERVAL_MAX = 8192;
 expiry = 1690000123456;

 bucket_ts = (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
           = (1690000123456 & ~8191) + 8192
           = 1690000121856 + 8192
           = 1690000130048
 ```

 The field is stored in a bucket that **ends at** `1690000130048` ms.

 ### Bucket Alignment Diagram

 ```
 Time (ms) →
 |----------------|----------------|----------------|
  8192 ms buckets → 1690000121856    1690000130048
                     ^               ^
                     |               |
               expiry floor     assigned bucket end
 ```

 ## Bucket Placement Logic

 - If a suitable bucket **already exists** (i.e., its `end_ts > expiry`), the field is added.
 - If no bucket covers the `expiry`, a **new bucket** is created at the computed `end_ts`.

 ## Bucket Downgrade Conditions

 Buckets are downgraded to smaller intervals when overpopulated (>127 fields).
 This happens when **all fields fit into a tighter bucket**.

 Downgrade rule:
 ```
 (max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN < current_bucket_ts
 ```
 If the above holds, all fields can be moved to a tighter bucket interval.
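
 A direct transcription of this rule as code (a sketch; the 16 ms minimum
 interval is taken from the vset constants earlier in this thread):

 ```
 #define BUCKET_INTERVAL_MIN 16LL /* ms */

 /* All fields fit a tighter bucket iff even the latest expiry, re-aligned
  * at the minimum interval, ends before the current bucket's end. */
 static int can_downgrade(long long max_expiry, long long current_bucket_ts) {
     long long tight_end =
         (max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN;
     return tight_end < current_bucket_ts;
 }
 ```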

 ### Downgrade Bucket — Diagram

 ```
 Before downgrade:

   Current Bucket (8192 ms)
   |----------------------------------------|
   | Field A | Field B | Field C | Field D  |
   | exp=+30 | +200    | +500    | +1500    |
   |----------------------------------------|
                     ↑
        All expiries fall before tighter boundary

 After downgrade to 1024 ms:

   New Bucket (1024 ms)
   |------------------|
   | A | B | C | D    |
   |------------------|
 ```

 ### Bucket Split Strategy

 If downgrade is not possible, the bucket is **split**:
 - Fields are sorted by expiration time.
 - A subset that fits in an earlier bucket is moved out.
 - Remaining fields stay in the original bucket.

 ### Split Bucket — Diagram

 ```
 Before split:

   Large Bucket (8192 ms)
   |--------------------------------------------------|
   | A | B | C | D | E | F | G | H | I | J | ... | Z  |
   |---------------- Sorted by expiry ---------------|
             ↑
      Fields A–L can be moved to an earlier bucket

 After split:

   Bucket 1 (end=1690000125952)     Bucket 2 (end=1690000130048)
   |------------------------|       |------------------------|
   | A | B | C | ... | L    |       | M | N | O | ... | Z    |
   |------------------------|       |------------------------|
 ```

 ## Summary of Bucket Behavior
 | Scenario                        | Action Taken                 |
 |--------------------------------|------------------------------|
 | No bucket covers expiry        | New bucket is created        |
 | Existing bucket fits           | Field is added               |
 | Bucket overflows (>127 fields) | Downgrade or split attempted |

API Changes:
------------

   Create/Free:
      void vsetInit(vset *set);
      void vsetClear(vset *set);

  Mutation:
      bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
      bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
      bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
                                 void *new_entry, long long old_expiry,
                                 long long new_expiry);

  Expiry Retrieval:
      long long vsetEstimatedEarliestExpiry(vset *set, vsetGetExpiryFunc getExpiry);
      size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc, mstime_t now, size_t max_count, void *ctx);

  Utilities:
      bool vsetIsEmpty(vset *set);
      size_t vsetMemUsage(vset *set);

  Iteration:
      void vsetStart(vset *set, vsetIterator *it);
      bool vsetNext(vsetIterator *it, void **entryptr);
      void vsetStop(vsetIterator *it);

   Entry Requirements:
   -------------------
   All entries must conform to the following interface via `volatileEntryType`:

       sds entryGetKey(const void *entry);         // for deduplication
       long long getExpiry(const void *entry);     // used for bucketing
       int expire(void *db, void *o, void *entry); // used for expiration callbacks

   Diagrams:
   ---------

   1. Tagged Pointer Representation
      -----------------------------
      Lower 3 bits of `expiry_buckets` encode bucket type:

         +------------------------------+
         |     pointer       | TAG (3b) |
         +------------------------------+
           ↑
           masked via VSET_PTR_MASK

         TAG values:
           0x1 → SINGLE
           0x2 → VECTOR
           0x4 → HT
           0x6 → RAX

   2. Evolution of the Bucket
      ------------------------

 *Volatile set top-level structure:*

```
+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+
```

*If the top-level element is a RAX, it has child buckets of type:*

```
+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+
```

*Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).*

   3. RAX Structure with Expiry-Aligned Keys
      --------------------------------------
      Buckets in RAX are indexed by aligned expiry timestamps:

         +------------------------------+
         | RAX key (bucket_ts) → Bucket|
         +------------------------------+
         |     0x00000020  → VECTOR     |
         |     0x00000040  → VECTOR     |
         |     0x00000060  → HT         |
         +------------------------------+

   4. Bucket Splitting (Inside RAX)
      -----------------------------
      If a vector bucket in a RAX fills:
        - Binary search for best split point.
        - Use `getExpiry(entry)` + `get_bucket_ts()` to find transition.
        - Create 2 new buckets and update RAX.

         Original:
             [entry1, entry2, ..., entryN]  ← bucket_ts = 64ms

         After split:
             [entry1, ..., entryK]  → bucket_ts = 32ms
             [entryK+1, ..., entryN] → bucket_ts = 64ms

      If all entries share same bucket_ts → promote to HT.

   5. Shrinking Behavior
      ------------------
      On deletion:
        - HT may shrink to VECTOR.
        - VECTOR with 1 item → becomes SINGLE.
        - If RAX has only one key left, it’s promoted up.

   Summary:
   --------
   This redesign provides:
     ✓ Fine-grained memory control
     ✓ High scalability for bursty TTL data
     ✓ Fast expiry checks via windowed organization
     ✓ Minimal overhead for sparse sets
     ✓ Flexible binary-search-based sorting and bucketing

   It also lays the groundwork for future enhancements, including metrics,
   prioritized expiry policies, or segmented cleaning.

Signed-off-by: Ran Shidlansik <[email protected]>
ranshid pushed a commit that referenced this pull request Sep 30, 2025
…y-io#2257)

ranshid pushed a commit that referenced this pull request Oct 10, 2025
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of the
filter. However, we introduced a memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform
filtering, whereas for the remaining filters we only pick the last filter
option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
(cherry picked from commit 155b0bb)
Signed-off-by: cherukum-amazon <[email protected]>