@ranshid ranshid commented Jul 4, 2025

Note that this continues the work of #3.

Overview:

This PR introduces a complete redesign of the 'vset' (stands for volatile set) data structure,
creating an adaptive container for expiring entries. The new design is
memory-efficient, scalable, and dynamically promotes/demotes its internal
representation depending on runtime behavior and volume.

The core concept uses a single tagged pointer (expiry_buckets) that encodes
one of several internal structures:
- NONE (-1): Empty set
- SINGLE (0x1): One entry
- VECTOR (0x2): Sorted vector of entry pointers
- HT (0x4): Hash table for larger buckets with many entries
- RAX (0x6): Radix tree (keyed by aligned expiry timestamps)

This allows the set to grow and shrink seamlessly while optimizing for both
space and performance.
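
A minimal sketch of how a tag could be packed into the low 3 bits of an aligned pointer. Only the tag values come from the list above; the helper names (bucketTag, bucketType, bucketPtr) and the mask constant are hypothetical, and the sketch assumes 8-byte-aligned pointers rather than reproducing how the PR encodes SINGLE entries.

```c
#include <stdint.h>

/* Tag values as listed in the description; names are illustrative only. */
enum {
    BUCKET_SINGLE = 0x1,
    BUCKET_VECTOR = 0x2,
    BUCKET_HT = 0x4,
    BUCKET_RAX = 0x6,
};

/* Assumed 3-bit mask; the PR refers to a VSET_PTR_MASK for the same purpose. */
#define BUCKET_TAG_MASK ((uintptr_t)0x7)

/* NONE is -1: all bits set, so its low 3 bits (0x7) match no tag above. */
#define BUCKET_NONE ((void *)(intptr_t)-1)

/* Attach a tag to a pointer whose low 3 bits are zero (8-byte aligned). */
static inline void *bucketTag(void *ptr, unsigned tag) {
    return (void *)(((uintptr_t)ptr & ~BUCKET_TAG_MASK) | (uintptr_t)tag);
}

/* Read the bucket type back from the low 3 bits. */
static inline unsigned bucketType(const void *tagged) {
    return (unsigned)((uintptr_t)tagged & BUCKET_TAG_MASK);
}

/* Strip the tag to recover the original pointer. */
static inline void *bucketPtr(const void *tagged) {
    return (void *)((uintptr_t)tagged & ~BUCKET_TAG_MASK);
}
```

Because the tag travels with the pointer, the set itself needs no separate type field, which is what lets an empty or single-entry set cost only one machine word.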

Motivation:

The previous design lacked flexibility in high-churn environments or
workloads with skewed expiry distributions. This redesign enables dynamic
layout adjustment based on the time distribution and volume of the inserted
entries, while maintaining fast expiry checks and minimal memory overhead.

Key Concepts:

  • All pointers stored in the structure must be odd-aligned to preserve
    3 bits for tagging. This is safe with SDS strings (which set the LSB).

  • Buckets evolve automatically:

    • Start as NONE.
    • On first insert → become SINGLE.
    • If another entry with similar expiry → promote to VECTOR.
    • If VECTOR exceeds 127 entries → convert to RAX.
    • If a RAX bucket's vector fills and cannot split → promote to HT.
  • Each vector bucket is kept sorted by entry->getExpiry().

  • Binary search is used for efficient insertion and splitting.
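
As an illustration of keeping a vector bucket sorted by expiry, the sketch below finds the insertion point with a binary search and shifts the tail. The toyEntry type and all function names are hypothetical stand-ins, and the caller is assumed to have reserved room for one more pointer.

```c
#include <stddef.h>
#include <string.h>

/* Toy entry and accessor; the real entries are opaque to the set. */
typedef struct {
    long long expiry;
} toyEntry;

typedef long long (*toyGetExpiryFn)(const void *entry);

static long long toyGetExpiry(const void *entry) {
    return ((const toyEntry *)entry)->expiry;
}

/* Index of the first element whose expiry is strictly greater than `expiry`. */
static size_t vecUpperBound(void *const *vec, size_t len, toyGetExpiryFn getExpiry, long long expiry) {
    size_t lo = 0, hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (getExpiry(vec[mid]) <= expiry) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Insert `entry` keeping `vec` sorted; `vec` must have capacity > len.
 * Returns the new length. */
static size_t vecInsertSorted(void **vec, size_t len, toyGetExpiryFn getExpiry, void *entry) {
    size_t pos = vecUpperBound(vec, len, getExpiry, getExpiry(entry));
    memmove(&vec[pos + 1], &vec[pos], (len - pos) * sizeof(void *));
    vec[pos] = entry;
    return len + 1;
}
```

Inserting after equal expiries (an upper bound) keeps insertion stable; whether the PR does the same is not stated.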

Coarse Buckets Expiration System for Hash Fields

This PR introduces coarse-grained expiration buckets to support per-field
expirations in hash types — a feature known as volatile fields.

It enables scalable expiration tracking by grouping fields into time-aligned
buckets instead of individually tracking exact timestamps.

Motivation

Valkey traditionally supports key-level expiration. However, in many applications,
there's a strong need to expire individual fields within a hash (e.g., session keys,
token caches, etc.).

Tracking these at fine granularity is expensive and potentially unscalable, so
this implementation introduces bucketed expirations to batch expirations together.

Bucket Granularity and Timestamp Handling

  • Each expiration bucket represents a time slice of fixed width (e.g., 8192 ms).
  • Expiring fields are mapped to the end of a time slice (not the floor).
  • This design facilitates:
    • Efficient splitting of large buckets when needed
    • Downgrading buckets when fields permit tighter packing
    • Coalescing during lazy cleanup or memory pressure

Example Calculation

Suppose a field has an expiration time of 1690000123456 ms and the max bucket
interval is 8192 ms:

BUCKET_INTERVAL_MAX = 8192;
expiry = 1690000123456;

bucket_ts = (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
          = (1690000123456 & ~8191) + 8192
          = 1690000121856 + 8192
          = 1690000130048

The field is stored in a bucket that ends at 1690000130048 ms.

Bucket Alignment Diagram

Time (ms) →
|----------------|----------------|----------------|
 8192 ms buckets → 1690000121856   1690000130048
                    ^               ^
                    |               |
              expiry floor     assigned bucket end

Bucket Placement Logic

  • If a suitable bucket already exists (i.e., its end_ts > expiry), the field is added.
  • If no bucket covers the expiry, a new bucket is created at the computed end_ts.
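
The alignment and placement rules above can be sketched as two small helpers. The function names are hypothetical, and the interval is assumed to be a power of two so that the mask trick is valid.

```c
#include <stdbool.h>

/* End timestamp of the bucket that `expiry` maps to.
 * `interval` must be a power of two (e.g. 8192). */
static inline long long bucketEndTs(long long expiry, long long interval) {
    return (expiry & ~(interval - 1LL)) + interval;
}

/* A bucket can hold an expiry only if it ends strictly after it. */
static inline bool bucketCovers(long long end_ts, long long expiry) {
    return end_ts > expiry;
}
```

With small numbers: an expiry of 10000 and an interval of 8192 floors to 8192 and maps to the bucket ending at 16384. An expiry that falls exactly on a boundary (8192) also maps to 16384, so a bucket's end is always strictly greater than every expiry stored in it.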

Bucket Downgrade Conditions

Buckets are downgraded to smaller intervals when overpopulated (>127 fields).
This happens when all fields fit into a tighter bucket.

Downgrade rule:

(max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN < current_bucket_ts

If the above holds, all fields can be moved to a tighter bucket interval.
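
The downgrade rule translates directly into a predicate. This is a sketch under the assumption that the minimum interval is a power of two; the function name is hypothetical.

```c
#include <stdbool.h>

/* True when the latest expiry in the bucket, re-aligned to the tighter
 * interval, still ends before the current bucket does. */
static inline bool canDowngrade(long long max_expiry, long long interval_min, long long current_bucket_ts) {
    return (max_expiry & ~(interval_min - 1LL)) + interval_min < current_bucket_ts;
}
```

For a current bucket ending at 16384 and a minimum interval of 1024, a maximum expiry of 9000 re-aligns to 9216 and the downgrade is allowed, while a maximum expiry of 16000 re-aligns to 16384 and is rejected because the tighter bucket would end at the same point.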

Downgrade Bucket — Diagram

Before downgrade:

  Current Bucket (8192 ms)
  |----------------------------------------|
  | Field A | Field B | Field C | Field D  |
  | exp=+30 | +200    | +500    | +900     |
  |----------------------------------------|
                    ↑
       All expiries fall before tighter boundary

After downgrade to 1024 ms:

  New Bucket (1024 ms)
  |------------------|
  | A | B | C | D     |
  |------------------|

Bucket Split Strategy

If downgrade is not possible, the bucket is split:

  • Fields are sorted by expiration time.
  • A subset that fits in an earlier bucket is moved out.
  • Remaining fields stay in the original bucket.
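
One possible way to find the split point is sketched below. It assumes the expiries are already sorted and re-aligns them to a finer power-of-two interval; entries that map to an earlier bucket end than the last entry are the ones moved out. The policy and the names are assumptions for illustration, not necessarily what the PR implements.

```c
#include <stddef.h>

/* Bucket end for an expiry at the given power-of-two interval. */
static inline long long splitBucketEndTs(long long expiry, long long interval) {
    return (expiry & ~(interval - 1LL)) + interval;
}

/* Returns the index of the first entry that stays in the later bucket.
 * Entries [0, index) fit in an earlier bucket. A result of 0 means every
 * entry maps to the same bucket end, so no split is possible. */
static size_t findSplitIndex(const long long *expiries, size_t len, long long interval) {
    if (len == 0) return 0;
    long long last_ts = splitBucketEndTs(expiries[len - 1], interval);
    size_t lo = 0, hi = len - 1;
    /* Binary search for the first entry sharing the last entry's bucket end. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (splitBucketEndTs(expiries[mid], interval) < last_ts) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

Because the input is sorted, the bucket end is monotonic across the array, which is what makes a binary search valid here.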

Split Bucket — Diagram

Before split:

  Large Bucket (8192 ms)
  |--------------------------------------------------|
  | A | B | C | D | E | F | G | H | I | J | ... | Z  |
  |---------------- Sorted by expiry ---------------|
            ↑
     Fields A–L can be moved to an earlier bucket

After split:

  Bucket 1 (end=1690000129024)     Bucket 2 (end=1690000130048)
  |------------------------|       |------------------------|
  | A | B | C | ... | L     |       | M | N | O | ... | Z    |
  |------------------------|       |------------------------|

Summary of Bucket Behavior

| Scenario | Action Taken |
|---|---|
| No bucket covers expiry | New bucket is created |
| Existing bucket fits | Field is added |
| Bucket overflows (>127 fields) | Downgrade or split attempted |

API Changes:

Create/Free:
void vsetInit(vset *set);
void vsetClear(vset *set);

Mutation:
bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
                     void *new_entry, long long old_expiry,
                     long long new_expiry);

Expiry Retrieval:
long long vsetEstimatedEarliestExpiry(vset *set, vsetGetExpiryFunc getExpiry);
size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc, mstime_t now, size_t max_count, void *ctx);

Utilities:
bool vsetIsEmpty(vset *set);
size_t vsetMemUsage(vset *set);

Iteration:
void vsetStart(vset *set, vsetIterator *it);
bool vsetNext(vsetIterator *it, void **entryptr);
void vsetStop(vsetIterator *it);

Entry Requirements:

All entries must conform to the following interface via volatileEntryType:

   sds entryGetKey(const void *entry);         // for deduplication
   long long getExpiry(const void *entry);     // used for bucketing
   int expire(void *db, void *o, void *entry); // used for expiration callbacks
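
To show how the expiry accessor and the expiration callback might cooperate, here is a toy stand-in that pops expired entries from a plain sorted array. It is not the vset implementation: toyEntry, toyPopExpired and the callback names are hypothetical, and treating `expiry <= now` as expired is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *field;
    long long expiry;
} toyEntry;

typedef long long (*toyGetExpiryFn)(const void *entry);
typedef int (*toyExpiryFn)(void *entry, void *ctx);

/* Accessor used for ordering and expiry checks. */
static long long toyGetExpiry(const void *entry) {
    return ((const toyEntry *)entry)->expiry;
}

/* Expiry callback: counts reclaimed entries through ctx. */
static int toyOnExpire(void *entry, void *ctx) {
    (void)entry;
    ++*(size_t *)ctx;
    return 1;
}

/* Pop up to max_count expired entries from the front of a sorted array,
 * invoking the optional callback for each. Returns the number popped. */
static size_t toyPopExpired(void **sorted, size_t *len, toyGetExpiryFn getExpiry,
                            toyExpiryFn onExpire, long long now, size_t max_count, void *ctx) {
    size_t popped = 0;
    while (popped < *len && popped < max_count && getExpiry(sorted[popped]) <= now) {
        if (onExpire) onExpire(sorted[popped], ctx);
        popped++;
    }
    memmove(sorted, sorted + popped, (*len - popped) * sizeof(void *));
    *len -= popped;
    return popped;
}
```

The max_count bound mirrors the parameter of the same name in vsetPopExpired and keeps a single call from doing unbounded work.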

Diagrams:

  1. Tagged Pointer Representation

    Lower 3 bits of expiry_buckets encode bucket type:

    +------------------------------+
    | pointer          | TAG (3b)  |
    +------------------------------+

    masked via VSET_PTR_MASK

    TAG values:
    0x1 → SINGLE
    0x2 → VECTOR
    0x4 → HT
    0x6 → RAX

  2. Evolution of the Bucket

Volatile set top-level structure:

+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+

If the top-level element is a RAX, it has child buckets of type:

+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+

Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).

  3. RAX Structure with Expiry-Aligned Keys

    Buckets in RAX are indexed by aligned expiry timestamps:

    +-------------------------------+
    | RAX key (bucket_ts) → Bucket  |
    +-------------------------------+
    | 0x00000020 → VECTOR           |
    | 0x00000040 → VECTOR           |
    | 0x00000060 → HT               |
    +-------------------------------+

  4. Bucket Splitting (Inside RAX)

    If a vector bucket in a RAX fills:

    • Binary search for best split point.
    • Use getExpiry(entry) + get_bucket_ts() to find transition.
    • Create 2 new buckets and update RAX.

    Original:
    [entry1, entry2, ..., entryN] ← bucket_ts = 64ms

    After split:
    [entry1, ..., entryK] → bucket_ts = 32ms
    [entryK+1, ..., entryN] → bucket_ts = 64ms

    If all entries share same bucket_ts → promote to HT.

  5. Shrinking Behavior

    On deletion:

    • HT may shrink to VECTOR.
    • VECTOR with 1 item → becomes SINGLE.
    • If RAX has only one key left, its single bucket is promoted to the top level.

Summary:

This redesign provides:
✓ Fine-grained memory control
✓ High scalability for bursty TTL data
✓ Fast expiry checks via windowed organization
✓ Minimal overhead for sparse sets
✓ Flexible binary-search-based sorting and bucketing

It also lays the groundwork for future enhancements, including metrics,
prioritized expiry policies, or segmented cleaning.

ranshid added 30 commits June 19, 2025 09:04
Squashed commit of the following:

commit af11752
Author: Ran Shidlansik <[email protected]>
Date:   Thu Jun 19 08:56:12 2025 +0300

    more PR comments

    Signed-off-by: Ran Shidlansik <[email protected]>

commit a39fa20
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 19:35:55 2025 +0300

    update comment

    Signed-off-by: Ran Shidlansik <[email protected]>

commit ec73906
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 19:33:59 2025 +0300

    more pr comments being addressed

    Signed-off-by: Ran Shidlansik <[email protected]>

commit a4aa35c
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 18:51:24 2025 +0300

    move parseExtendedCommandArgumentsOrReply to server.c

    Signed-off-by: Ran Shidlansik <[email protected]>

commit f4a8786
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 18:48:49 2025 +0300

    address some more review comments

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 8ecd584
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 18:12:03 2025 +0300

    add missing entry.o

    Signed-off-by: Ran Shidlansik <[email protected]>

commit ee916d8
Merge: 156c4a5 a1f4cd6
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 17:28:04 2025 +0300

    Merge remote-tracking branch 'origin/unstable' into ttl-poc-new

commit 156c4a5
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 17:10:41 2025 +0300

    address more PR comments

    Signed-off-by: Ran Shidlansik <[email protected]>

commit de675bc
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 18 13:11:48 2025 +0300

    minot review fixes

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 9d39b40
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jun 16 13:10:45 2025 +0300

    Revert " partial work. introduce set expirations"

    This reverts commit 04f2006.

commit 04f2006
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jun 16 13:08:25 2025 +0300

     partial work. introduce set expirations

    Signed-off-by: Ran Shidlansik <[email protected]>

commit cd674be
Author: Ran Shidlansik <[email protected]>
Date:   Sun Jun 15 14:17:05 2025 +0300

    fix misspel in test

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 25954b3
Author: Ran Shidlansik <[email protected]>
Date:   Sun Jun 15 13:35:58 2025 +0300

    fix flakey test with EXPIREAT

    Signed-off-by: Ran Shidlansik <[email protected]>

commit aee670e
Author: Ran Shidlansik <[email protected]>
Date:   Sun Jun 15 11:36:27 2025 +0300

    fix some more memory leaks

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 2fea8e7
Author: Ran Shidlansik <[email protected]>
Date:   Fri Jun 13 15:02:13 2025 +0300

    fix memory leak issue

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 56db999
Author: Ran Shidlansik <[email protected]>
Date:   Thu Jun 12 17:21:51 2025 +0300

    fix bad compilation

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 1b1ce58
Author: Ran Shidlansik <[email protected]>
Date:   Thu Jun 12 17:17:58 2025 +0300

    add missing files

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 7a89a70
Author: Ran Shidlansik <[email protected]>
Date:   Thu Jun 12 17:15:38 2025 +0300

    Separate hash entry implementation

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 23bb4a2
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 19:25:32 2025 +0300

    extend the comment

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 2a5e9a2
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 18:52:53 2025 +0300

    fix merge issues

    Signed-off-by: Ran Shidlansik <[email protected]>

commit e7683b6
Merge: 12151e5 c41ffc3
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 18:52:18 2025 +0300

    Merge remote-tracking branch 'origin/unstable' into ttl-poc-new

commit 12151e5
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 18:47:11 2025 +0300

    fix some bugs and added HPERSIST tests to help schema validator

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 846f943
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 14:06:58 2025 +0300

    fix new commands json and enable silent tests

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 51f9bdc
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 11:40:09 2025 +0300

    better enforce fields number to match the number of provided fields in httl and hpersist commands

    Signed-off-by: Ran Shidlansik <[email protected]>

commit c1cefec
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 10:17:31 2025 +0300

    fix reply schema of commands fetching the hash field ttl

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 4db5b7c
Author: Ran Shidlansik <[email protected]>
Date:   Tue Jun 10 09:47:05 2025 +0300

    fix hexpire flaky test

    Signed-off-by: Ran Shidlansik <[email protected]>

commit f4ae1a2
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jun 9 22:02:39 2025 +0300

    remove fmacros include from volatile_set

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 3190eb4
Merge: 99d25d3 1941d28
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jun 9 21:15:07 2025 +0300

    Merge remote-tracking branch 'origin/unstable' into ttl-poc-new

commit 99d25d3
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jun 9 19:56:14 2025 +0300

    completely remove server level access context 9it was mainly used for lazy expiration logic)

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 2e082db
Author: Ran Shidlansik <[email protected]>
Date:   Thu Jun 5 09:34:16 2025 +0300

    exlude hexpire tests from external tests

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 2c4c312
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 15:36:32 2025 +0300

    switch hashtable type only when object has volatile items

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 124acbe
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 14:45:19 2025 +0300

    return syntax error when fields is not provided in new API arguments

    Signed-off-by: Ran Shidlansik <[email protected]>

commit dd071f9
Merge: 12a35e2 5699c8c
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 13:02:30 2025 +0300

    Merge remote-tracking branch 'origin/unstable' into ttl-poc-new

commit 12a35e2
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 13:02:03 2025 +0300

    remove lazy expiration logic and tests

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 0cadaec
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 12:27:39 2025 +0300

    copy hash object should also copy the fields ttl

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 79b7e78
Author: Ran Shidlansik <[email protected]>
Date:   Wed Jun 4 12:18:48 2025 +0300

     remove metadata from hash entry

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 8654080
Author: Ran Shidlansik <[email protected]>
Date:   Wed May 21 19:41:59 2025 +0300

    make sure to remove the volatile set on hash object detructor

    Signed-off-by: Ran Shidlansik <[email protected]>

commit eab6fc4
Author: Ran Shidlansik <[email protected]>
Date:   Wed May 21 16:01:56 2025 +0300

    fix trackUpdate condition

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 6d9551e
Author: Ran Shidlansik <[email protected]>
Date:   Wed May 21 15:16:53 2025 +0300

    fix bad memory access issue on entry tracking update

    Signed-off-by: Ran Shidlansik <[email protected]>

commit d719dcd
Author: xbasel <[email protected]>
Date:   Wed May 21 14:50:18 2025 +0300

    Hash TTL - add tests (#1)

    * add tests

    Signed-off-by: xbasel <[email protected]>

    * fix a bug - return on error

    Signed-off-by: xbasel <[email protected]>

    * disable failing tests

    Signed-off-by: xbasel <[email protected]>

    * rmeove redundant test

    Signed-off-by: xbasel <[email protected]>

    * Update tests/unit/hashexpire.tcl

    ---------

    Signed-off-by: xbasel <[email protected]>
    Co-authored-by: Ran Shidlansik <[email protected]>

commit e604b37
Author: Ran Shidlansik <[email protected]>
Date:   Wed May 21 14:43:49 2025 +0300

    make hashtable call entry destructor on delete access

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 0b8dc03
Author: Ran Shidlansik <[email protected]>
Date:   Wed May 21 14:31:44 2025 +0300

    centralize keyspace and key signal notifications to the reset context

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 753ba3c
Author: Ran Shidlansik <[email protected]>
Date:   Tue May 20 14:19:34 2025 +0300

    fix object pass to keyspace notification in HSETEX

    Signed-off-by: Ran Shidlansik <[email protected]>

commit c5b8d76
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 21:47:42 2025 +0300

    fix formatting issue

    Signed-off-by: Ran Shidlansik <[email protected]>

commit e72d7a6
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 21:25:17 2025 +0300

    fix build issues

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 36b7356
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 21:13:15 2025 +0300

    allow setting the key object in context

    Signed-off-by: Ran Shidlansik <[email protected]>

commit a6844ac
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 20:15:18 2025 +0300

    add commands json files

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 20c0d29
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 20:13:23 2025 +0300

    fix hexpire propagation to use hpexpireat

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 5e19c90
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 17:19:41 2025 +0300

    fix HGETEX replication handling

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 31923c5
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:46:31 2025 +0300

    make httl functions verify the type

    Signed-off-by: Ran Shidlansik <[email protected]>

commit b782d44
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:36:22 2025 +0300

    fix case of hll command issues on non-existing listpack encoded hash

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 0723625
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:24:59 2025 +0300

    Fix HEXPIRE parse limits

    Signed-off-by: Ran Shidlansik <[email protected]>

commit d97e23f
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:19:53 2025 +0300

    fix FNX/FXX logic

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 4301399
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:08:02 2025 +0300

    fix wrong assert condition on update entry

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 6465314
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 15:00:20 2025 +0300

    handle negative ttl correctly

    Signed-off-by: Ran Shidlansik <[email protected]>

commit f62c163
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 14:20:49 2025 +0300

    format fixes

    Signed-off-by: Ran Shidlansik <[email protected]>

commit a59f31a
Author: Ran Shidlansik <[email protected]>
Date:   Mon May 19 14:17:19 2025 +0300

    Add support for HGETEX and HSETEX

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 4a09f3d
Author: Ran Shidlansik <[email protected]>
Date:   Sun May 18 12:16:44 2025 +0300

    free entry when calling hashTypeDelete

    Signed-off-by: Ran Shidlansik <[email protected]>

commit dd62037
Author: Ran Shidlansik <[email protected]>
Date:   Sun May 18 11:12:49 2025 +0300

    remove hashtable redundant log

    Signed-off-by: Ran Shidlansik <[email protected]>

commit ea039c4
Merge: fce9a43 8d686dd
Author: Ran Shidlansik <[email protected]>
Date:   Sun May 18 10:40:39 2025 +0300

    Merge remote-tracking branch 'origin/unstable' into ttl-poc-new

commit fce9a43
Author: Ran Shidlansik <[email protected]>
Date:   Sun May 18 10:39:22 2025 +0300

    fix cmake compilation

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 1f0c933
Author: Ran Shidlansik <[email protected]>
Date:   Sun May 18 10:34:27 2025 +0300

    avoid extra ref count incrementing in hashTypePropagateDeletion

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 6ee497c
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 21:32:30 2025 +0300

    fix some more format issues

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 90b7536
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 21:30:46 2025 +0300

    fix typo

    Signed-off-by: Ran Shidlansik <[email protected]>

commit fcce92b
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 21:28:02 2025 +0300

    fix expire propagation

    Signed-off-by: Ran Shidlansik <[email protected]>

commit cc7c2a3
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 20:48:12 2025 +0300

    handle some format check issues

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 61bd39a
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 20:45:18 2025 +0300

    fix some spelling checks

    Signed-off-by: Ran Shidlansik <[email protected]>

commit 89f56b0
Author: Ran Shidlansik <[email protected]>
Date:   Thu May 15 20:38:21 2025 +0300

    fix new introduced commands

    Signed-off-by: Ran Shidlansik <[email protected]>

commit ecdcce0
Author: Ran Shidlansik <[email protected]>
Date:   Mon Jan 6 10:46:47 2025 +0200

    Introduce HASH items expiration

    Signed-off-by: Ran Shidlansik <[email protected]>

introduce expire.h

Signed-off-by: Ran Shidlansik <[email protected]>
When an already-expired timestamp is provided to these commands, the replica will not process
it, so we propagate HDEL explicitly

Signed-off-by: Ran Shidlansik <[email protected]>
2. fix test_entry

Signed-off-by: Ran Shidlansik <[email protected]>
2. change pointer_vector to pVector
3. multiple pr comments fix

Signed-off-by: Ran Shidlansik <[email protected]>
… entry expiry.

This is the first change to reduce the vset default memory consumption.
Although this complicates the API, it reduces the memory footprint of
each hash object using the set.

The next potential step is to make the vset a pure bucket pointer so that it uses no extra memory.

I intentionally separated these changes so that we can decide whether sacrificing API friendliness is better
than consuming more memory.

Signed-off-by: Ran Shidlansik <[email protected]>
When HSET is called we do make sure to persist the field in case it has expiration.
This should (however) not be done for a volatile field which was NOT expired.

Signed-off-by: Ran Shidlansik <[email protected]>
/* Callback to be optionally provided to vsetPopExpired. when item is removed from the vset this callback will also be applied. */
typedef int (*vsetExpiryFunc)(void *entry, void *ctx);
// vset is just a pointer to a bucket
typedef void *vset;

This API is much cleaner now. I have a slight suggestion, and feel free to ignore if you want...

Using a typed pointer (not void) provides some additional safety by the compiler. I'm also a little hesitant to create the typedef including the pointer. We do this in sds (char*), and it creates some weirdness in usage - where the caller needs to understand that it's really a pointer anyway.

I suggest:

In the .h file:

typedef struct vset vset;  // This is a typed (not void) pointer (provides some type checking)

bool vsetAddEntry(vset **set, vsetGetExpiryFunc getExpiry, void *entry);

In the .c file:

// You can flesh out the details of your vset struct in the .c file
// (or you can cast to something else, if needed)
struct vset {
    ...
};

In calling code:

vset *mySet;  // Now this looks "normal" to a client

vsetInit(&mySet); // and this will be a consistent pattern

ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](#5).
[The third PR](#4) introduces the active expiration and defragmentation jobs.

For more high-level design details, see the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy expiration at this point, in order to reduce complexity and rely only on active expiration for memory reclamation. This will require us to keep improving the active expiration job and potentially consider introducing lazy-expiration support later on.
3. Although commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided not to add the DENYOOM flag to these commands (HSETEX is an exception) in order to be better aligned with key-level commands like `expire`.
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields that exist in the hash object (whether actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field that has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. a negative count) 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases an empty response may be returned to the caller, even if the hash contains fields that are either persistent or not yet expired.
5. When a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how top-level keys are handled, we will emit the hexpired keyspace event for that case (instead of hdel). An example is shown after this list.
6. We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

For example, decision 5 applies to the following case:
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds, which lets us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this, so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

An entry can also have an expiration timestamp, which is the UNIX timestamp at which it expires.
For aligned fast access, we keep the expiry timestamp before the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, write, and update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, in order to avoid breaking client applications that have already opted to use hash item TTLs.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if the hash object does exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.
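
As an illustration of the synopsis and option rules above, here is a minimal client-side sketch that assembles an `HSETEX` argument list. The function name and parameters are hypothetical (they are not part of any client library); the expiration is passed as a single parameter so that the mutually exclusive options cannot be combined.

```python
def build_hsetex(key, fields, key_cond=None, field_cond=None, expire=None):
    """Build the HSETEX argument list (hypothetical helper).

    fields     -- dict of field -> value (at least one pair)
    key_cond   -- None, "NX" or "XX"
    field_cond -- None, "FNX" or "FXX"
    expire     -- None, "KEEPTTL", or a (unit, amount) tuple where unit is
                  one of "EX", "PX", "EXAT", "PXAT"
    """
    if not fields:
        raise ValueError("at least one field/value pair is required")
    if key_cond not in (None, "NX", "XX"):
        raise ValueError("key_cond must be NX or XX")
    if field_cond not in (None, "FNX", "FXX"):
        raise ValueError("field_cond must be FNX or FXX")

    args = ["HSETEX", key]
    if key_cond:
        args.append(key_cond)
    if field_cond:
        args.append(field_cond)

    if expire == "KEEPTTL":
        args.append("KEEPTTL")
    elif expire is not None:
        unit, amount = expire
        if unit not in ("EX", "PX", "EXAT", "PXAT"):
            raise ValueError("unknown expiration unit: %r" % (unit,))
        args += [unit, str(amount)]

    # numfields counts fields, not the total number of arguments that follow.
    args += ["FIELDS", str(len(fields))]
    for field, value in fields.items():
        args += [field, value]
    return args
```

For example, `build_hsetex("myhash", {"f1": "v1"}, field_cond="FNX", expire=("EX", 60))` yields the arguments for `HSETEX myhash FNX EX 60 FIELDS 1 f1 v1`.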

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Fields are automatically deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field with the `HPERSIST` command.
Note that calling `HEXPIRE`/`HPEXPIRE` with a zero TTL or a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.
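
The four conditions above reduce to a small per-field predicate. The sketch below is an illustration only: the function name is hypothetical, and the handling of `GT`/`LT` for a field that currently has no expiration (treated here as an infinite TTL, mirroring the documented behaviour of the key-level `EXPIRE` command) is an assumption, not something this description specifies.

```python
from typing import Optional


def should_set_expiry(option: Optional[str],
                      current: Optional[int],
                      new: int) -> bool:
    """Decide whether a field's expiration should be updated.

    option  -- None, "NX", "XX", "GT" or "LT"
    current -- the field's current absolute expiry, or None if it has none
    new     -- the requested absolute expiry (same unit as current)
    """
    if option is None:
        return True
    if option == "NX":
        return current is None
    if option == "XX":
        return current is not None
    if option == "GT":
        # Assumption: no expiry behaves like an infinite TTL, so nothing
        # can be greater than it.
        return current is not None and new > current
    if option == "LT":
        # Assumption: any finite expiry is less than "never expires".
        return current is None or new < current
    raise ValueError("unknown option: %r" % (option,))
```

The predicate is evaluated independently for each field named in the command, which is why one call can update some fields and skip others.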

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET`, but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of the specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.
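
The four introspection commands differ only in unit and in whether they report a remaining or an absolute time. The sketch below shows that relationship under stated assumptions: the helper names are hypothetical, the `-2` reply for a missing field matches the `HTTL` example in this description, while the `-1` reply for a field without expiration and the round-to-nearest-second behaviour are borrowed from the key-level `TTL` command and may not match the actual implementation.

```python
from typing import Optional

NO_FIELD = -2   # field (or key) does not exist
NO_EXPIRY = -1  # assumption: field exists but has no expiration


def ttl_reply(field_exists: bool, expiry_ms: Optional[int],
              now_ms: int, in_ms: bool = False) -> int:
    """Model of an HTTL (in_ms=False) or HPTTL (in_ms=True) per-field reply."""
    if not field_exists:
        return NO_FIELD
    if expiry_ms is None:
        return NO_EXPIRY
    remaining = max(expiry_ms - now_ms, 0)
    # Assumption: seconds are rounded to the nearest value, as TTL does.
    return remaining if in_ms else (remaining + 500) // 1000


def expiretime_reply(field_exists: bool, expiry_ms: Optional[int],
                     in_ms: bool = False) -> int:
    """Model of an HEXPIRETIME / HPEXPIRETIME per-field reply."""
    if not field_exists:
        return NO_FIELD
    if expiry_ms is None:
        return NO_EXPIRY
    return expiry_ms if in_ms else expiry_ms // 1000
```

Storing a single absolute millisecond timestamp per field is enough to answer all four commands; only the conversion at reply time differs.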

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the `hexpired` event.
For example, given the following use case:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```
Regarding the keyspace notifications, Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current proposal, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
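
The event choice illustrated above can be summarised as a small decision function. This is a simplified sketch with hypothetical names: it models only a single field whose expiration is being changed, and it ignores details such as whether the field had an expiration before a persist.

```python
from typing import List, Optional


def events_for_expiry_update(new_expiry_ms: Optional[int], now_ms: int,
                             fields_left_after: int) -> List[str]:
    """Return the keyspace events for one field-expiration update.

    new_expiry_ms     -- new absolute expiry, or None to remove it
    fields_left_after -- number of fields remaining in the hash afterwards
    """
    if new_expiry_ms is None:
        return ["hpersist"]
    if new_expiry_ms <= now_ms:
        # Zero TTL or a time in the past: the field is deleted immediately
        # and reported as expired (where Redis reports hdel).
        events = ["hexpired"]
        if fields_left_after == 0:
            events.append("del")  # the hash became empty and was removed
        return events
    return ["hexpire"]
```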
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.
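
The rewriting rule can be sketched as follows. This is an illustration under assumptions, not the actual propagation code: the function names are hypothetical, only the expiration side of a command is modelled (for `HSETEX`, the written values would also have to be propagated), and the relative options are first converted to an absolute millisecond timestamp so that replicas and AOF replay do not depend on their own clocks at apply time.

```python
from typing import List, Optional


def to_absolute_ms(unit: str, amount: int, now_ms: int) -> int:
    """Convert an EX/PX/EXAT/PXAT argument to an absolute ms timestamp."""
    if unit == "EX":
        return now_ms + amount * 1000
    if unit == "PX":
        return now_ms + amount
    if unit == "EXAT":
        return amount * 1000
    if unit == "PXAT":
        return amount
    raise ValueError("unknown unit: %r" % (unit,))


def rewrite_for_propagation(key: str, fields: List[str],
                            expiry_ms: Optional[int],
                            now_ms: int) -> List[str]:
    """Pick the command that is propagated for an expiration change.

    expiry_ms is the new absolute expiry, or None when it is removed.
    """
    numfields = str(len(fields))
    if expiry_ms is None:
        return ["HPERSIST", key, "FIELDS", numfields, *fields]
    if expiry_ms <= now_ms:
        # Already expired: propagate an explicit deletion.
        return ["HDEL", key, *fields]
    return ["HPEXPIREAT", key, str(expiry_ms), "FIELDS", numfields, *fields]
```

For instance, an `HGETEX myhash EX 10 FIELDS 1 f1` executed at time `now_ms` would, in this model, be propagated as `HPEXPIREAT myhash <now_ms + 10000> FIELDS 1 f1`.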

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |
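
The "Diff %" columns in the table are consistent with a relative change against the standard build, `(HFE - Standard) / Standard * 100`. The short sketch below recomputes a few cells from the table; the function name is hypothetical.

```python
def diff_percent(standard: float, hfe: float) -> float:
    """Relative change of the HFE build versus the standard build, in %."""
    return round((hfe - standard) / standard * 100, 2)


# Rows taken from the table above.
hget_qps = diff_percent(137988.12, 138484.97)    # One Large Hash Table, HGET
hgetall_qps = diff_percent(5885.81, 5633.80)     # 1000 fields, HGETALL QPS
hgetall_latency = diff_percent(2.690, 2.841)     # 1000 fields, HGETALL latency
```

A positive QPS diff and a negative latency diff both mean the HFE build was faster; the only row outside roughly ±3% is `HGETALL` on hashes with 1000 fields.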

- [ ] Consider extending HSETEX with extra arguments (NX/XX) so that it is possible to prevent adding, setting, or mutating fields of a non-existent hash.
- [ ] Avoid loading expired fields when a non-preamble RDB is being loaded on the primary. This is an optimization to avoid loading unnecessary (already expired) fields. It would also require us to propagate the HDEL to the replicas in the case of RDBFLAGS_FEED_REPL. Note that it might require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject; 2/ consider the case of restore, check_rdb, etc. For this reason I would like to avoid this optimization for the first drop.
ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](#5).
[The third PR](#4) introduces the active expiration and defragmentation jobs.

For more high-level design details, you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue improving the active expiration job and potentially consider introducing lazy-expiration support later on.
3. Although the commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding DENYOOM to these commands (HSETEX is an exception) in order to be better aligned with high-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields that exist in the hash object (whether actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field that has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. a negative count); 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases an empty response may be returned to the caller, even if the hash contains fields that are persistent or not yet expired.
5. For the case where a field is provided with a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high-level keys are handled, we will emit the `hexpired` keyspace event for that case (instead of `hdel`).
6. We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would be needed.

For example, for the zero-expiration case in item 5:
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.


Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid ranshid closed this Aug 5, 2025
@ranshid
Owner Author

ranshid commented Aug 5, 2025

This was manually cherry-picked into valkey-io#2089

ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](#5)
[The third PR](#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, write, and update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry
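
The embedded layout described above can be sketched in a few lines of C. This is a simplified illustration and not the code in `entry.c`: it assumes a hypothetical one-byte header in place of the real SDS header, stores no value, and the names `toy_entry_create`, `toy_entry_expiry`, and `toy_entry_free` are invented for this example. It only shows how keeping the expiry immediately before the header lets the entry pointer double as a plain C string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the SDS header: a single length byte. */
#define TOY_HDR_SIZE 1

/* Allocate [int64 expiry][1-byte hdr][field bytes + '\0'] and return a
 * pointer to the field bytes, mirroring the "entry pointer" in the diagram. */
char *toy_entry_create(const char *field, int64_t expiry_ms) {
    size_t len = strlen(field);
    assert(len <= UINT8_MAX);
    unsigned char *base = malloc(sizeof(int64_t) + TOY_HDR_SIZE + len + 1);
    if (base == NULL) return NULL;
    memcpy(base, &expiry_ms, sizeof(int64_t));  /* expiry at the aligned start */
    base[sizeof(int64_t)] = (unsigned char)len; /* toy header */
    char *entry = (char *)(base + sizeof(int64_t) + TOY_HDR_SIZE);
    memcpy(entry, field, len + 1);
    return entry;
}

/* Step back over the header to read the expiry stored in front of it. */
int64_t toy_entry_expiry(const char *entry) {
    int64_t expiry_ms;
    memcpy(&expiry_ms, entry - TOY_HDR_SIZE - sizeof(int64_t), sizeof(int64_t));
    return expiry_ms;
}

/* Release the whole allocation, starting from its real base address. */
void toy_entry_free(char *entry) {
    free(entry - TOY_HDR_SIZE - sizeof(int64_t));
}
```

Because the returned pointer addresses the field bytes directly, callers can pass it to `strcmp` or `strlen` unchanged, while expiry-aware code steps backwards from the same pointer.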

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, to avoid breaking client applications that already use hash field TTLs.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if the hash object does exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.
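
As a rough illustration of how the key-level (`NX`/`XX`) and field-level (`FNX`/`FXX`) conditions combine, the check can be written as a single predicate. This is a sketch under the all-or-nothing reading of the option descriptions above; the function name `hsetex_allowed` and its parameters are hypothetical and not taken from the Valkey source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { KEY_ANY, KEY_NX, KEY_XX } key_cond;           /* NX / XX   */
typedef enum { FIELD_ANY, FIELD_FNX, FIELD_FXX } field_cond; /* FNX / FXX */

/* Decide whether HSETEX may write its fields.
 * existing: how many of the numfields given fields are already in the hash. */
bool hsetex_allowed(key_cond kc, field_cond fc, bool hash_exists,
                    size_t existing, size_t numfields) {
    assert(existing <= numfields);
    if (kc == KEY_NX && hash_exists) return false;   /* hash must not exist */
    if (kc == KEY_XX && !hash_exists) return false;  /* hash must exist     */
    if (fc == FIELD_FNX && existing != 0) return false;         /* none may exist */
    if (fc == FIELD_FXX && existing != numfields) return false; /* all must exist */
    return true;
}
```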

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.
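
The per-field conditions above can be sketched as a small predicate. The names below (`expire_cond`, `NO_EXPIRY`, `expiry_update_allowed`) are hypothetical. Treating a field without expiration as an infinite TTL for `GT` and `LT` is an assumption borrowed from how `EXPIRE` treats non-volatile keys; the text above does not specify that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel for "field has no expiration"; a hypothetical choice for this sketch. */
#define NO_EXPIRY (-1LL)

typedef enum { COND_NONE, COND_NX, COND_XX, COND_GT, COND_LT } expire_cond;

/* Decide whether a new absolute expiry (ms) may replace the current one.
 * Assumption: a field without expiry counts as an infinite TTL for GT/LT. */
bool expiry_update_allowed(expire_cond cond, int64_t current_ms, int64_t new_ms) {
    bool has_expiry = (current_ms != NO_EXPIRY);
    switch (cond) {
    case COND_NX: return !has_expiry;
    case COND_XX: return has_expiry;
    case COND_GT: return has_expiry && new_ms > current_ms;
    case COND_LT: return !has_expiry || new_ms < current_ms;
    case COND_NONE: return true;
    }
    return true;
}
```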

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.
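
A minimal sketch of how the per-field reply could be computed from a stored absolute expiry follows. The `-2` (no such field) and `-1` (field without expiration) values follow the convention Redis documents for `HTTL`/`HPTTL`. Rounding to the nearest second and reporting an expired-but-not-yet-reclaimed field as `-2` are assumptions of this sketch, and the names `field_pttl`, `field_ttl`, and `NO_EXPIRY` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel for "field has no expiration"; hypothetical for this sketch. */
#define NO_EXPIRY (-1LL)

/* HPTTL-style reply for one field, in milliseconds:
 * -2 when the field does not exist (or is already past its expiry),
 * -1 when it exists without an expiration. */
int64_t field_pttl(bool exists, int64_t expiry_ms, int64_t now_ms) {
    if (!exists) return -2;
    if (expiry_ms == NO_EXPIRY) return -1;
    if (expiry_ms <= now_ms) return -2; /* assumption: expired reads as missing */
    return expiry_ms - now_ms;
}

/* HTTL-style reply in seconds. Rounding to the nearest second is an assumption. */
int64_t field_ttl(bool exists, int64_t expiry_ms, int64_t now_ms) {
    int64_t pttl = field_pttl(exists, expiry_ms, now_ms);
    return pttl < 0 ? pttl : (pttl + 500) / 1000;
}
```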

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the `hexpired` event.
For example, given the following use case:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```
For keyspace notifications, Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, with our current proposal, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.
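
The rewrite rule above can be sketched as follows. This is an illustrative model, not the propagation code itself: the enum, the function names, and the choice to treat an expiry equal to the current time as already expired are assumptions made for the example. Rewriting to an absolute timestamp means a replica or an AOF replay applies the same deadline regardless of when it processes the command.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { TTL_EX, TTL_PX, TTL_EXAT, TTL_PXAT, TTL_PERSIST } ttl_option;

/* Convert the user-supplied option into an absolute expiry in ms.
 * Only meaningful for the four expiry-setting options. */
int64_t absolute_expiry_ms(ttl_option opt, int64_t arg, int64_t now_ms) {
    switch (opt) {
    case TTL_EX: return now_ms + arg * 1000;
    case TTL_PX: return now_ms + arg;
    case TTL_EXAT: return arg * 1000;
    case TTL_PXAT: return arg;
    default: return -1;
    }
}

/* Name of the command that would be propagated for one field. */
const char *propagated_command(ttl_option opt, int64_t arg, int64_t now_ms) {
    if (opt == TTL_PERSIST) return "HPERSIST";
    int64_t when = absolute_expiry_ms(opt, arg, now_ms);
    /* Assumption: an expiry at or before "now" deletes the field right away. */
    return (when <= now_ms) ? "HDEL" : "HPEXPIREAT";
}
```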

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments (NX/XX) so that it is possible to prevent adding/setting/mutating fields of a non-existent hash.
[ ] Avoid loading expired fields when a non-preamble RDB is being loaded on a primary. This is an optimization to avoid loading fields that are already expired. It would also require us to propagate `HDEL` to the replicas in the case of RDBFLAGS_FEED_REPL. Note that it might require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore, check_rdb, etc.
For this reason I would like to avoid this optimization for the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
ranshid added a commit to valkey-io/valkey that referenced this pull request Aug 5, 2025
Closes #640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](ranshid#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](ranshid#5).
[The third PR](ranshid#4) introduces the active expiration and defragmentation jobs.

For more high-level design details, you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy expiration at this point, in order to reduce complexity, and to rely only on active expiration for memory reclamation. This will require us to keep improving the active expiration job and to potentially consider introducing lazy-expiration support later on.
3. Although the commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding the DENYOOM flag to these commands (HSETEX is an exception) in order to be better aligned with high-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields that exist in the hash object (whether actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field that has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. a negative count) 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even when the hash contains fields that are persistent or not yet expired.
5. When a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high-level keys are handled, we will emit the `hexpired` keyspace event for that case (instead of `hdel`). An example follows this list.
6. We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

Example for decision 5:
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---


Signed-off-by: Ran Shidlansik <[email protected]>
ranshid pushed a commit that referenced this pull request Aug 10, 2025
Test `Instance #5 is still a slave after some time (no failover)` is
supposed to verify that command `CLUSTER FAILOVER` will not promote a
replica without quorum from the primary; later in the file (`Instance 5
is a master after some time`), we verify that `CLUSTER FAILOVER FORCE`
does promote a replica under the same conditions.

There are a couple of issues with the tests:

1. `Instance #5 is still a slave after some time (no failover)` should
verify that instance 5 is a replica (i.e. that there's no failover), but
we call `assert {[s -5 role] eq {master}}`.
2. The reason why the above assert works is that we previously sent
`DEBUG SLEEP 10` to the primary, which pauses the primary for longer
than the 3 seconds configured for `cluster-node-timeout`.
The primary is marked as failed from the perspective of the rest of the
cluster, so quorum can be established and instance 5 is promoted as
primary.

This commit fixes the two by shortening the sleep to less than 3
seconds, and then asserting the role is still replica. Test `Instance #5
is a master after some time` is updated to sleep for a shorter duration
to ensure that `FAILOVER FORCE` succeeds under the exact same
conditions.

### Testing
`./runtest --single unit/cluster/manual-failover --loop --fastfail`

Signed-off-by: Tyler Amano-Smerling <[email protected]>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](ranshid#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](ranshid#5).
[The third PR](ranshid#4) introduces the active expiration and defragmentation jobs.

For more high-level design details, you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy expiration at this point, in order to reduce complexity, and to rely only on active expiration for memory reclamation. This will require us to keep improving the active expiration job and to potentially consider introducing lazy-expiration support later on.
3. Although the commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding the DENYOOM flag to these commands (HSETEX is an exception) in order to be better aligned with high-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields that exist in the hash object (whether actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field that has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. a negative count) 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even when the hash contains fields that are persistent or not yet expired.
5. When a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high-level keys are handled, we will emit the `hexpired` keyspace event for that case (instead of `hdel`). An example follows this list.
6. We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

Example for decision 5:
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field SDS, which lets us use an entry just like any other SDS. We encode the entry layout type
in the field SDS header. The field type SDS_TYPE_5 has no spare bits in which to
encode this, so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

An entry can also have an expiration timestamp, which is the UNIX timestamp at which it expires.
For aligned, fast access, we keep the expiry timestamp immediately before the start of the SDS header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds
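
The embedded-value layout (the first diagram) can be modelled in a few lines. This is a simplified sketch, not the actual `entry.c` code: it assumes a little-endian 8-byte expiry, uses the three-byte `sdshdr8` header (len, alloc, flags) for both field and value, and ignores the layout-type bits kept in the field header. The helper names (`make_entry`, `entry_expiry`, and so on) are hypothetical.

```python
import struct

SDS_TYPE_8 = 1                    # low 3 bits of the flags byte
HDR8 = struct.Struct("<BBB")      # sdshdr8: len, alloc, flags
EXPIRY = struct.Struct("<q")      # assumed: signed 64-bit timestamp


def sds8(data: bytes) -> bytes:
    """Header + content + NUL terminator, as in an sdshdr8 string."""
    return HDR8.pack(len(data), len(data), SDS_TYPE_8) + data + b"\0"


def make_entry(field: bytes, value: bytes, expiry: int):
    """Return (buffer, entry offset). The offset plays the role of the
    entry pointer: it points at the field content, not the buffer start."""
    buf = EXPIRY.pack(expiry) + sds8(field) + sds8(value)
    return buf, EXPIRY.size + HDR8.size


def entry_field(buf: bytes, entry: int) -> bytes:
    field_len = buf[entry - HDR8.size]          # len byte of the header
    return buf[entry:entry + field_len]


def entry_expiry(buf: bytes, entry: int) -> int:
    # The expiry sits immediately before the field's SDS header.
    return EXPIRY.unpack_from(buf, entry - HDR8.size - EXPIRY.size)[0]


def entry_value(buf: bytes, entry: int) -> bytes:
    field_len = buf[entry - HDR8.size]
    value_start = entry + field_len + 1 + HDR8.size   # skip NUL + header
    value_len = buf[value_start - HDR8.size]
    return buf[value_start:value_start + value_len]
```

Because the offset points at the field content, code that only needs the field can treat the entry as a plain string, while expiry-aware code steps backwards past the header.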

The `entry.c/h` API provides methods to:
- Create, read, write, and update the field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, in order to avoid breaking client applications that have already opted to use hash item TTLs.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if the hash object does exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.
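
The option rules above can be sketched as a small argument validator. This is an illustrative model only, not the server's parser: `parse_hsetex` and `_set_once` are hypothetical names, and the assumptions that all options precede `FIELDS` and that repeating an option is an error are mine.

```python
EXPIRY_WITH_ARG = {"EX", "PX", "EXAT", "PXAT"}


def _set_once(opts, slot, value):
    """Reject a second option from the same mutually exclusive group."""
    if opts[slot] is not None:
        raise ValueError(f"conflicting options for {slot}")
    opts[slot] = value


def parse_hsetex(args):
    """Parse the tokens that follow `HSETEX key`."""
    opts = {"key_cond": None, "field_cond": None, "expiry": None, "pairs": []}
    i = 0
    while i < len(args):
        tok = args[i].upper()
        if tok in ("NX", "XX"):
            _set_once(opts, "key_cond", tok)
            i += 1
        elif tok in ("FNX", "FXX"):
            _set_once(opts, "field_cond", tok)
            i += 1
        elif tok in EXPIRY_WITH_ARG:
            if i + 1 >= len(args):
                raise ValueError(f"{tok} requires a value")
            _set_once(opts, "expiry", (tok, int(args[i + 1])))
            i += 2
        elif tok == "KEEPTTL":
            _set_once(opts, "expiry", (tok, None))
            i += 1
        elif tok == "FIELDS":
            if i + 1 >= len(args):
                raise ValueError("FIELDS requires numfields")
            numfields = int(args[i + 1])
            rest = args[i + 2:]
            # numfields counts field/value pairs, so 2 * numfields tokens follow.
            if numfields <= 0 or len(rest) != 2 * numfields:
                raise ValueError("numfields does not match the field/value pairs")
            opts["pairs"] = list(zip(rest[0::2], rest[1::2]))
            return opts
        else:
            raise ValueError(f"unexpected token {args[i]!r}")
    raise ValueError("missing FIELDS clause")
```

For example, `parse_hsetex(["FNX", "EX", "10", "FIELDS", "1", "f1", "v1"])` is accepted, while combining `EX` with `KEEPTTL` raises `ValueError`.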

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
To clear the TTL of a specific field without modifying its value, use `HPERSIST`.
Note that calling `HEXPIRE`/`HPEXPIRE` with a TTL of 0 (or one that otherwise resolves to a time in the past) will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.
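
The four conditions can be expressed as a single predicate. The sketch below is hypothetical (`should_set_expiry` is not a server function) and makes one assumption the text does not state: for `GT` and `LT`, a field without an expiration is compared as if it expired infinitely far in the future, mirroring how key-level `EXPIRE` treats non-volatile keys.

```python
from typing import Optional


def should_set_expiry(flag: Optional[str],
                      current_ms: Optional[int],
                      new_ms: int) -> bool:
    """Decide whether a new expiration may be applied to one field.

    current_ms is None when the field currently has no expiration.
    """
    if flag is None:
        return True
    if flag == "NX":
        return current_ms is None
    if flag == "XX":
        return current_ms is not None
    # Assumption: "no expiration" behaves like an infinite TTL for GT/LT.
    current = float("inf") if current_ms is None else current_ms
    if flag == "GT":
        return new_ms > current
    if flag == "LT":
        return new_ms < current
    raise ValueError(f"unknown flag {flag!r}")
```

Under that assumption, `GT` never applies to a persistent field, while `LT` always does.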

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET`, but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.
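
As a rough model of the reply values, the sketch below computes an `HPTTL`-style result for one field. The `-2` reply for a missing field matches the `HTTL` example shown in this description; the `-1` reply for a field without expiration follows the key-level `TTL` convention and is an assumption here, as is treating an expired but not yet reclaimed field as missing. The function name and the dictionary representation are hypothetical.

```python
from typing import Dict, Optional


def hpttl(expirations: Dict[str, Optional[int]], field: str, now_ms: int) -> int:
    """Remaining time to live in milliseconds for one field.

    expirations maps field name -> absolute expiry in ms, or None when the
    field is persistent.
    """
    if field not in expirations:
        return -2                      # no such field
    expire_at = expirations[field]
    if expire_at is None:
        return -1                      # assumed: field exists, no TTL
    if expire_at <= now_ms:
        return -2                      # assumed: logically expired == missing
    return expire_at - now_ms
```

`HTTL` would report the same quantity converted to seconds; the rounding rule is not stated in this description, so it is left out of the sketch.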

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the `hexpired` event.
For example, given the following use case:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```
Regarding keyspace notifications, Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current proposal, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
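
The event sequence above can be reproduced with a toy model. This is a sketch under stated assumptions, not server code: the hash is a plain dictionary, `set_field_expiry` is a hypothetical helper, and only the events named in the table are modelled.

```python
def set_field_expiry(fields: dict, field: str, expire_at_ms: int, now_ms: int) -> list:
    """Apply an absolute expiration to one field and return the keyspace
    events that would be emitted, in order.

    fields maps field name -> {"value": ..., "expire_at": int or None}.
    """
    events = []
    if field not in fields:
        return events
    if expire_at_ms <= now_ms:
        # Zero or past expiration: the field is removed right away.
        del fields[field]
        events.append("hexpired")      # Redis reports "hdel" here instead
        if not fields:
            events.append("del")       # the hash became empty
    else:
        fields[field]["expire_at"] = expire_at_ms
        events.append("hexpire")
    return events


# EX 0 resolves to "now", so the only field is deleted and the key goes away.
myhash = {"f1": {"value": "v1", "expire_at": None}}
print(set_field_expiry(myhash, "f1", expire_at_ms=1000, now_ms=1000))
```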
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.
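
To illustrate the rewriting idea, here is a minimal sketch for `HGETEX`. It is not the server's propagation code: `rewrite_hgetex` is a hypothetical helper, it assumes every listed field exists, and it assumes relative TTLs are converted to absolute millisecond timestamps using the primary's clock at execution time.

```python
from typing import List, Optional


def rewrite_hgetex(key: str, fields: List[str], option: Optional[str],
                   value: Optional[int], now_ms: int) -> Optional[List[str]]:
    """Return the command to propagate, or None when nothing changes."""
    if option is None:
        return None                                   # plain read
    counted = ["FIELDS", str(len(fields)), *fields]
    if option == "PERSIST":
        return ["HPERSIST", key, *counted]
    if option == "EX":
        when_ms = now_ms + value * 1000
    elif option == "PX":
        when_ms = now_ms + value
    elif option == "EXAT":
        when_ms = value * 1000
    elif option == "PXAT":
        when_ms = value
    else:
        raise ValueError(f"unknown option {option!r}")
    if when_ms <= now_ms:
        # Already expired on the primary: replicas just delete the fields.
        return ["HDEL", key, *fields]
    return ["HPEXPIREAT", key, str(when_ms), *counted]
```

Propagating an absolute timestamp means a replica or an AOF replay reaches the same expiry regardless of when it processes the command.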

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

- [ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
- [ ] Avoid loading expired fields when a non-preamble RDB is being loaded on a primary. This is an optimization to avoid loading fields that are already expired. It would also require us to propagate HDEL to the replicas in the case of RDBFLAGS_FEED_REPL. Note that it might require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject; 2/ consider the cases of restore, check_rdb, etc. For this reason I would like to avoid this optimization for the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
ranshid pushed a commit that referenced this pull request Sep 15, 2025
## Summary
- extend replication wait time in `slave-selection` test

```
*** [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5
```

## Testing
- `./runtest --single unit/cluster/slave-selection`
- `./runtest --single unit/cluster/slave-selection --valgrind`

Signed-off-by: Vitali Arbuzov <[email protected]>
Signed-off-by: Binbin <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit that referenced this pull request Sep 29, 2025
…alkey-io#2612)

With valkey-io#2604 merged, the `Node #10 should eventually replicate node #5`
started passing successfully with valgrind, but I guess we are seeing a
new daily failure from a `New Master down consecutively` test that runs
shortly after.

Signed-off-by: Sarthak Aggarwal <[email protected]>
ranshid pushed a commit that referenced this pull request Sep 30, 2025
…y-io#2257)

**Current state**
During `hashtableScanDefrag`, rehashing is paused to prevent entries
from moving, but the scan callback can still delete entries which
triggers `hashtableShrinkIfNeeded`. For example, the
`expireScanCallback` can delete expired entries.

**Issue**
This can cause the table to be resized and the old memory to be freed
while the scan is still accessing it, resulting in the following memory
access violation:

```
[err]: Sanitizer error: =================================================================
==46774==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000003100 at pc 0x0000004704d3 bp 0x7fffcb062000 sp 0x7fffcb061ff0
READ of size 1 at 0x611000003100 thread T0
    #0 0x4704d2 in isPositionFilled /home/gusakovy/Projects/valkey/src/hashtable.c:422
    #1 0x478b45 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1768
    #2 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #3 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #4 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #5 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #6 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #7 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #8 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #9 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #10 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #11 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)
    #12 0x452e39 in _start (/local/home/gusakovy/Projects/valkey/src/valkey-server+0x452e39)

0x611000003100 is located 0 bytes inside of 256-byte region [0x611000003100,0x611000003200)
freed by thread T0 here:
    #0 0x7f471a34a1e5 in __interceptor_free (/lib64/libasan.so.4+0xd81e5)
    #1 0x4aefbc in zfree_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:400
    #2 0x4aeff5 in valkey_free /home/gusakovy/Projects/valkey/src/zmalloc.c:415
    #3 0x4707d2 in rehashingCompleted /home/gusakovy/Projects/valkey/src/hashtable.c:456
    #4 0x471b5b in resize /home/gusakovy/Projects/valkey/src/hashtable.c:656
    #5 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #6 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #7 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #8 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #9 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #10 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #11 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #12 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #13 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #14 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #15 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #16 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #17 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #18 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #19 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #20 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #21 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #22 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #23 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #24 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

previously allocated by thread T0 here:
    #0 0x7f471a34a753 in __interceptor_calloc (/lib64/libasan.so.4+0xd8753)
    #1 0x4ae48c in ztrycalloc_usable_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:214
    #2 0x4ae757 in valkey_calloc /home/gusakovy/Projects/valkey/src/zmalloc.c:257
    #3 0x4718fc in resize /home/gusakovy/Projects/valkey/src/hashtable.c:645
    #4 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #5 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #6 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #7 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    #8 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    #9 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    #10 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    #11 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    #12 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    #13 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    #14 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #15 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #16 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #17 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #18 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #19 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    #20 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    #21 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    #22 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    #23 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

SUMMARY: AddressSanitizer: heap-use-after-free /home/gusakovy/Projects/valkey/src/hashtable.c:422 in isPositionFilled
Shadow bytes around the buggy address:
  0x0c227fff85d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85f0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8610: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c227fff8620:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8630: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8640: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8650: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8660: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8670: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==46774==ABORTING
```


**Solution**
The suggested solution is to also pause auto shrinking during
`hashtableScanDefrag`. I noticed that there was already a
`hashtablePauseAutoShrink` method and a `pause_auto_shrink` counter, but
the counter wasn't actually checked in `hashtableShrinkIfNeeded`, so I fixed that.
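
The idea behind the fix can be shown with a toy model. This is a sketch, not the `hashtable.c` implementation: `TinyTable`, its shrink threshold, and the catch-up shrink on resume are assumptions made for illustration. Only the `pause_auto_shrink` counter name comes from the description above.

```python
class TinyTable:
    """Toy table that tracks only a capacity and an element count."""

    def __init__(self, capacity: int = 16, used: int = 0):
        self.capacity = capacity
        self.used = used
        self.pause_auto_shrink = 0     # counter, so pauses can nest

    def shrink_if_needed(self) -> bool:
        # The fix: do nothing while a scan has auto shrinking paused.
        if self.pause_auto_shrink > 0:
            return False
        # Assumed threshold: halve once the table is at most 1/8 full.
        if self.capacity > 4 and self.used * 8 <= self.capacity:
            self.capacity //= 2
            return True
        return False

    def delete_one(self) -> None:
        self.used -= 1
        self.shrink_if_needed()

    def scan(self, callback) -> None:
        """Visit every element; the callback is allowed to delete."""
        self.pause_auto_shrink += 1
        try:
            for _ in range(self.used):
                callback(self)
        finally:
            self.pause_auto_shrink -= 1
            self.shrink_if_needed()    # assumed catch-up once the scan ends


# Every callback deletes an element, as an expiry callback might.
table = TinyTable(capacity=16, used=8)
seen = []
table.scan(lambda t: (t.delete_one(), seen.append(t.capacity)))
print(seen, table.capacity)
```

While the scan is running the capacity stays fixed, so the memory being iterated is never replaced underneath it; the deferred shrink happens only after the scan returns.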

**Testing**
I created a simple Tcl test that (most of the time) triggers this
error, but it's a little clunky so I didn't add it as part of the PR:

```
start_server {tags {"expire hashtable defrag"}} {
    test {hashtable scan defrag on expiry} {

        r config set hz 100

        set num_keys 20
        for {set i 0} {$i < $num_keys} {incr i} {
            r set "key_$i" "value_$i"
        }

        for {set j 0} {$j < 50} {incr j} {
            set expire_keys 100
            for {set i 0} {$i < $expire_keys} {incr i} {
                # Short expiry time to ensure they expire quickly
                r psetex "expire_key_${i}_${j}" 100 "expire_value_${i}_${j}"
            }

            # Verify keys are set
            set initial_size [r dbsize]
            assert_equal $initial_size [expr $num_keys + $expire_keys]
            
            after 150
            for {set i 0} {$i < 10} {incr i} {
                r get "expire_key_${i}_${j}"
                after 10
            }
        }

        set remaining_keys [r dbsize]
        assert_equal $remaining_keys $num_keys

        # Verify server is still responsive
        assert_equal [r ping] {PONG}
    } {}
}
```
Compiling with ASAN using `make noopt SANITIZER=address valkey-server`
and running the test reproduces the error above. Applying the fix resolves the
issue.

Signed-off-by: Yakov Gusakov <[email protected]>
ranshid pushed a commit that referenced this pull request Oct 10, 2025
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of the
filter. However, we introduced a memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform
filtering on the combined set, whereas for the remaining filters we only
pick the last filter option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit that referenced this pull request Oct 10, 2025
…lkey-io#2672)

We have relaxed the `cluster-ping-interval` and `cluster-node-timeout`
so that the cluster has enough time to stabilize and propagate changes.

Fixes this occasional test failure when running with valgrind:

    [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
    #10 didn't became slave of #5

Signed-off-by: Sarthak Aggarwal <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
## Summary
- extend replication wait time in `slave-selection` test

```
*** [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5
```

## Testing
- `./runtest --single unit/cluster/slave-selection`
- `./runtest --single unit/cluster/slave-selection --valgrind`

Signed-off-by: Vitali Arbuzov <[email protected]>
Signed-off-by: Binbin <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
…alkey-io#2612)

With valkey-io#2604 merged, the `Node #10 should eventually replicate node #5`
started passing successfully with valgrind, but I guess we are seeing a
new daily failure from a `New Master down consecutively` test that runs
shortly after.

Signed-off-by: Sarthak Aggarwal <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
…lkey-io#2672)

We have relaxed the `cluster-ping-interval` and `cluster-node-timeout`
so that the cluster has enough time to stabilize and propagate changes.

Fixes this occasional test failure when running with valgrind:

    [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
    #10 didn't became slave of #5

Backported to the 9.0 branch in valkey-io#2731.

Signed-off-by: Sarthak Aggarwal <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of the
filter. However, we introduced a memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform
filtering on the combined set, whereas for the remaining filters we only
pick the last filter option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
(cherry picked from commit 155b0bb)
Signed-off-by: cherukum-amazon <[email protected]>